If you prefer to read it as straight text, you can find it here. Also, some time in the near future, most of this information will be merged into the Linux 2.4 Routing HOWTO.
Until then, this is a fairly good run-down of how Linux packet shaping works.
Linux 2.2 Packet Shaping HOWTO
Martijn van Oosterhout, firstname.lastname@example.org
v0.1, 25 Mar 2000

This document aims to help people discover how to configure and use the packet shaping capabilities of the Linux 2.2 kernel.

Table of Contents

1. Introduction
2. Disclaimer
3. Related Documentation.
   3.1. Feedback.
4. Overview of packet shaping.
5. The programs
   5.1. Other requirements
6. Using 'tc'
   6.1. Manipulating qdiscs
   6.2. Manipulating classes
   6.3. Manipulating filters
7. Types of schedulers
8. Types of filters
9. Examples of usage
   9.1. Using the "fw" filter
   9.2. Using the "route" filter
   9.3. Using the "u32" filter
10. Copyright message
11. Acknowledgements

1. Introduction

This is the sum total of what I have discovered about the Linux 2.2 packet shaping code. I wrote it because that knowledge was hard to come by: the amount of documentation is surprisingly small and the examples are not very good. I concede that there are documents that describe how the system works from the programmer's point of view, but none that do so from the user's point of view. Any corrections or suggestions are welcome.

The packet shaping code was mainly written by Alexey Kuznetsov, <email@example.com>. He has an FTP site containing up-to-date versions of the iproute2 software required to manipulate the packet shaping modules.

2. Disclaimer

I do not and cannot know everything there is to know about the Linux network software. Please accept and be warned that this document probably does contain errors. Please read any README files that are included with any of the various pieces of software described in this document for more detailed and accurate information. I will attempt to keep this document as error-free and up-to-date as possible. Versions of software are current as at the time of writing.

In no way do I or the authors of the software in this document offer protection against your own actions.
If you configure this software, even as described in this document, and it causes problems on your network, then you alone must carry the responsibility.

3. Related Documentation.

This document presumes you understand how to build a Linux kernel with the appropriate networking options selected and that you understand how to use the basic network tools such as ifconfig and route. If you do not, then you should read the NET-3-HOWTO <NET-3-HOWTO.html> in conjunction with this document, as it describes these.

For a closer-to-the-kernel look at packet shaping and QoS in general, see the Linux-QoS-HOWTO, available at:
    http://qos.ittc.ukans.edu/howto/index.html

For more information on Netlink sockets, you can go here:
    http://qos.ittc.ukans.edu/netlink/html

For an HTMLised version of the iproute2+tc notes, see here:
    http://snafu.freedom.org/linux2.2/iproute-notes.html

3.1. Feedback.

Please send any comments, updates, or suggestions to me, <firstname.lastname@example.org>. The sooner I get feedback, the sooner I can update and correct this document. If you find any problems with it, please mail me directly, as I can miss information posted to mailing lists.

4. Overview of packet shaping.

Here is a useful comment from the include/linux/pkt_sched.h file in the Linux kernel source:

    /* "Handles"
       ---------

       All the traffic control objects have 32bit identifiers, or "handles".

       They can be considered as opaque numbers from user API viewpoint,
       but actually they always consist of two fields: major and
       minor numbers, which are interpreted by kernel specially,
       that may be used by applications, though not recommended.

       F.e. qdisc handles always have minor number equal to zero,
       classes (or flows) have major equal to parent qdisc major, and
       minor uniquely identifying class inside qdisc.
     */

Handles are written as major:minor. If either part is left out, it is assumed to be 0. In some cases 0 is special and it cannot be used as a major number.
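To make the major:minor notation concrete, here is a small sketch of how a written handle maps onto the 32-bit number the kernel comment talks about. The function name handle_to_u32 is made up for this illustration, and the layout (major in the upper 16 bits, minor in the lower 16) is my reading of the TC_H_MAJ and TC_H_MIN macros in the same header, so treat it as an assumption rather than documented behaviour. Both fields are read as hexadecimal, and a missing field counts as 0.

```shell
#!/bin/sh
# handle_to_u32 is a hypothetical helper for illustration; it is not part of tc.
# It turns a "major:minor" handle (hex fields, missing field = 0) into the
# 32-bit value it stands for, assuming major = upper 16 bits, minor = lower 16.
# No range check is done: fields longer than 4 hex digits are not rejected.
handle_to_u32() {
    case "$1" in
        *:*) major=${1%%:*}; minor=${1#*:} ;;
        *)   echo "handle_to_u32: expected major:minor, got '$1'" >&2
             return 1 ;;
    esac
    # Reject anything that is not a hex digit before doing arithmetic on it.
    case "$major$minor" in
        *[!0-9A-Fa-f]*)
             echo "handle_to_u32: '$1' is not hexadecimal" >&2
             return 1 ;;
    esac
    printf '%u\n' $(( (0x${major:-0} << 16) | 0x${minor:-0} ))
}
```

For example, handle_to_u32 1:1 prints 65537 and handle_to_u32 1: prints 65536, which is why a qdisc handle (minor zero) and the first class inside it differ only in the low bits.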
The numbers are actually hexadecimal, so you can use ABC as your handle. Thus 3AB: and 45:B are both valid handles. Using :n as a handle does not work in all cases.

You can consider the packet shaping code to be a huge table of nodes, with the major numbers (1-FFFE) going down and the minor numbers (the classes, 0-FFFF) going across. A qdisc is assigned to a whole major, and within that each class may have its own settings. There is one of these tables for each device.

One of these qdiscs is attached to the root of the device. All packets that go out of the device start at the zero-th class (column) in this qdisc (row).

Filters are what move packets between the nodes. Each node may have any number of filters, and each filter has a priority. The packet is tested against each of the filters in order, and if one matches, the packet moves to that filter's target. If there are no filters attached to the current node, or all the filters fail, then the packet is queued on that node. Depending on the qdisc selected and the parameters set, the node may send the packet straight away, hold it for later or drop it altogether.

Each class also has a parent. This parent seems to have various meanings. For the CBQ scheduler, it specifies the class it may steal bandwidth from if it exceeds its own. When a class is deleted, all its children are deleted also, even (I believe) if they are referenced by filters.

Note that it is possible to delete all the classes attached to a qdisc. I have not yet worked out how to delete such a qdisc.

Note that not all qdiscs have classes. CBQ (Class Based Queueing) does, and there the child may borrow bandwidth from the parents. TBF (Token Bucket Filter), however, simply exists as a single entity and filters the traffic according to its rules, so only the major is used and no classes can be created.

5. The programs

For manipulating the packet shaping modules you need the program named 'tc', which is part of the iproute2 package.
The current version is always available on Alexey's FTP site mentioned above. As of this writing the latest version is iproute2-2.2.4-now-ss000305.tar.gz. Most distributions come with it packaged, but it is not generally installed automatically. All the testing for this document was done with the 991023-2 version of the Debian package 'iproute'.

For some of the filters you may need other programs to configure other parts of the networking code to set the appropriate flags. For example, for the 'fw' filter you will need the ipchains package, and for the 'route' filter you will need the 'ip' command, which is also part of the iproute2 package. The use of these commands is not covered here, though examples will be given when appropriate.

5.1. Other requirements

You will need to be able to compile your own kernel to create the necessary modules. In the kernel configuration there is a whole menu under the option "QoS and/or fair queueing", most of which you will need. I generally compile all the listed options as modules so I can play with them all at will. Module auto-loading for these options does work, so if your modutils is configured correctly you won't even have to load these modules manually.

Also, under the networking options you WILL need the CONFIG_RTNETLINK (Routing messages) option. It is hidden under the CONFIG_NETLINK (Kernel/User netlink socket) option in the Networking menu. This is true of the 'ip' tool as well.

6. Using 'tc'

'tc' can be a fairly hard program to use. The user-space tool generally does a little bit of syntax checking and then sends the request to the kernel. The kernel sends only a single integer back indicating success or failure. So unless you made a spelling mistake, your errors will generally be of the form:

    RTNETLINK answers: No such file or directory
    RTNETLINK answers: File exists
    RTNETLINK answers: Invalid argument

The first generally means you referenced a handle that does not exist.
The second generally means you tried to add something where the handle was already in use. The last is the catch-all error that generally means "Something went wrong". There is usually no indication of what, and generally only much re-reading of the help and trial-and-error will help you here. Part of the reason for writing this document is to save you such agony.

6.1. Manipulating qdiscs

Qdiscs are added, deleted and modified using the commands beginning with 'tc qdisc'. For example, to add a qdisc with major 1 as the root of device eth1, you would use the following command:

    tc qdisc add dev eth1 handle 1: root cbq bandwidth 100Mbit avpkt 1000 mpu 64
    ------------ -------- --------- ---- --- -----------------------------------
         1          2         3      4    5                   6

1. This part says you want to add a qdisc.
2. The device to add it to. Required.
3. This is the handle (major) you want to give it. If you don't specify one, one will be assigned for you, starting at around 8000 (remember, hex) and going up. The minor number must be zero.
4. This means to attach to the root of the device. Only one qdisc may be the root. Alternatives are "ingress" (for when a packet comes in) and "parent CLASS" (which specifies the parent of this qdisc). Each qdisc can only be the parent of one other class, so the qdiscs form chains hanging off the "root" and "ingress" nodes. This field is required.
5. This field specifies the qdisc to attach to this major. There are many schedulers to choose from.
6. This specifies the parameters used to initialise the :0 class that is automatically created when you create the qdisc. See the help appropriate to that qdisc for more details.

After executing the above command you have a CBQ qdisc on major 1 with a class with handle 1:0 using the CBQ parameters given. This class 1:0 is attached as the root of device eth1. To delete it again you merely have to execute the command:

    tc qdisc del dev eth1 root

Note how you only have to specify the fact that it is the root.
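The add and delete commands above can be wrapped in a small script, sketched here. The run helper and the DRY_RUN and DEV variables are my own inventions for illustration and are not part of tc; by default the script only prints the commands it would run, so it can be tried without root access or a spare interface. The tc arguments themselves are exactly the ones used above.

```shell
#!/bin/sh
# DRY_RUN, DEV and run() are illustrative helpers, not part of tc.
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
DEV=${DEV:-eth1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Attach a CBQ qdisc with handle 1: as the root of $DEV.
add_root_cbq() {
    run tc qdisc add dev "$DEV" handle 1: root cbq bandwidth 100Mbit avpkt 1000 mpu 64
}

# Remove whatever qdisc is the root of $DEV; the handle is not needed.
del_root() {
    run tc qdisc del dev "$DEV" root
}
```

Calling add_root_cbq followed by del_root with DRY_RUN=0 (as root) should leave the device as it was. Running the pair once in dry-run mode first is a cheap way to check the command lines before they reach the kernel and come back as "Invalid argument".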
To delete a non-root qdisc you need only specify the parent. When a qdisc is deleted, all its constituent classes are also deleted. To view all the currently defined qdiscs, use the command:

    tc qdisc show [dev DEVICE]

The device is optional. If omitted, all devices are listed.

6.2. Manipulating classes

Once you have created the qdiscs, you probably want to add classes to them to represent the various types of data you wish to shape. For example, with the above qdisc, to create a class that is restricted to only 2Mbit no matter what, you would use the following command:

    tc class add dev eth1 parent 1: classid 1:1 cbq bandwidth 2Mbit avpkt 1000 prio 1 rate 1Mbit maxburst 10 bounded isolated
    ------------ -------- --------- ----------- --- -------------------------------------------------------------------------
         1          2         3          4       5                                      6

1. This part indicates you wish to add a class.
2. The device to add it to. Required.
3. The parent of this class. This is different from the parent of a qdisc. This mainly indicates the major this class is in, but it has special meanings for some schedulers. For example, for CBQ it is the class that this class may steal bandwidth from if required. This class will also be deleted if its parent is deleted. Required.
4. The classid that you want this new class to have. Since the parent must always have the same major number as the class itself, you are allowed to leave off the major number and just put :n. Required.
5. The scheduling class of this class; it must be the same as the parent qdisc. Required.
6. The parameters to the scheduling algorithm for this class. See the appropriate section for more information.

That command creates a new class 1:1 whose parent is 1:0, using CBQ with the given parameters. However, this class will not be used as it currently stands, because it needs to have packets sent to it.
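As a sketch of how the six parts fit together, the function below builds the class command from a classid and two CBQ values. The names add_cbq_class, run, DRY_RUN and DEV are hypothetical helpers made up for this example. It assumes the class hangs directly off the qdisc, so the parent is taken to be the class's own major with minor zero, and it keeps the avpkt, prio and maxburst values from the command above. By default it only prints the command.

```shell
#!/bin/sh
# DRY_RUN, DEV, run() and add_cbq_class() are illustrative, not part of tc.
DRY_RUN=${DRY_RUN:-1}
DEV=${DEV:-eth1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: add_cbq_class CLASSID BANDWIDTH RATE
# Assumption: the parent is the qdisc itself, i.e. "MAJOR:" of the classid.
add_cbq_class() {
    classid=$1
    bandwidth=$2
    rate=$3
    parent="${classid%%:*}:"
    run tc class add dev "$DEV" parent "$parent" classid "$classid" \
        cbq bandwidth "$bandwidth" avpkt 1000 prio 1 rate "$rate" \
        maxburst 10 bounded isolated
}
```

With the values from the text, add_cbq_class 1:1 2Mbit 1Mbit prints the same command line as shown above, with parent 1: derived from the classid rather than typed by hand.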
Meanwhile, if you want to delete a class, you use the following command:

    tc class del dev eth1 classid 1:1

Only the classid is required to delete a class. You may not delete a class which is the parent of another class; you must delete the child classes first. It is usually faster to delete the root qdisc, since that has the effect of deleting all the sub-classes.

To show all the classes that belong to a particular device, use the following command:

    tc class show dev DEVICE

Don't believe the help when it says the device is optional. If you leave it out, it doesn't work. Something that is quite useful is the -s switch when showing the classes. It lists each class together with the number of packets transmitted or dropped and other information about the current state of that class.

6.3. Manipulating filters

Now that you've set up all your classes, you want the data to actually be sent to the right classes. For this you need filters that match the data you want to shape. One of the simplest is the "fw" filter, which filters on the basis of the mark attached by any part of the firewall. For example, the simplest such rule would be:

    tc filter add dev eth1 protocol ip parent 1: prio 1 handle 1 fw classid 1:1
    ------------- -------- ----------- --------- ------ -------- -- -----------
          1          2          3          4       5       6     7       8

1. Indicates you want to add a filter.
2. To device eth1. Required.
3. The protocol to filter. At least for this filter type it is required.
4. The parent is the class that this filter will be attached to. The filter will also be automatically deleted when the parent is. The parent must exist. This field is required.
5. Means that this filter will be checked before filters of priority greater than 1. The priority is optional and defaults to one.
6. This is the handle. The handle means different things for different types of filters. For the "fw" filter, it is the mark the packet must have been given by the firewall code. The handle is required.
7. Indicates that this is an "fw" type filter. Required.
8. This is a field type that is common to filters. It indicates the class to go to if the packet matches the filter. This field is required. The target need not exist yet.

When you delete filters, only the priority is required, though you will need to specify the parent if it is not the default, which appears to be 1:0.

    tc filter del dev eth1 parent 1:0 prio 5

There is no command to list all the currently installed filters. However, the following command will list all filters attached to a particular node. If the optional bit is omitted, the root is listed.

    tc filter show dev DEVICE [parent CLASSID | root]

Again, the device is not optional in this case.

7. Types of schedulers

[Would list the various types of schedulers available, what the differences are.]

In the kernel source, in the net/sched directory, there are source files named sch_*.c. These files are the source of the schedulers. Each of these files contains a large header comment describing how the scheduler works (though not how to configure it).

8. Types of filters

[List the various types of filters, how to configure them and what they do.]

9. Examples of usage

To demonstrate the various ways you can use the packet shaping code, I will set up a scenario and show various ways of handling it. Basically, through our interface eth1 there is a HostA behind a gateway HostB. The eth1 link is a 100Mbit interface, but we want to limit all packets to that machine to 10Mbit. Here I will show several ways of doing this.

9.1. Using the "fw" filter

The "fw" filter relies on the firewall tagging the packets to be shaped. So, first we will set up the firewall to tag them:

    ipchains -I output -d HostA -m 1

Now all packets to that machine are tagged with the mark 1. Next we build the packet shaping rules to actually shape the packets. First we build a CBQ class that covers the whole device to attach to the root.
Note that the qdisc attached to the root should always cover the whole of the bandwidth of the device, or you will simply lose the leftover bandwidth.

    tc qdisc add dev eth1 handle 1: root cbq bandwidth 100Mbit avpkt 1000 mpu 64

The avpkt represents the average packet size; 1000 is a good estimate. The mpu is the minimum packet size; 64 is usually used here. These are generally good defaults.

Now we have a CBQ class covering all the traffic. Next we need to create the class for the data to that host.

    tc class add dev eth1 parent 1: classid 1:1 cbq bandwidth 10Mbit avpkt 1000 prio 1 rate 10Mbit bounded isolated

The classes in CBQ have many more options. What this command basically does is create a class which is limited to 10Mbit and may not borrow bandwidth from any other class (bounded), nor may it lend bandwidth to other classes (isolated).

Now we just need to indicate that we want the packets that are tagged with the mark 1 to go to class 1:1. This is accomplished with the command:

    tc filter add dev eth1 protocol ip parent 1:0 prio 1 handle 1 fw classid 1:1

This should be fairly self-explanatory: attach to the 1:0 class a filter with priority 1 that sends all packets marked with 1 by the firewall to class 1:1. That's all there is to it! This is (IMHO) the easy way; the other ways are, I think, harder to understand.

9.2. Using the "route" filter

This filter filters based on the results of the routing tables. When a packet that is traversing the classes reaches one that has a "route" filter attached, the filter splits the packets up based on information in the routing table. First, as above, we create the two traffic classes:

    tc qdisc add dev eth1 handle 1: root cbq bandwidth 100Mbit avpkt 1000 mpu 64
    tc class add dev eth1 parent 1: classid 1:1 cbq bandwidth 10Mbit avpkt 1000 prio 1 rate 10Mbit bounded isolated

From here on I am following the example given in the Advanced Linux Networking HOWTO.
    tc filter add dev eth1 parent 1:0 protocol ip prio 100 route

Here we add a route filter onto the parent node 1:0 with priority 100. When a packet reaches this node (which, since it is the root, will happen immediately), the filter will consult the routing table and, if a route matches, will send the packet to the given class and give it a priority of 100. Then, to finally kick it into action, you add the appropriate routing entry:

    ip route add HostA via HostB flow 1:1

(Strangely, though I think I've done everything in the example, this doesn't seem to work for me. I get an error that goes:

    Error: either "to" is duplicate, or "flow" is a garbage.

Someone who knows will have to comment on this.)

9.3. Using the "u32" filter

The "u32" filter is a filter that filters directly on the contents of the packet. Thus it can filter based on source or destination addresses or ports. It can filter based on the TOS and other truly bizarre fields. It does this by taking a specification of the form [offset/mask/value] and applying that to all the packets. Fortunately you can use symbolic names, much as with tcpdump.

To begin with, you create the classes as in the previous examples:

    tc qdisc add dev eth1 handle 1: root cbq bandwidth 100Mbit avpkt 1000 mpu 64
    tc class add dev eth1 parent 1: classid 1:1 cbq bandwidth 10Mbit avpkt 1000 prio 1 rate 10Mbit bounded isolated

Then you just add the "u32" filter to make it work.

    tc filter add dev eth1 parent 1:0 protocol ip prio 1 u32 match ip dst HostA flowid 1:1

That is all there is to it.

10. Copyright message

The Packet-Shaping-HOWTO, a guide to software supporting packet shaping for Linux.

Copyright (c) 2000 Martijn van Oosterhout.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the:

    Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

11. Acknowledgements

[List various people]

END