Packet-Shaping-HOWTO

Most of this information is expected to be merged into the Linux 2.4 Routing HOWTO in the near future.

Until then, this is a fairly good run-down of how Linux packet shaping works.


Linux 2.2 Packet Shaping HOWTO
Martijn van Oosterhout, kleptog@svana.org
v0.1, 25 Mar 2000

  This document aims to help people discover how to configure and use the
  packet shaping capabilities of the Linux 2.2 kernel.

Table of Contents

  1.    Introduction
  2.    Disclaimer
  3.    Related Documentation.
  3.1.    Feedback.
  4.    Overview of packet shaping.
  5.    The programs
  5.1.    Other requirements
  6.    Using 'tc'
  6.1.    Manipulating qdiscs
  6.2.    Manipulating classes
  6.3.    Manipulating filters
  7.    Types of schedulers
  8.    Types of filters
  9.    Examples of usage
  9.1.    Using the "fw" filter
  9.2.    Using the "route" filter
  9.3.    Using the "u32" filter
  10.   Copyright message
  11.   Acknowledgements

1. Introduction

  This is the sum total of my discovered knowledge about the Linux 2.2 packet
  shaping code. I wrote it because this knowledge was hard to come by: the
  amount of documentation is surprisingly small and the examples are not very
  good. I concede that there are documents that describe how the system works
  from the programming point of view, but none that do so from the user's
  point of view. Any corrections or suggestions are welcome.

  The packet shaping code was mainly written by Alexey Kuznetsov,
  <kuznet@ms2.inr.ac.ru>. He has an FTP site containing up-to-date versions of
  the iproute2 software required to manipulate the packet shaping modules.

2.  Disclaimer

  I do not and cannot know everything there is to know about the Linux
  network software. Please accept and be warned that this document
  probably does contain errors. Please read any README files that are
  included with any of the various pieces of software described in this
  document for more detailed and accurate information. I will attempt to
  keep this document as error-free and up-to-date as possible. Versions
  of software are current as at time of writing.

  In no way do I or the authors of the software in this document offer
  protection against your own actions. If you configure this software,
  even as described in this document, and it causes problems on your
  network, then you alone must carry the responsibility.

3.  Related Documentation.

  This document presumes you understand how to build a Linux kernel with
  the appropriate networking options selected and that you understand
  how to use the basic network tools such as ifconfig and route.  If you
  do not, then you should read the NET-3-HOWTO <NET-3-HOWTO.html> in
  conjunction with this document as it describes these.

  For a closer-to-the-kernel look at packet shaping and QoS in general, see
  the Linux-QoS-HOWTO available at:
  http://qos.ittc.ukans.edu/howto/index.html

  For more information on netlink sockets you can go here:
  http://qos.ittc.ukans.edu/netlink/html

  For an HTMLised version of the iproute2+tc notes, see here:
  http://snafu.freedom.org/linux2.2/iproute-notes.html

3.1.  Feedback.

  Please send any comments, updates, or suggestions to me,
  <kleptog@svana.org>. The sooner I get feedback, the sooner I can
  update and correct this document. If you find any problems with it, please
  mail me directly as I can miss info posted to mailing lists.

4.  Overview of packet shaping.

  Here is a useful comment from the include/linux/pkt_sched.h file in the
  Linux kernel source:

  /* "Handles"
     ---------

      All the traffic control objects have 32bit identifiers, or "handles".

      They can be considered as opaque numbers from user API viewpoint,
      but actually they always consist of two fields: major and
      minor numbers, which are interpreted by kernel specially,
      that may be used by applications, though not recommended.

      F.e. qdisc handles always have minor number equal to zero,
      classes (or flows) have major equal to parent qdisc major, and
      minor uniquely identifying class inside qdisc.
   */

  Handles are written as major:minor. If either part is left out it is assumed
  to be 0. In some cases 0 is special and cannot be used as a major number.
  The numbers are actually hexadecimal, so you can use ABC as your handle. Thus
  3AB: and 45:B are both valid handles. Using :n as a handle does not work in
  all cases.
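
  To make the major:minor layout concrete, here is a small sketch that packs
  a handle into the 32-bit value described in the kernel comment (major in
  the upper 16 bits, minor in the lower 16). The function name handle_to_u32
  is made up for this example and is not part of iproute2; it also insists on
  a colon, which 'tc' itself does not always require.

```shell
#!/bin/sh
# Convert a "major:minor" handle (both fields hexadecimal) into the 32-bit
# identifier: major in the upper 16 bits, minor in the lower 16 bits.
# handle_to_u32 is a hypothetical helper, not an iproute2 command.
handle_to_u32() {
    case $1 in
        *:*) ;;
        *) echo "handle must contain a colon: $1" >&2; return 1 ;;
    esac
    major=${1%%:*}
    minor=${1#*:}
    # A field that was left out is treated as 0.
    printf '0x%08X\n' $(( (0x${major:-0} << 16) | 0x${minor:-0} ))
}

handle_to_u32 1:1     # prints 0x00010001
handle_to_u32 3AB:    # prints 0x03AB0000
handle_to_u32 45:B    # prints 0x0045000B
```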

  You can consider the packet shaping code to be a huge array of filters, with
  the major numbers (from 1-FFFE) going down and the minor numbers (the
  classes, 0-FFFF) going across. A qdisc is assigned to a whole major and
  within that each class may have its own settings. There is one of these
  tables for each device.

  One of these qdiscs is attached to the root of the device. All packets that
  go out of the device start at the zero-th class (column) in this qdisc
  (row). Filters move packets between nodes. Each node may have any number of
  filters, each with a priority. The packet is tested against each of the
  filters in priority order and, if one matches, it moves to that filter's
  target.

  If there are no filters attached to the current node, or none of them
  match, the packet is queued on that node. Depending on the qdisc selected
  and the parameters set, the node may send the packet straight away, queue it
  for later or drop it altogether.

  Each class also has a parent. This parent seems to have various meanings.
  For the CBQ scheduler, it specifies the class it may steal bandwidth from if
  it exceeds its own. When a class is deleted, all its children are deleted
  also, even (I believe) if they are referenced by filters. Note that it is
  possible to delete all the classes attached to a qdisc. I have not yet
  worked out how to delete such a qdisc.

  Note that not all qdiscs have classes. CBQ (Class Based Queueing) does,
  and there a child may borrow bandwidth from its parent. TBF (Token Bucket
  Filter), however, has no classes: it simply filters the traffic according
  to its rules. So only the major is used and no classes can be created.

5. The programs

  For manipulating the packet shaping modules you need the program named 'tc',
  which is part of the iproute2 package. The current version is always
  available on Alexey's FTP site mentioned above. As of this writing the latest
  version is iproute2-2.2.4-now-ss000305.tar.gz. Most distributions come with
  it packaged but it is not generally installed automatically. All the testing
  for this document was done with the 991023-2 version of the Debian package
  'iproute'.

  For using some of the filters you may need other programs to configure
  other parts of the networking code to set the appropriate flags. For
  example, for the 'fw' filter you will need the ipchains package and for the
  'route' filter you will need the 'ip' command, which is also part of the
  iproute2 package. The use of these commands is not covered here, though
  examples will be given when appropriate.

5.1. Other requirements

  You will need to be able to compile your own kernel to create the necessary
  modules. In the kernel there is a whole menu under the option 
  "QoS and/or fair queueing" which you will need most of. I generally compile
  all the listed modules as modules so I can play with them all at will.

  Module auto-loading for these options does work, so if your modutils is
  configured correctly you won't even have to load these modules manually.

  Also, under the networking options you WILL need the CONFIG_RTNETLINK
  (Routing messages) option. It is hidden under the CONFIG_NETLINK
  (Kernel/User netlink socket) option in the Networking menu. The 'ip' tool
  needs this option as well.
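
  One way to check these options before rebuilding is to grep the kernel
  .config file. The sketch below assumes CONFIG_NET_SCHED is the symbol
  behind the "QoS and/or fair queueing" menu entry and that the .config path
  is passed in by you. The function name check_shaping_config is invented
  for this example.

```shell
#!/bin/sh
# Report which of the kernel options this document relies on are enabled in
# a given .config file. check_shaping_config is a hypothetical helper.
check_shaping_config() {
    config=$1
    missing=0
    for opt in CONFIG_NETLINK CONFIG_RTNETLINK CONFIG_NET_SCHED; do
        # An enabled option appears as OPTION=y (built in) or OPTION=m (module).
        if grep -q "^${opt}=[ym]" "$config"; then
            echo "$opt: enabled"
        else
            echo "$opt: MISSING"
            missing=1
        fi
    done
    # Non-zero exit status if anything was missing.
    return "$missing"
}

# Example call (the path is an assumption; use your own kernel tree):
#   check_shaping_config /usr/src/linux/.config
```

  The individual schedulers and filters have their own options under the same
  menu, which this sketch does not check.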

6. Using 'tc'

  'tc' can be a fairly hard program to use. The user-space program generally
  does a little bit of syntax checking and then sends the request to the
  kernel. The kernel sends back only a single integer indicating success or
  failure. So unless you made a spelling mistake, your errors will generally
  be of the form:

  RTNETLINK answers: No such file or directory
  RTNETLINK answers: File exists
  RTNETLINK answers: Invalid argument

  The first generally means you referenced a handle that does not exist. The
  second generally means you tried to add something where the handle was
  already in use. The last is the catch-all error that generally means
  "Something went wrong". There is usually no indication of what, and generally
  only much re-reading of the help and trial and error will help you here.
  Part of the reason for writing this document is to save you such agony.
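
  If you drive 'tc' from scripts, it can help to turn these three replies
  into a hint. The following is only a sketch: explain_tc_error is a made-up
  helper that looks at the message text and repeats the rules of thumb given
  above, nothing more.

```shell
#!/bin/sh
# Map the three common RTNETLINK replies to their likely causes.
# explain_tc_error is a hypothetical helper; it only inspects the message.
explain_tc_error() {
    case $1 in
        *"No such file or directory"*)
            echo "a handle you referenced does not exist" ;;
        *"File exists"*)
            echo "the handle you tried to add is already in use" ;;
        *"Invalid argument"*)
            echo "something went wrong; re-check the parameters" ;;
        *)
            echo "unrecognised message" ;;
    esac
}

explain_tc_error "RTNETLINK answers: File exists"
```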

6.1. Manipulating qdiscs

  Qdiscs are added, deleted and modified using the commands beginning with 'tc
  qdisc'. For example, to add a qdisc with major 1 as the root of device eth1,
  you would use the following command:

  tc qdisc add dev eth1 handle 1: root cbq bandwidth 100Mbit avpkt 1000 mpu 64
  ------------ -------- --------- ---- --- -----------------------------------
       1          2         3       4   5                   6

  1. This part says you want to add a qdisc

  2. The device to add it to. Required.

  3. This is the handle (major) you want to give it. If you don't specify, one
     will be assigned for you, starting at around 8000 (remember, hex) and
     going up.  The minor number must be zero.

  4. This means to attach to the root of the device. Only one qdisc may be the
     root. Alternatives are "ingress" (for when a packet comes in) and "parent
     CLASS" (which is specifying the parent of this qdisc). Each qdisc can
     only be the parent of one other class, so the qdiscs form chains hanging
     off the "root" and "ingress" nodes. This field is required.

  5. This field specifies the qdisc to attach to this major.
     There are many choices of schedulers to choose from.

  6. This specifies the parameters used to initialise the :0 class that is
     automatically created when you create the qdisc. See the help appropriate
     to that qdisc for more details.

  After executing the above command you have a CBQ qdisc on major 1 with a
  class with handle 1:0 using the CBQ parameters given. This class 1:0 is
  attached as the root of device eth1. To delete it again you merely have to
  execute the command:

  tc qdisc del dev eth1 root

  Note how you only have to specify the fact that it is the root class. To
  delete a non-root qdisc you need only specify the parent. When a qdisc is
  deleted, all its constituent classes are also deleted.

  To view all the currently defined qdiscs, use the command:

  tc qdisc show [dev DEVICE]

  The device is optional. If omitted, all devices are listed.
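
  The three qdisc commands above can be tried out safely with a dry-run
  wrapper. In this sketch the names run and qdisc_round_trip, and the DRY_RUN
  variable, are inventions for the example; by default the commands are only
  printed, and with DRY_RUN=0 they are executed (which needs root).

```shell
#!/bin/sh
# Dry-run wrapper: print each command unless DRY_RUN=0 is set, in which
# case the command is executed for real.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Add a root CBQ qdisc on the given device, list the qdiscs, then delete it.
# qdisc_round_trip is a hypothetical name.
qdisc_round_trip() {
    dev=$1
    run tc qdisc add dev "$dev" handle 1: root cbq bandwidth 100Mbit avpkt 1000 mpu 64
    run tc qdisc show dev "$dev"
    run tc qdisc del dev "$dev" root
}

qdisc_round_trip eth1
```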

6.2. Manipulating classes

  Once you have created the qdiscs, you probably want to add classes to your
  qdiscs to represent the various types of data you wish to shape. For
  example, with the above qdisc, to create a class that is restricted to only
  2Mbit no matter what, you would use the following command:

  tc class add dev eth1 parent 1: classid 1:1 cbq bandwidth 2Mbit avpkt 1000 prio 1 rate 2Mbit maxburst 10 bounded isolated
  ------------ -------- --------- ----------- --- -------------------------------------------------------------------------
        1          2        3          4       5                                   6

  1. This part indicates you wish to add a class.

  2. The device to add it to. Required.

  3. The parent of this class. This is different from the parent of a qdisc.
     This mainly indicates the major this class is in but it has special
     meanings for some schedulers. For example, for CBQ it is the class that
     this class may steal bandwidth from if required. This class will also be
     deleted if its parent is deleted. Required.

  4. Is the classid that you want this new class to be. Since the parent must
     always have the same major number as the class itself, you are allowed to
     leave off the major number and just put :n. Required.

  5. The scheduling class of this class, must be the same as the parent qdisc.
     Required.

  6. The parameters to the scheduling algorithm for this class. See the
     appropriate section for more information.

  That command creates a new class 1:1 whose parent is 1:0, using CBQ with
  the given parameters. However, this class will not be used as it currently
  stands because it needs to have packets sent to it. Meanwhile, if you want
  to delete a class, you use the following command:

  tc class del dev eth1 classid 1:1

  Only the classid is required to delete a class. You may not delete a class
  which is the parent of another class; you must delete the child classes
  first. It is usually faster to delete the root qdisc since that has the
  effect of deleting all the sub-classes.

  To show all the classes that belong to a particular device, use the
  following command:

  tc class show dev DEVICE

  Don't believe the help when it says the device is optional. If you leave it
  out it doesn't work. Something that is quite useful is the -s switch when
  showing the classes. It lists each class together with the number of packets
  transmitted or dropped and other information about the current state of that
  class.
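
  For reference, the -s switch goes directly after 'tc'. The sketch below
  wraps that call in a made-up helper, class_stats, behind the same kind of
  dry-run wrapper (run and DRY_RUN are also names invented here), so by
  default it only prints the command it would execute.

```shell
#!/bin/sh
# Dry-run wrapper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List the classes on a device together with their statistics.
# class_stats is a hypothetical helper; the device is always given
# explicitly because leaving it out does not work.
class_stats() {
    run tc -s class show dev "$1"
}

class_stats eth1
```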

6.3. Manipulating filters

  Now that you've set up all your classes, you want the data to actually be
  sent to the right classes. For this you need filters that match the data you
  want to shape. One of the simplest is the "fw" filter, which filters on the
  basis of the mark attached by any part of the firewall. For example, the
  simplest such rule would be:

  tc filter add dev eth1 protocol ip parent 1: prio 1 handle 1 fw classid 1:1
  ------------- -------- ----------- --------- ------ -------- -- -----------
        1          2          3          4        5       6     7      8

  1. Indicates you want to add a filter

  2. To device eth1. Required.

  3. The protocol to filter. At least for this filter type it is required.

  4. The parent is the class that this filter will be attached to. The filter
     will also be automatically deleted when the parent is. The parent must
     exist. This field is required.

  5. Means that this filter will be checked before filters of priority greater
     than 1. The priority is optional and defaults to one.

  6. This is the handle. The handle means different things for different types
     of filters. For the "fw" filter, it is the mark the packet must have
     gotten from the firewall code. The handle is required.

  7. Indicates that this is an "fw" type filter. Required.

  8. This is a field type that is common to filters. It indicates the class to
     go to if the packet matches the filter. This field is required. The
     target need not exist yet.

  When you delete a filter, only the priority is required, though you will
  need to specify the parent if it is not the default, which appears to be 1:0.

  tc filter del dev eth1 parent 1:0 prio 5

  There is no command to list all the currently installed filters. However,
  the following command will list all filters attached to a particular node.
  If the optional bit is omitted, the root is listed.

  tc filter show dev DEVICE [parent CLASSID | root]

  Again, the device is not optional in this case.
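
  Since filters can only be listed one node at a time, one workaround is to
  generate a 'tc filter show' command for every class on the device. The
  sketch below assumes that each line of 'tc class show' output starts with
  "class", followed by the scheduler name and then the classid; check this
  against your own output before relying on it. The name filter_show_commands
  and the sample input lines are made up for the example.

```shell
#!/bin/sh
# Read 'tc class show dev DEVICE' output on stdin and print one
# 'tc filter show' command for the root plus one per class found.
# Assumption: class lines look like "class cbq 1:1 parent 1: ...",
# so the classid is the third field.
filter_show_commands() {
    dev=$1
    echo "tc filter show dev $dev root"
    awk -v dev="$dev" '$1 == "class" { print "tc filter show dev " dev " parent " $3 }'
}

# Demonstration with abbreviated, hypothetical class lines. For real use
# (needs root): tc class show dev eth1 | filter_show_commands eth1 | sh
printf 'class cbq 1: root\nclass cbq 1:1 parent 1:\n' | filter_show_commands eth1
```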

7. Types of schedulers

  [Would list the various types of schedulers available, what the differences are.]

  In the kernel source, the net/sched directory contains source files named
  sch_*.c. These files are the source to the schedulers. Each of them contains
  a large header comment describing how the scheduler works (though not how
  to configure it).

8. Types of filters

  [List the various types of filters, how to configure them and what they do.]

9. Examples of usage

  To demonstrate the various ways you can use the packet shaping code, I will
  set up a scenario and show various ways of handling it. Basically, through
  our interface eth1 there is a HostA behind a gateway HostB. The eth1 link is
  a 100Mbit interface but we want to limit all packets to that machine to
  10Mbit. Here I will show a few ways of doing this.

9.1. Using the "fw" filter

  The "fw" filter relies on the firewall tagging the packets to be shaped. So,
  first we will set up the firewall to tag them:

  ipchains -I output -d HostA -m 1

  Now all packets to that machine are tagged with the mark 1. Next we build the
  packet shaping rules to actually shape the packets. First we build a CBQ
  class that covers the whole device to attach to the root. Note that the
  qdisc attached to the root should always cover the whole of the bandwidth of
  the device, or you will simply lose the leftover bandwidth.

  tc qdisc add dev eth1 handle 1: root cbq bandwidth 100Mbit avpkt 1000 mpu 64

  The avpkt represents the average packet size. 1000 is a good estimate. The
  mpu is the minimum packet size. 64 is usually used here. These are generally
  good defaults.

  Now we have a CBQ class covering all the traffic. Now we need to create the
  class for the data to that host.

  tc class add dev eth1 parent 1: classid 1:1 cbq bandwidth 10Mbit avpkt 1000 prio 1 rate 10Mbit bounded isolated

  The classes in CBQ have many more options. What this command basically does
  is create a class which is limited to 10Mbit and may not borrow bandwidth
  from any other class (bounded), nor may it lend bandwidth to other classes
  (isolated).

  Now we just need to indicate that we want the packets that are tagged with
  the mark 1 to go to class 1:1. This is accomplished with the command:

  tc filter add dev eth1 protocol ip parent 1:0 prio 1 handle 1 fw classid 1:1

  This should be fairly self-explanatory. Attach to the 1:0 class a filter
  with priority 1 to send all packets marked with 1 by the firewall to
  class 1:1.

  That's all there is to it! This is the (IMHO) easy way, the other ways are
  I think harder to understand.
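
  The four commands of this example can be collected into one script. This
  is a sketch only: the names run and shape_host_fw and the DRY_RUN variable
  are invented here, and the device, host and rates are parameters you must
  supply. By default nothing is executed; the commands are just printed.

```shell
#!/bin/sh
# Dry-run wrapper: print each command unless DRY_RUN=0 is set, in which
# case it is executed (needs root, ipchains and tc).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: shape_host_fw DEVICE HOST LINK_RATE LIMIT
# Marks packets to HOST with mark 1 and limits them to LIMIT on DEVICE.
# shape_host_fw is a hypothetical name.
shape_host_fw() {
    dev=$1
    host=$2
    link=$3
    limit=$4
    # 1. Tag the packets in the firewall.
    run ipchains -I output -d "$host" -m 1
    # 2. Root qdisc covering the whole link.
    run tc qdisc add dev "$dev" handle 1: root cbq bandwidth "$link" avpkt 1000 mpu 64
    # 3. Bounded, isolated class for the limited traffic.
    run tc class add dev "$dev" parent 1: classid 1:1 cbq bandwidth "$limit" avpkt 1000 prio 1 rate "$limit" bounded isolated
    # 4. Send packets carrying mark 1 to that class.
    run tc filter add dev "$dev" protocol ip parent 1:0 prio 1 handle 1 fw classid 1:1
}

shape_host_fw eth1 HostA 100Mbit 10Mbit
```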

9.2. Using the "route" filter

  This filter filters based on the results of the routing tables. When a
  packet that is traversing through the classes reaches one that is marked
  with the "route" filter, it splits the packets up based on information in
  the routing table. First, as above, we create the two traffic classes:

  tc qdisc add dev eth1 handle 1: root cbq bandwidth 100Mbit avpkt 1000 mpu 64
  tc class add dev eth1 parent 1: classid 1:1 cbq bandwidth 10Mbit avpkt 1000 prio 1 rate 10Mbit bounded isolated

  From here on I am following the example given in the Advanced Linux
  Networking HOWTO.

  tc filter add dev eth1 parent 1:0 protocol ip prio 100 route

  Here we add a route filter onto the parent node 1:0 with priority 100. When
  a packet reaches this node (which, since it is the root, will happen
  immediately) it will consult the routing table and if one matches will send
  it to the given class and give it a priority of 100. Then, to finally kick
  it into action, you add the appropriate routing entry:

  ip route add HostA via HostB flow 1:1

  (Strangely, though I think I've done everything in the example, this doesn't
  seem to work for me. I get an error that goes: 

  Error: either "to" is duplicate, or "flow" is a garbage.

  Someone who knows will have to comment on this.)

9.3. Using the "u32" filter

  The "u32" filter is a filter that filters directly based on the contents of
  the packet. Thus it can filter based on source or destination addresses or
  ports. It can filter based on the TOS and other truly bizarre fields. It
  does this by taking a specification of the form [offset/mask/value] and
  applying that to all the packets. Fortunately you can use symbolic names much
  as with tcpdump.
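
  To show what a symbolic name stands for, the sketch below builds the raw
  selector for a destination address match. The destination address occupies
  bytes 16 to 19 of the IPv4 header, and the raw form is written to 'tc' in
  the order value, mask, offset. The helper name ip_dst_to_u32 is made up,
  and it only handles a single dotted-quad address, not host names or
  prefixes.

```shell
#!/bin/sh
# Turn a dotted-quad IPv4 address into the raw u32 selector that matches it
# as the destination address ("at 16" is the offset of that field in the
# IPv4 header). ip_dst_to_u32 is a hypothetical helper.
# The body runs in a subshell so the IFS change stays local.
ip_dst_to_u32() (
    IFS=.
    set -- $1
    if [ $# -ne 4 ]; then
        echo "expected an address of the form a.b.c.d" >&2
        exit 1
    fi
    printf 'match u32 0x%02x%02x%02x%02x 0xffffffff at 16\n' "$1" "$2" "$3" "$4"
)

ip_dst_to_u32 192.168.1.10   # prints: match u32 0xc0a8010a 0xffffffff at 16
```

  The printed selector can be used in place of "match ip dst ADDRESS" in a
  u32 filter command.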

  To begin with you create the classes as in the previous examples:

  tc qdisc add dev eth1 handle 1: root cbq bandwidth 100Mbit avpkt 1000 mpu 64
  tc class add dev eth1 parent 1: classid 1:1 cbq bandwidth 10Mbit avpkt 1000 prio 1 rate 10Mbit bounded isolated

  Then you just add the "u32" filter to make it work.

  tc filter add dev eth1 parent 1:0 protocol ip prio 1 u32 match ip dst HostA flowid 1:1

  That's all there is to it.

10. Copyright message

  The Packet-Shaping-HOWTO, a guide to software supporting packet shaping
  for Linux.  Copyright (c) 2000 Martijn van Oosterhout.

  This program is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation; either version 2 of the License, or (at
  your option) any later version.

  This program is distributed in the hope that it will be useful, but
  WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program; if not, write to the:

  Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139,
  USA.

11. Acknowledgements

  [List various people]

END


By Martijn van Oosterhout (kleptog (at) svana.org)
Copyright © 2000-2006 - Last modified 29/03/2000