
Re: Linux shaping packet loss


X-IP-MAIL-FROM: gordslater@ieee.org
From: gordon b slater <gordslater@ieee.org>
To: Chris <chris@ghostbusters.co.uk>
In-Reply-To: <20091209010747.GA20524@verge.net.au>
Date: Wed, 09 Dec 2009 05:47:22 +0000
Cc: nanog <nanog@nanog.org>
Reply-To: gordslater@ieee.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org

Apologies to all on handheld devices. If you're not into BSD or Linux TC
operationally, skip this post. Due to my usual rambling narrative style
for "alternative" troubleshooting I was going to mail this directly to
the OP, but I was persuaded AMBJ by a co-conspirator to post it to the
list in full.

@all Googling similar "traffic shaping" problems in the future:

On Wed, 2009-12-09 at 12:07 +1100, Simon Horman wrote:
> but trying to use much
> more than 90% of the link capacity

......though not directly relevant in this case: for lower-speed links
and things like xDSL to the CPE, that 90% must include protocol overheads
(you are getting close to the bone in that last 10%) and _much_ more
affective (<- that's A-ffective) things like the actual modem "sync
speed". It depends how the TC is calculated/applied, of course. Just a
general note for a more CPE-oriented occurrence of this. So kids, if
you're struggling with your IPCOP in a SOHO shop with ADSL+PPPoE, this
means you! (A sketch of one way to handle the overhead follows.)
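For anyone Googling later: newer kernels let tc account for the ATM cell
tax itself via the qdisc `stab` option, so you shape against sync speed
rather than IP rate. A minimal sketch - device, rates and the overhead
value are all assumptions, and the right overhead depends on your
encapsulation:

    # assume PPPoE-over-ATM on ppp0 with 448kbit upstream sync
    tc qdisc add dev ppp0 root handle 1: stab linklayer atm overhead 32 \
        htb default 10
    # ~90% of sync; the stab makes tc count on-wire bytes, cell tax included
    tc class add dev ppp0 parent 1: classid 1:10 htb rate 400kbit ceil 400kbit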


#### Meanwhile, back at our level.......

@all generally: do many of us use Linux TC at small-carrier level? I
know of a lot of BSD boxen out there that handle huge complex flows, but
I suspect the Linux kernel is less popular for this - or am I assuming
wrong? Personally I'd lean to BSD for big stuff and Linux for CPE; am I
out of touch nowadays?

#### Fully back on topic from here on....... 

@Chris - I've not used RED in anger, sorry. Other than a typo in the
config for the affected queue (maybe an extra digit loose somewhere?),
things are definitely going to get complicated.

Is something exceeding a tc bucket mtu occasionally? 
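If you want to watch where drops land per class while you poke at it,
something like this (device name assumed) is usually enough:

    # a "dropped" counter climbing on one class narrows the search a lot
    watch -n 1 'tc -s class show dev eth0'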


Chris <chris@ghostbusters.co.uk> wrote:
>
>My thoughts are that any dropped packets on the parent class is a bad
>thing:

yes, generally speaking, but.....

>
>qdisc htb 1: root r2q 10 default 265 direct_packets_stat 448 ver 3.17
> Sent 4652558768 bytes 5125175 pkt (dropped 819, overlimits 10048800
>requeues 0)
> rate 0bit 0pps backlog 0b 28p requeues 0

... in the above example, that loss rate is extremely low, roughly
0.016% (819 / 5125175 * 100). It may not be a representative sample, but
I just thought I'd check you hadn't dropped a few significant digits in
a %loss calc along the way :)  That level of loss is operationally
insignificant of course, especially for TCP.
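A quick shell sanity check for the same sum, if you don't trust mental
arithmetic:

    echo "scale=4; 819 * 100 / 5125175" | bc
    # -> .0159  (percent)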

As you are, I'm sure, aware, perfect TC through any box is pretty
specialist and usually unique to that placement. Without any graphical
output, queues and the like are extremely difficult to visualize
(mentally) under load (though for smaller boxes the RRD graphs in
pfSENSE are nicely readable - see below).
Because of this I usually try to eliminate ~everything~ else before I
get into qdiscs and the nitty-gritty. As a natural control fr/geek I've
wasted far too many hours stuck in the buckets to no real improvement in
many cases.

Chris <chris@ghostbusters.co.uk> wrote:
> I've isolated it to the egress HTB qdisc
>
good, though read on for a strange tale

You MUST make a distinction between TC dropping the packets and the
interface dropping the packets. I see in your later post a TC qdisc line
showing that tc itself had dropped packets, BUT it ALWAYS pays to check
at the same time (using ifconfig) that no packets are reported as
dropped by the interfaces as well. I've had 2 or 3 occasions where `TC
drops` were actually somehow linked to _interface_ drops and it really
threw me; we never did work out why. The interaction confounded us
totally.

IF the INTERFACES are ALSO dropping in ifconfig, THEN, and ONLY then,
you are into the lowest layer.
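Side by side, the two counters tell you which layer is actually dropping
(interface name assumed):

    # what the queueing layer says it dropped:
    tc -s qdisc show dev eth0
    # versus what the driver/interface says it dropped:
    ifconfig eth0 | grep -i drop
    # (or the iproute2 equivalent:)
    ip -s link show eth0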


So, with that in mind and the sheer complexity of possibilities, here's
how I personally approach difficult BSD/Linux "TC problems". Note that I
have zero experience or inclination towards Cisco TC:

Kick the tyres!
A lot of people mentioned layer 2 link-config problems, but as far as I
can see, no-one has suggested quickly yanking the cables and blowing the
dust off the ends.
Whenever I have to reach for a calculator or pen for a problem, I first
swap out the interconnects to reduce the mental smoke ;)

Next, I check the NICs to see if they're unseated (if applicable), check
CPU (think: rogue process - use top), or even bus utilisation if you
have only 32-bit PCI NICs in a busy box.

Next: does the box do anything else, like Snort/Squid/etc, at the same
time?

To eliminate weirdness and speed up troubleshooting if TC is acting
strange, I'd run tcpdump continually from the very start of my
troubleshooting, dumping into small 10MB-ish files - use the special -C
option ("split at filesize") and the -W option to set about 100 files in
a ring buffer, so that you have a decent history to go back through if
you need it without clogging the filesystem of the box with TB of packet
data :)
(splitting them into 10MB files at the start leads to fast analysis in
the shark, though you could carve up larger files manually I guess - a
sketch follows)
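Something along these lines (interface and path are assumptions):

    # ring buffer: -C rotates at ~10 million bytes per file, -W keeps
    # 100 files (oldest overwritten first), -s 0 captures full packets
    tcpdump -i eth0 -s 0 -C 10 -W 100 -w /var/tmp/shaping.pcap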

That way, if the TC hurts your brain, run the dumps through wireshark's
"expert info" filter while you have a coffee (Analyze > Expert Info, I
think?). It's just in case something external or unusual is splattering
the interfaces into confusion; it will only take a minute or less to run
this analysis on an "affected" dump, as 10MB is very manageable, and you
can select the relevant dumpfile by its last access time. Don't waste
any time viewing them manually, just a glance.
Remember to kill the tcpdumps when you find the problem though,
scrubbing the files if needed for compliance etc.

If you need to run tcpdump for a really long time I'd suggest setting it
up setuid, because it usually needs to run as root. Personally I get
nervous about important perimeter devices left dumping during a
coffeebreak ;)
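Two alternatives to a setuid binary, if your tools are recent enough
(user name and paths are invented; check your tcpdump actually supports
-Z):

    # recent tcpdumps can shed root themselves once the capture is open
    tcpdump -i eth0 -Z pcapuser -C 10 -W 100 -w /var/tmp/shaping.pcap
    # or, on kernels with file capabilities, avoid root/setuid altogether:
    setcap cap_net_raw,cap_net_admin=eip /usr/sbin/tcpdump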

When I'm trying to get my head around flows through "foreign" Linux
boxen I tend to use "iftop" for a couple of minutes or so, just to get a
feel for the connections and throughputs actually travelling through it,
and I sometimes run it over SSH continually on another monitor when
dealing with critical flow boxes that show problems. If you throw a
config "switch" somewhere it's nice to see the effect visually. Be
careful though: it runs as root, so again don't leave it going 24/7,
just while you are fiddling {cough} adjusting. Again, for longterm
watching, try to set it up setuid.
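A typical invocation (interface assumed):

    # -P shows ports, -n/-N skip DNS and service-name lookups
    iftop -i eth0 -P -n -N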

I set up a few iperf flows to stimulate your TC setups, or use netcat,
scp or similar to push some files through to /dev/null at the other end;
use "trickle" to limit the flow rates to realistic or less
operationally-damaging levels during testing. wfm. Adding 9 flows of
about 10% of link capacity each should give tc some thinking work to do
in an already active network; script it all to run for only a few
seconds at a time in pulses, rather than saturating the operational link
for hours on end, or the phone won't stop haha. (A rough sketch follows
this paragraph.)
If your queues are all port-based (depends how you're feeding flows into
the tc mechanism, I suppose), set up "engineering test queues" on high
ports and force iperf to use these high ports while you test inline. If
the box isn't yet in service, this obviously isn't an issue.
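The pulse idea, roughly (host, ports and timings are all made up; iperf2
syntax, and the far end runs one "iperf -s -p PORT" per port):

    #!/bin/sh
    # nine short TCP flows at high "engineering" ports, in pulses:
    # 5 seconds on, 25 seconds off, so the link is never pinned for long
    # (wrap each as: trickle -u 512 iperf ...  to cap a flow at ~512KB/s up)
    while true; do
        for port in 5101 5102 5103 5104 5105 5106 5107 5108 5109; do
            iperf -c 192.0.2.1 -p $port -t 5 &
        done
        wait
        sleep 25
    done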

now,
IF there are NO drops reported by ifconfig or kernel messages, just
drops reported by the TC mechanism, it gets complicated. Only THEN do I
reach for a calculator (and I also print out the relevant man pages!):

But! There is one more rapid technique available you shouldn't ignore -
Swapouts:
---------
TC is hard to get perfect with any vendor, so eliminate the hardware
and configs in one swoop if you can!
If you feel like trying a swapout (not sure if `availability` will allow
in this case), a modern mobo running pfSENSE will allow a quick "have I
done something stupid on the Linux box?" comparison. I suggest pfSENSE
because it has a reputation for fast throughput and {cringe} a "wizard"
in the web GUI, so you can have it set up a set of TC rules rapidly. I'd
run it direct from the liveCD for a quick comparison; give it a minimum
of 256MB RAM and a P4 or higher for gig-ethernet speeds with shaping but
no IPsec. (This is overkill, but you must eliminate doubt in this case.)
Allow 10 seconds to swap the interconnects once it's configured, though -
this could be more than you can allow for downtime? Dunno

Another `swapout` option, but actually a same-box alternative, is
setting up simple TC yourself manually, using "tc" at the shell or
(better) a simple script, instead of %whatever-you-are-using-right-now%
(possibly a flat scripted config file for tc? or maybe some fancy custom
web-thingy?). Flattening the tc config this way for a couple of hours
can give a comparison, though it all depends on the desired
availability/quality and whether good shaping is essential 24/7 on a
saturated link. By "flat" I mean something as dumb as the sketch below.
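(Device, rates and the single port match are invented - just a
known-simple starting point:)

    #!/bin/sh
    # throw away whatever the fancy tool built and start from a known state
    DEV=eth1
    tc qdisc del dev $DEV root 2>/dev/null
    # one root HTB, two classes, one filter - nothing clever to hide bugs in
    tc qdisc add dev $DEV root handle 1: htb default 20
    tc class add dev $DEV parent 1:  classid 1:1  htb rate 90mbit ceil 90mbit
    tc class add dev $DEV parent 1:1 classid 1:10 htb rate 60mbit ceil 90mbit prio 0
    tc class add dev $DEV parent 1:1 classid 1:20 htb rate 30mbit ceil 90mbit prio 1
    # ssh into the priority class; everything else falls into 1:20
    tc filter add dev $DEV parent 1: protocol ip prio 1 u32 \
        match ip dport 22 0xffff flowid 1:10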

Luckily, you hint that your hardware is significantly better than i686,
I think? If we knew more about the actual hardware and the flows through
it, plus a little about the adjacent topology, we could all offer some
hardware-sizing comments in case you're pushing something over its
limit.

Finally, I've seen more than a few examples of people using old P3-era
hardware for heavy duty throughput. It can work well (especially with
PCI-X) but NEVER assume that layer one ends at the RJ45. It goes inside
the case and a significant distance sometimes: all prone to
heat/dust/fluff/broken solder/physical alignment problems. In years gone
by, mis-seated AGP cards would take ethernet links down then up again on
hot days. In these roles, your old leaky PSU and mobo capacitors can
lead you on a merry dance for a l-o-n-g time.
Pained memories :)


regards,
Gord







