[2796] in linux-net channel archive


Re: Tx TCP rates down > 20% - A report.

daemon@ATHENA.MIT.EDU (Avery Pennarun)
Sun May 5 11:55:33 1996

Date: 	Sat, 4 May 1996 12:23:05 -0400 (EDT)
From: Avery Pennarun <apenwarr@foxnet.net>
To: linux-net@vger.rutgers.edu
In-Reply-To: <Pine.LNX.3.91.960502064641.26122B-100000@linux.cs.Helsinki.FI>


On Thu, 2 May 1996, Linus Torvalds wrote:

> There _does_ seem to be some bad effects with the drivers under some
> circumstances, though. Notably, the "tbusy" handling in the ethernet driver
> interface looks like it's pretty broken - it's used for two things: (a)
> serializing the ethernet driver (which was the original reason for it, but is
> unnecessary these days when the network layer makes sure it's all serialized
> anyway) and (b) as a send throttle to tell the network layer that the 
> card is busy.
> 
> The (b) case is the only thing it does any more, and I suspect it is also 
> the thing that makes you see bad performance. The TCP side is much faster 
> in the later 1.3.x kernels, and the network cards can no longer keep up 
> so the throttle is essentially in effect _all_ the time. What you see is 
> probably due to:
> 
>  - TCP layer has a few packets queued up, sends one to the network driver
>  - network driver puts out the packet, sets tbusy
>  - TCP layer sees tbusy, and doesn't send any more
>  - network driver gets a "tx complete interrupt" and does a callback to 
>    net layer with mark_bh(NET_BH), and the cycle starts up again..

I don't know much about how the majority of network drivers are written, or
the best way to fix this, but the above is definitely true for my ARCnet
driver.  Because ARCnet is slower (around 200k/sec max) and has smaller
output buffers than ethernet, the effect is much more noticeable.

Quickly:  ARCnet cards have four buffers of 512 bytes each.  Unlike many
ethernet cards, you don't have the option of dividing them up any way you
want; you can only tell the card to receive into or transmit from one of the
four buffers.  I seem to get the best balance of performance and simplicity
by splitting these up arbitrarily into two receive and two transmit buffers.

My ARCnet driver does what might be the "streamlining" Linus refers to.  It
follows logic more like this (drastically simplified; in reality one packet
from the TCP layer might be broken into three or more ARCnet packets when
using RFC1201 encapsulation):

	- TCP layer sends one to ARCnet driver
	- driver marks tbusy=1
	- choose a TX buffer, copy packet into it
	- begin transmit
	- set internal buffer_busy flag=1
	- set tbusy=0

This means that while one packet is being sent, the kernel can already be
loading another packet into an ARCnet buffer.  The original skeleton.c, at
least at the time, did not suggest doing this, so the ARCnet driver didn't
either until rather late in its lifetime (just before Linux 1.2, I think).

As an example of the difference this makes: before the change I was only
getting around 120k/sec maximum, while now I regularly get around 190k/sec.
That is roughly a 58% improvement.  I expect the effect is most pronounced
on slow cards where copying the packet into the buffer takes a very long
time, such as my 8-bit ARCnet cards.  8-bit NE2000s, for example, might show
similar symptoms that have been blamed on the cards merely being slow.  (And
of course that would be a large portion of the problem :))

High-speed networks like 100baseT generally have busmastering, or at least
basic PCI, so packet-copying times are kept to a minimum.

One thing I would like to point out is that fiddling with the tbusy flag
threw the driver into fits of instability for several months.  You have to
be very, very careful about when the tbusy flag is set and when it isn't.  I
think the "serialization" Linus refers to was brought in around 1.2.8 or
1.2.9, which is when the ARCnet driver magically stabilized itself.

To be fair, the ARCnet driver is considerably more complicated than most
network drivers, partly because of RFC1201-compliant driver-level
fragmentation, but even that was easy until I started fiddling with tbusy.
I was overjoyed when someone else fixed things in general around 1.2.9.
Debugging kernel re-entrancy problems (which is basically what you get - a
tx interrupt might happen while you are in dev_send_packet) is a real pain.

So to summarize: setting tbusy the "right" way is a good idea
performance-wise, but you _will_ screw things up (unless you are a better
programmer than me, and there should be several of those on this list).  I
would not suggest going through all the drivers and just moving tbusy
settings around without testing _very_ thoroughly.

Have fun,

Avery

