[548] in linux-net channel archive

double (un)lock on device queue - skb probs.

daemon@ATHENA.MIT.EDU (Paul Gortmaker)
Tue Jun 20 13:33:24 1995

From: Paul Gortmaker <Paul.Gortmaker@anu.edu.au>
To: linux-net@vger.rutgers.edu
Date: Wed, 21 Jun 1995 01:22:39 +1000 (EST)

While bashing out some tests, I noticed that I was able to generate
a lot of "double lock on device queue" messages. (I also got a
"double unlock on device queue" message once, which is even scarier.)

I figured it was something I had managed to break in some obscure
way with my patches, but when I went back to a clean v1.2.10, it was
still there. I then rebuilt a clean 1.2.8, and it did the same thing.
And sure enough, 1.3.3 suffers from it as well. (gcc-2.5.8 was used
in all cases.)

The test was simply the following, run in a loop:
	 rsh otherhost 'cat linux-1.3.0.tar.gz' > /dev/null

The following is from stock 1.2.10, after stopping the test. Note that
even after the network is quiet, there is still >1MB left in stale skbs.
Also note that the number of "free while locked" events is huge. If you
wait long enough, all the memory gets eaten up and the machine grinds to
a halt. Not good.

Networking buffers in use          : 754
Memory committed to network buffers: 1281318
Network buffers locked by drivers  : 14
Total network buffer allocations   : 735285
Total failed network buffer allocs : 0
Total free while locked events     : 1781

The system is a 486DX33 with 16MB and a 79c970/lance on a 7.15MHz ISA
bus. The box issuing the rsh is a lowly 4MB unit with an NE2k card,
running 1.3.3, and it doesn't complain. Everything drops to zero on the
NE2k box when you stop the test (except the total allocations, of
course), hence I suspect that it is something in the lance driver.

Now here is what (I think) is happening. During an xmit, most drivers
peel the data out of the skb via memcpy() or whatever, and then do
a dev_kfree_skb(skb, FREE_WRITE) before exiting the xmit function.
However, the lance (and the tulip) drivers hold onto the Tx skb until
the interrupt handler receives a Tx-done interrupt. My two guesses are
that either the lance driver is munging its internal skb list under
heavy Tx activity, or something deep in the net code is shuffling the
skbs after the Tx function completes, so that the lance's private Tx
skb list goes out of sync. Either way, this is bad. I spent a while
looking at the code, but nothing jumped off the page at me.
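The two Tx buffer ownership patterns described above can be sketched as a small user-space model. This is a hypothetical illustration, not kernel code and not a diagnosis of the actual lance bug: the buffer pool, `RING_SIZE`, `POOL_SIZE`, `struct sim`, the `cur`/`dirty` ring indices, `run_burst()` and the deliberately missing "ring full" check are all assumptions, chosen to show one way a driver's private Tx list going out of sync could produce double unlocks and buffers that stay locked after the network goes quiet.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical user-space model, NOT kernel code. All names and sizes
   here are illustrative assumptions. */
#define RING_SIZE 4
#define POOL_SIZE 16

struct buf { int in_use, locked; };

struct stats {
    int in_use;   /* cf. "Networking buffers in use" */
    int locked;   /* cf. "Network buffers locked by drivers" */
    int double_lock, double_unlock, double_free;
};

struct sim {
    struct buf pool[POOL_SIZE];
    struct buf *slot[RING_SIZE];  /* driver's private Tx list */
    unsigned cur, dirty;          /* next slot to fill / next to reap */
    struct stats st;
};

static struct buf *buf_alloc(struct sim *m) {
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!m->pool[i].in_use) {
            m->pool[i].in_use = 1;
            m->st.in_use++;
            return &m->pool[i];
        }
    }
    assert(!"buffer pool exhausted");  /* the "grind to a halt" case */
    return 0;
}

static void buf_lock(struct sim *m, struct buf *b) {
    if (b->locked) { m->st.double_lock++; return; }
    b->locked = 1;
    m->st.locked++;
}

static void buf_unlock(struct sim *m, struct buf *b) {
    if (!b->locked) { m->st.double_unlock++; return; }
    b->locked = 0;
    m->st.locked--;
}

static void buf_free(struct sim *m, struct buf *b) {
    if (!b->in_use) { m->st.double_free++; return; }  /* stale pointer */
    b->in_use = 0;
    m->st.in_use--;
}

/* Most drivers: copy the frame out, then free the buffer before the
   xmit function returns, so nothing is held across interrupts. */
static void xmit_copy_and_free(struct sim *m) {
    struct buf *b = buf_alloc(m);
    buf_lock(m, b);
    /* ... memcpy() the frame to the card here ... */
    buf_unlock(m, b);
    buf_free(m, b);
}

/* Hold-until-Tx-done pattern. Deliberately buggy: no "ring full" check,
   so a burst of more than RING_SIZE frames overwrites unreaped slots. */
static void xmit_hold(struct sim *m) {
    struct buf *b = buf_alloc(m);
    buf_lock(m, b);
    m->slot[m->cur++ % RING_SIZE] = b;
}

/* Tx-done interrupt: reap the oldest outstanding slot. */
static void tx_done(struct sim *m) {
    if (m->dirty == m->cur) return;  /* nothing pending */
    struct buf *b = m->slot[m->dirty++ % RING_SIZE];
    buf_unlock(m, b);
    buf_free(m, b);
}

/* Send a burst of n frames, then deliver n Tx-done interrupts. */
struct stats run_burst(int hold, int n) {
    struct sim m;
    memset(&m, 0, sizeof m);
    for (int i = 0; i < n; i++) {
        if (hold) xmit_hold(&m);
        else xmit_copy_and_free(&m);
    }
    for (int i = 0; i < n; i++) tx_done(&m);
    return m.st;
}
```

In this model, `run_burst(0, 6)` leaves nothing outstanding, while `run_burst(1, 6)` loses the two overwritten buffers (still in use and locked) and reaps the two surviving slots twice, recording two double unlocks. That is only one conceivable way for the private list to desync.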

Paul.