[548] in linux-net channel archive
double (un)lock on device queue - skb probs.
daemon@ATHENA.MIT.EDU (Paul Gortmaker)
Tue Jun 20 13:33:24 1995
From: Paul Gortmaker <Paul.Gortmaker@anu.edu.au>
To: linux-net@vger.rutgers.edu
Date: Wed, 21 Jun 1995 01:22:39 +1000 (EST)
I noticed, while bashing out some tests, that I was able to generate
a lot of "double lock on device queue" messages. (I also got a single
"double unlock on device queue" once, which is even scarier.)
I figured that it was something that I managed to break in some obscure
way with my patches, but when I went to a clean v1.2.10, it was still
there. I then went back and rebuilt a clean 1.2.8, but it too did the
same thing. And sure enough, 1.3.3 suffers from it as well. (gcc-2.5.8
was used in all cases)
The test was a simple:
rsh otherhost 'cat linux-1.3.0.tar.gz' > /dev/null
contained in a loop.
This is from stock 1.2.10, after stopping the test. Note that even after
the network is quiet, there is still >1MB left in stale skb's. Also note
the number of "free while locked" events is huge. If you wait long enough,
all the memory gets eaten up and you grind to a halt. Not good.
Networking buffers in use           : 754
Memory committed to network buffers : 1281318
Network buffers locked by drivers   : 14
Total network buffer allocations    : 735285
Total failed network buffer allocs  : 0
Total free while locked events      : 1781
System is a 486DX33, 16MB, 79c970/lance on 7.15MHz ISA bus. The box that
is issuing the rsh is a lowly 4MB unit with a NE2k card, running 1.3.3
and it doesn't complain. Everything drops to zero on the NE2k when you
stop the test (except the total allocations, of course), hence I suspect
that it is something in the lance driver.
Now here is what (I think) is happening. During an xmit, most drivers
peel the data out of the skb via memcpy() or whatever, and then do
a dev_kfree_skb(skb, FREE_WRITE) before exiting the xmit function.
However, the lance (and the tulip) drivers hold onto the Tx skb until
the interrupt handler receives a Tx-done interrupt. My two guesses are that
the lance driver is munging its internal skb list under heavy Tx activity,
or something deep in the net code is shuffling the skb's after the Tx
function completes, and hence the lance's personal Tx skb list goes
out of sync. Either way this is bad. I spent a while looking at the
code, but nothing jumped off the page at me.
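
To make the two guesses above a bit more concrete, here is a toy
user-space model in C. It is only a sketch, not the lance driver and not
the kernel's skb code: every name in it (toy_skb, toy_xmit, toy_tx_done
and so on) is made up for illustration, and the lock counter is merely
assumed to behave like whatever check prints the two warnings. It shows
one way a driver that keeps Tx buffers until Tx-done could trigger each
message: the same buffer being handed to xmit twice, and a ring index
that slips back so one slot is completed twice.

```c
#include <assert.h>
#include <stdio.h>

/* Toy stand-in for a network buffer; NOT the kernel's struct sk_buff. */
struct toy_skb {
    int lock;               /* times a driver currently holds this buffer */
};

/* Counts of the two warnings discussed in this post. */
struct toy_stats {
    int double_lock;
    int double_unlock;
};

/* Hypothetical lock helper: warn if the buffer is already held. */
static void toy_lock(struct toy_skb *skb, struct toy_stats *st)
{
    if (skb->lock != 0) {
        st->double_lock++;
        printf("double lock on device queue\n");
    }
    skb->lock++;
}

/* Hypothetical unlock helper: warn if the buffer is not held at all. */
static void toy_unlock(struct toy_skb *skb, struct toy_stats *st)
{
    if (skb->lock == 0) {
        st->double_unlock++;
        printf("double unlock on device queue\n");
        return;
    }
    skb->lock--;
}

#define TOY_RING 4

/* The driver's private list of buffers that await a Tx-done interrupt. */
struct toy_tx_ring {
    struct toy_skb *slot[TOY_RING];
    unsigned cur;           /* next slot to fill */
    unsigned dirty;         /* oldest slot still awaiting Tx-done */
};

/* Xmit path of a "hold until Tx-done" driver: lock and remember the skb. */
static void toy_xmit(struct toy_tx_ring *r, struct toy_skb *skb,
                     struct toy_stats *st)
{
    assert(r->cur - r->dirty < TOY_RING);   /* ring must not be full */
    toy_lock(skb, st);
    r->slot[r->cur++ % TOY_RING] = skb;
}

/* Tx-done interrupt: release the oldest pending buffer.
 * The pointer is left in the slot, so a stale entry can be seen again. */
static void toy_tx_done(struct toy_tx_ring *r, struct toy_stats *st)
{
    assert(r->dirty != r->cur);             /* something must be pending */
    struct toy_skb *skb = r->slot[r->dirty++ % TOY_RING];
    toy_unlock(skb, st);
}

/* Guess 2: upper layers hand over a buffer the driver still holds. */
static struct toy_stats toy_scenario_requeue(void)
{
    struct toy_stats st = {0, 0};
    struct toy_tx_ring ring = {{0}, 0, 0};
    struct toy_skb a = {0};

    toy_xmit(&ring, &a, &st);   /* driver now holds a */
    toy_xmit(&ring, &a, &st);   /* same buffer again: double lock */
    toy_tx_done(&ring, &st);
    toy_tx_done(&ring, &st);
    return st;
}

/* Guess 1: the driver's own index goes out of sync with the hardware. */
static struct toy_stats toy_scenario_stale_slot(void)
{
    struct toy_stats st = {0, 0};
    struct toy_tx_ring ring = {{0}, 0, 0};
    struct toy_skb a = {0};

    toy_xmit(&ring, &a, &st);
    toy_tx_done(&ring, &st);    /* a is released normally */
    ring.dirty--;               /* simulated bug: index slips back one */
    toy_tx_done(&ring, &st);    /* stale slot completed again: double unlock */
    return st;
}
```

Calling toy_scenario_requeue() from a small main() produces one double
lock and no double unlock, while toy_scenario_stale_slot() produces the
reverse. Neither scenario is claimed to be what the lance driver
actually does; they only show that each message has at least one simple
cause in a driver that holds its Tx buffers.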
Paul.