[8206] in linux-scsi channel archive


semaphores and queueing commands

daemon@ATHENA.MIT.EDU (Matthew Dharm)
Thu Feb 24 01:21:57 2000

Date:   Wed, 23 Feb 2000 22:02:25 -0800 (PST)
From: Matthew Dharm <mdharm-scsi@one-eyed-alien.net>
To: The Linux SCSI list <linux-scsi@vger.rutgers.edu>
Message-ID: <Pine.LNX.4.10.10002232135460.29543-100000@ziggy.one-eyed-alien.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

Okay, this e-mail is likely to be long and rambling because I really don't
know what the problem is here.  So I'm going to basically tell you the
story of what I've been seeing and doing, so perhaps you can help.  The
short summary is this: I can't seem to get my HBA driver to handle command
queueing correctly.

Now, the story...

I'm working on the USB Mass Storage driver, which shows up as an emulated
SCSI host.  Since much of what has to happen is copying data from one
place to another, I decided to play around with scatter-gather and
clustering parameters to improve performance.

Well, I noticed that with both scatter-gather and clustering off, a
problem which had previously been infrequent became constant.  The problem
was this:  the command queueing was missing commands.

The queueing code has a queue depth of 1.  Basically, I put the command
out there on a pointer and do a wake_up() on a wait_queue_head_t.  There is
a thread out there which runs in a loop, and at the top of the loop is an
interruptible_sleep_on() on the same wait_queue_head_t.  Well, I realized
that what was happening was that after I called the scsi_done() function,
but before I could put the thread back to sleep, the SCSI layer was trying
to queue a new command for the host.  This called the wake_up(), but since
I wasn't already sleeping on that wait queue, I missed the wakeup.
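
The missed wakeup described above can be illustrated in user space.  The
following is only a sketch, assuming POSIX threads rather than the kernel
wait-queue API; struct mailbox, mailbox_post() and mailbox_wait() are
hypothetical names, not anything from the driver.  The point it shows is
that the waiter tests a condition (is a command pending?) under a lock
before sleeping, so a wake-up that arrives early is not lost.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-slot mailbox (queue depth of 1). */
struct mailbox {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    void           *cmd;    /* NULL means "no command pending" */
};

#define MAILBOX_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL }

/* Producer: publish the command under the lock, then signal.
 * Returns false if the single slot is already occupied. */
static bool mailbox_post(struct mailbox *mb, void *cmd)
{
    bool ok = false;

    pthread_mutex_lock(&mb->lock);
    if (mb->cmd == NULL) {
        mb->cmd = cmd;
        ok = true;
        pthread_cond_signal(&mb->wake);
    }
    pthread_mutex_unlock(&mb->lock);
    return ok;
}

/* Consumer: sleep only while the slot is empty.  Because the check
 * and the sleep happen under the same lock, a post that happened
 * before this call is seen immediately instead of being missed. */
static void *mailbox_wait(struct mailbox *mb)
{
    void *cmd;

    pthread_mutex_lock(&mb->lock);
    while (mb->cmd == NULL)
        pthread_cond_wait(&mb->wake, &mb->lock);
    cmd = mb->cmd;
    mb->cmd = NULL;
    pthread_mutex_unlock(&mb->lock);
    return cmd;
}
```

Posting before anyone waits and then calling mailbox_wait() returns the
command without sleeping, which is exactly the case the bare
wake_up()/sleep pairing gets wrong.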

So, I decided to try using a semaphore instead.  Two of them, actually.
The idea would be that the host thread would try to down() the 'run'
semaphore at the top of the loop (run starts out locked).  When I had a
command queued, I would just up() that semaphore, and that would start the
thread processing the command.  To make sure that I didn't double-queue
commands, I used the second semaphore to protect the command pointer.

So, the queue command routine looked like this:

down(data_protect)
place scsi_cmnd where it can be found by the thread
up(data_protect)
up(run)

and the host thread looks like this:

down(run)
down(data_protect)
do stuff
do stuff
do lots more stuff
call the scsi_done() function
up(data_protect)

Note that data_protect starts unlocked, and run starts locked.
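
For readers who want to experiment with this hand-off outside the kernel,
here is a rough user-space sketch using POSIX semaphores in place of the
kernel's down()/up().  Everything except the names run and data_protect is
an assumption made for illustration: the 'finished' semaphore stands in
for scsi_done(), the 'stop' flag exists only so the sketch can exit, and
"do stuff" is replaced by doubling an integer.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

struct host {
    sem_t run;           /* starts locked (0) */
    sem_t data_protect;  /* starts unlocked (1), guards cmd and stop */
    sem_t finished;      /* ASSUMPTION: stand-in for scsi_done() */
    int  *cmd;           /* the single queued "command" */
    int   stop;          /* ASSUMPTION: lets the sketch shut down */
};

/* Queue routine: same four steps as in the message. */
static void queue_command(struct host *h, int *cmd)
{
    sem_wait(&h->data_protect);
    h->cmd = cmd;
    sem_post(&h->data_protect);
    sem_post(&h->run);
}

/* Host thread: blocks on run at the top of its loop. */
static void *host_thread(void *arg)
{
    struct host *h = arg;

    for (;;) {
        sem_wait(&h->run);
        sem_wait(&h->data_protect);
        if (h->stop) {
            sem_post(&h->data_protect);
            break;
        }
        *h->cmd *= 2;             /* "do stuff" */
        sem_post(&h->finished);   /* report completion ... */
        sem_post(&h->data_protect); /* ... then release the pointer */
    }
    return NULL;
}

/* Submit 1..n one at a time and return the sum of the results. */
static int run_commands(int n)
{
    struct host h;
    pthread_t t;
    int sum = 0;

    sem_init(&h.run, 0, 0);
    sem_init(&h.data_protect, 0, 1);
    sem_init(&h.finished, 0, 0);
    h.cmd = NULL;
    h.stop = 0;
    pthread_create(&t, NULL, host_thread, &h);

    for (int i = 1; i <= n; i++) {
        int value = i;
        queue_command(&h, &value);
        sem_wait(&h.finished);    /* depth 1: wait before queueing more */
        sum += value;
    }

    sem_wait(&h.data_protect);
    h.stop = 1;
    sem_post(&h.data_protect);
    sem_post(&h.run);             /* wake the thread so it sees stop */
    pthread_join(t, NULL);

    sem_destroy(&h.run);
    sem_destroy(&h.data_protect);
    sem_destroy(&h.finished);
    return sum;
}
```

Note that the sketch only stays correct because the submitter waits for
completion before queueing again.  If two commands were queued back to
back, the second would overwrite the pointer before the thread had looked
at the first.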

Well, initial results looked good.  Fewer commands were missed and
response times were significantly better.  So, I decided to do a little
stress test.  That's when I ran into problems.

Under heavy access, the kernel either Aiees or Panics.  Unfortunately,
nothing that indicates anything went wrong is in the logs.  The
information dumped to the screen scrolls so that all I can see is 23 rows
of hex numbers (addresses?) and a few 'interesting' lines at the bottom.
The stress tests that I tried used hdparm -t, fsck.ext2, and rm -rf on a
USB Zip 250.

So, what were those interesting lines?  The first two times I tried this,
I got the following (note that these are hand-copied and might be prone
to slight errors):

Code: 89 02 85 c0 74 03 89 50 04 b8 01 00 00 00 eb 05 8d 76 00 31
Aiee, killing interrupt handler
Unable to handle kernel paging request

Note that even the code line was the same the first two times.  I tried
re-writing a suspicious looking piece of code and recompiling the module.
When I did that, I got the following:

Code: 89 50 04 b8 01 00 00 00 eb 05 8d 76 00 31 c0 c7 43 04 00 00
Kernel panic: Attempted to kill the idle task!
In interrupt handler - not syncing.

It's interesting to note that the 'code' line is actually the same as the
first, just shifted down 6 bytes or so.  BTW, this is on an AMD-K6II
400MHz machine running 2.3.47.  After I see these messages, the machine is
hard-locked.  Nothing responds (including magic-SysRq).  I have to use the
reset button to reboot the machine.
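
The relationship between the two 'Code:' lines can be checked
mechanically.  This is a small sketch over the bytes exactly as
hand-copied above (so any copying error carries over); overlap_shift() is
a hypothetical helper written for this check.

```c
#include <stddef.h>
#include <string.h>

/* The two "Code:" lines as transcribed in the message. */
static const unsigned char first_code[20] = {
    0x89, 0x02, 0x85, 0xc0, 0x74, 0x03, 0x89, 0x50, 0x04, 0xb8,
    0x01, 0x00, 0x00, 0x00, 0xeb, 0x05, 0x8d, 0x76, 0x00, 0x31
};
static const unsigned char second_code[20] = {
    0x89, 0x50, 0x04, 0xb8, 0x01, 0x00, 0x00, 0x00, 0xeb, 0x05,
    0x8d, 0x76, 0x00, 0x31, 0xc0, 0xc7, 0x43, 0x04, 0x00, 0x00
};

/* Return the smallest shift s such that a[s..len) equals the start
 * of b, requiring at least min_overlap matching bytes; -1 if none. */
static int overlap_shift(const unsigned char *a, const unsigned char *b,
                         size_t len, size_t min_overlap)
{
    for (size_t s = 0; s + min_overlap <= len; s++)
        if (memcmp(a + s, b, len - s) == 0)
            return (int)s;
    return -1;
}
```

Calling overlap_shift(first_code, second_code, 20, 4) gives a shift of 6,
with the remaining 14 bytes of the first dump matching the start of the
second.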

It's pretty clear that, whatever I'm doing, it's making interrupt handling
upset.  Am I not allowed to do a down() and block on a semaphore in a
queue command routine?  Or am I simply blocking for too long?  How long is
too long, really?

Another thought I had was that when I call scsi_done(), I'm indirectly
causing the command queueing function to be called.  But, I've already
locked data_protect, thus the queueing function blocks indefinitely.  Is
that how scsi_done() works?  I really have no idea what the internals of
that mechanism are.
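
The hypothesis in the previous paragraph can be modelled in user space.
This sketch ASSUMES, purely for illustration, that the completion call
re-enters the queue routine on the same thread; whether scsi_done()
really behaves that way is the open question.  fake_done() and
try_queue_command() are hypothetical names, and sem_trywait() is used
instead of a blocking wait so that the sketch reports the problem rather
than hanging.

```c
#include <semaphore.h>

static sem_t data_protect;
static int would_block;  /* set if re-entrant queueing finds it held */

/* Non-blocking version of the queue routine: returns -1 where a
 * blocking down()/sem_wait() would have gone to sleep. */
static int try_queue_command(void)
{
    if (sem_trywait(&data_protect) != 0)
        return -1;
    /* (the command pointer would be stored here) */
    sem_post(&data_protect);
    return 0;
}

/* Stand-in for scsi_done().  ASSUMPTION: completing a command
 * immediately tries to queue the next one from the same context. */
static void fake_done(void)
{
    would_block = (try_queue_command() != 0);
}

/* Returns 1 if queueing would have blocked while the host thread
 * held data_protect, and succeeds again once it is released. */
static int simulate_completion(void)
{
    int after;

    sem_init(&data_protect, 0, 1);
    would_block = 0;

    sem_wait(&data_protect);  /* host thread takes the semaphore */
    fake_done();              /* completes the command while holding it */
    sem_post(&data_protect);  /* released only after completion */

    after = try_queue_command();
    sem_destroy(&data_protect);
    return would_block && after == 0;
}
```

Under that assumption the blocking version would wait on a semaphore that
only its own caller can release.  Releasing data_protect before reporting
completion would avoid that in this model.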

I should point out that, before I made the jump to semaphores, the driver
worked well for these tests.  Yes, I was losing some commands.  But there
were no OOPSes or Panics or Aiees during that time.

And, while I'm writing this, is there some generic kernel queueing code
that I can use somewhere?  i.e. something that allows me to declare a
queue, and enqueue pointers on one end, and provides a routine to get them
off the other end (blocking if necessary/desired)?  Eventually I would
actually like to support command queueing for real, not just with a depth
of 1.
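
To make the request concrete, this is roughly the interface being asked
for, sketched in user space with POSIX threads.  It is not a kernel API;
struct ptr_queue, the pq_* functions and PQ_CAPACITY are all hypothetical
names chosen for the example.

```c
#include <pthread.h>
#include <stddef.h>

#define PQ_CAPACITY 8   /* arbitrary depth for the sketch */

/* Bounded FIFO of pointers, stored as a ring buffer. */
struct ptr_queue {
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
    pthread_cond_t  not_full;
    void  *slots[PQ_CAPACITY];
    size_t head;    /* index of the oldest entry */
    size_t count;   /* number of entries in use */
};

static void pq_init(struct ptr_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
    q->head = 0;
    q->count = 0;
}

/* Enqueue at the tail, blocking while the queue is full. */
static void pq_put(struct ptr_queue *q, void *p)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == PQ_CAPACITY)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->slots[(q->head + q->count) % PQ_CAPACITY] = p;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Dequeue from the head; if block is zero, return NULL when empty
 * instead of sleeping. */
static void *pq_get(struct ptr_queue *q, int block)
{
    void *p = NULL;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && block)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count > 0) {
        p = q->slots[q->head];
        q->head = (q->head + 1) % PQ_CAPACITY;
        q->count--;
        pthread_cond_signal(&q->not_full);
    }
    pthread_mutex_unlock(&q->lock);
    return p;
}
```

A queue routine would call pq_put() and the host thread would call
pq_get(q, 1) at the top of its loop.  The blocking calls here sleep, so
the same caveat about where it is legal to block applies to them as to
down().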

Any help that anyone can give me on this subject would be greatly
appreciated.

Matt Dharm

-- 
Matthew Dharm                              Home: mdharm@one-eyed-alien.net 
Engineer, Qualcomm, Inc.                         Work: mdharm@qualcomm.com

P:  How about "Web Designer"?
DP: I'd like a name that people won't laugh at.
					-- Pitr and Dust Puppy
User Friendly, 12/6/1997


