[3035] in linux-scsi channel archive
Re: RAID & unhappy scsi driver
daemon@ATHENA.MIT.EDU (Doug Ledford)
Mon Jan 5 17:16:44 1998
In-Reply-To: <199801052132.PAA19260@shadygrove.linas.org>
Date: Mon, 05 Jan 1998 15:32:07 -0600 (CST)
From: Doug Ledford <dledford@dialnet.net>
To: linas@linas.org
Cc: linux-raid@vger.rutgers.edu, linux-scsi@vger.rutgers.edu,
linux-eata@trudi.zdv.Uni-Mainz.DE, linas@fc.net
On 05-Jan-98 linas@linas.org wrote:
>It's been rumoured that Doug Ledford said:
>> >I was unable to reboot until I went into bios and disabled the disk.
>>
>> This is definitely a sign of something to do with the disk and not the
>> driver or controller.
>
>Well, the disk *did* make that distinctive scsi kerplunk noise. I last
>heard this noise on a failed 12-inch multi-platter 200MB drive a decade ago.
>I was surprised that the sound was the same!
heheheh :)
>> It's very likely that the controller and driver were fine. More likely is
>> that the particular error mode the Seagate drive went into resulted in
>> complete SCSI bus wedges. Keep in mind that the SCSI bus has a shared BUSY
>> signal pin. Any device can make that pin active. If it makes that pin
>> active and never releases it, then nothing, and I mean *nothing*, will take
>> place on that bus until the drive is removed.
>
>The "busy" LED on the drive was full-on when I found it. Assuming the LED is
>wired to the busy pin, that would match your explanation.
>
>Although I did not watch carefully, the hangs did seem associated with
>the light going on & staying on.
Very likely my description is correct then.
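
To make the shared-pin point concrete, here is a toy model in plain C. It is
only an illustration, not driver code: the struct and function names are made
up for this sketch, and the 8-ID bus width is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: assume a narrow bus with 8 SCSI IDs. */
#define MAX_TARGETS 8

struct scsi_bus_model {
    /* true while the device with that ID is driving the shared BUSY line */
    bool drives_bsy[MAX_TARGETS];
};

/* The line is shared: the bus reads busy if ANY one device asserts it. */
static bool bus_is_busy(const struct scsi_bus_model *bus)
{
    for (size_t id = 0; id < MAX_TARGETS; id++) {
        if (bus->drives_bsy[id])
            return true;
    }
    return false;
}

/* Nothing new can start on the bus until every device has let go. */
static bool can_start_command(const struct scsi_bus_model *bus)
{
    return !bus_is_busy(bus);
}
```

With one entry of drives_bsy stuck at true, can_start_command() stays false no
matter what the other seven devices or the host do, which is the wedge
described above.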
>> but, the simple fact of life is, with this type of failure, if the SCSI
>> subsystem doesn't quit sending commands to that drive sometime, then it's
>> going to make the machine unusable forever.
>
>Don't know if interrupts are blocked during reset, but it would be nice
>to continue multi-tasking as much as possible in the interim. This was
>not the case for me, since, presumably the bus was locked, and some critical
>part of the OS had been paged out.
It's not that the OS had been paged out. The actual cause of that particular
behavior is something I am working on. The current driver uses
cli()/sti()-style locking throughout. During a reset routine, we *have* to
lock out the interrupt driver entirely because our reset routine mucks with
the *entire* SCB array, all of the various queues, and the card hardware.
So, we disable interrupts. We are only in the interrupt routine for less
than a second though (far less, more like a few milliseconds for a full bus
reset). But, in order to guarantee that devices have a chance to settle
after the reset before commands start flowing again, we set the value of
host->last_reset in the mid level code to a value that happens to be in the
future. As it turns out, the mid level code appears to keep interrupts
locked down while waiting for this value to expire, thus resulting in the
behavior you see.
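
One possible alternative to waiting with interrupts locked down is to treat
the settle time as a deadline that is checked each time a command is about to
be issued. The following is a rough user-space sketch of that idea, not the
actual mid level code. Everything in it (host_model, settle_until, ticks_t,
the function names) is hypothetical; ticks_t merely stands in for a
free-running tick counter such as jiffies.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a free-running tick counter such as jiffies. */
typedef unsigned long ticks_t;

struct host_model {
    ticks_t settle_until;   /* hold commands back until this tick is reached */
};

/* Wrap-safe test for "a is later than b" on a counter that may overflow. */
static bool tick_after(ticks_t a, ticks_t b)
{
    return (long)(b - a) < 0;
}

/* Record a bus reset: devices get settle_ticks ticks to recover. */
static void note_bus_reset(struct host_model *host, ticks_t now,
                           ticks_t settle_ticks)
{
    host->settle_until = now + settle_ticks;
}

/*
 * Non-blocking check. If this returns false the caller leaves the command
 * queued and tries again later, instead of spinning until the deadline.
 */
static bool may_issue_command(const struct host_model *host, ticks_t now)
{
    return !tick_after(host->settle_until, now);
}
```

The point of the design is that nothing ever sleeps or spins here; the rest
of the system keeps running while the deadline is pending, and commands start
flowing again on the first check made at or after settle_until.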
>This may be a case for maintaining a permanent initrd for mission-critical
>servers. Details elude me, but at least a kernel & some network utils
>should be kept there.
To be quite honest, the failure mode you saw (the drive going totally out to
lunch) isn't a real common error mode. It usually indicates that the error
on the drive itself was so severe that the firmware didn't know what to do
or it is a bug in the firmware. As I recall, you said this was a Seagate
disk. I'm a little surprised that Seagate, who usually has above par
firmware, would have this problem. But, this falls in the same category as
a catastrophic memory bus failure. Unless you build complete system
redundancy, you won't get around this problem. I'm not sure there is a
hardware RAID product on the market that could survive this since the drive
was actually taking out the entire SCSI bus.
>I'm wondering if it may be possible to keep a history of drive access
>patterns shortly after a bus reset. Through automatic analysis of the
>failures and lessons learned through subsequent failures & resets, it might
>be possible to locate the misbehaving drive. I don't know enough about the
>subsystem to understand how practical this would be; I have written
>automatic error-recovery code several times, and in each case, it snowballed
>into a highly intertwined mess that always seemed to have some bug that got
>triggered less and less often with each patch.
This sounds about right. However, as you mentioned in another email, it
says something about Linux that we now have to worry about these types of
things. It hasn't been something I've worried about in the past. However,
this issue has given me an idea. In my current working code, I'm keeping
track of the currently active commands on the bus. It would be an almost
trivial thing to add a small section of code so that on a second bus reset
that appears to be a repeat issue, we switch from normal command mode to
single command at a time mode (I already maintain my own internal queues,
one for each device, one waiting queue, and one complete queue). I could
simply watch the number of globally active commands after a repeat reset
condition, hold it down to one command at a time, and wait for the reset to
occur again (given a command limit of course). Once we get another reset,
the only active command on the bus (and its target) should be the culprit,
so I could specifically tell the mid level code that the command in question
should be entirely aborted instead of re-tried. Hell, with a single command
at a time, I don't even need to wait for the reset, I can get pre-emptive on
the abort cycle. I could then set either a counter or a timer, and after x
commands or x seconds, if nothing has happened again, start opening the bus
back up to full speed. That might allow me (and the mid level code) to
track down the defective drive and get on with life.
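
A rough sketch of how that throttling logic might look, written as plain
user-space C rather than as a patch to any driver. All of the names here
(isolate_state, isolate_may_start and so on) are invented for the example,
and both thresholds are arbitrary assumptions standing in for "second reset"
and "x commands". The timer variant and the pre-emptive abort are left out.

```c
#include <stdbool.h>

enum bus_mode { MODE_NORMAL, MODE_SINGLE };

/* Assumed values: second reset throttles the bus, 100 clean commands reopen it. */
#define REPEAT_RESET_THRESHOLD 2
#define CLEAN_CMDS_TO_REOPEN   100

struct isolate_state {
    enum bus_mode mode;
    int resets_seen;        /* resets since the bus was last considered healthy */
    int active_cmds;        /* commands currently outstanding on the bus */
    int active_target;      /* only meaningful while active_cmds == 1 */
    int clean_completions;  /* completions since the last reset, single mode */
};

static void isolate_init(struct isolate_state *s)
{
    s->mode = MODE_NORMAL;
    s->resets_seen = 0;
    s->active_cmds = 0;
    s->active_target = -1;
    s->clean_completions = 0;
}

/* Returns true if the command may go out now, false to keep it queued. */
static bool isolate_may_start(struct isolate_state *s, int target)
{
    if (s->mode == MODE_SINGLE && s->active_cmds > 0)
        return false;
    s->active_cmds++;
    s->active_target = target;
    return true;
}

/* A command finished normally; after enough of these, open the bus back up. */
static void isolate_complete(struct isolate_state *s)
{
    if (s->active_cmds > 0)
        s->active_cmds--;
    if (s->active_cmds == 0)
        s->active_target = -1;
    if (s->mode == MODE_SINGLE &&
        ++s->clean_completions >= CLEAN_CMDS_TO_REOPEN) {
        s->mode = MODE_NORMAL;
        s->resets_seen = 0;
        s->clean_completions = 0;
    }
}

/* Called on a bus reset. Returns the suspect target, or -1 if unknown. */
static int isolate_on_reset(struct isolate_state *s)
{
    int suspect = -1;

    /* With exactly one command in flight, its target is the likely culprit. */
    if (s->mode == MODE_SINGLE && s->active_cmds == 1)
        suspect = s->active_target;

    if (++s->resets_seen >= REPEAT_RESET_THRESHOLD)
        s->mode = MODE_SINGLE;
    s->clean_completions = 0;

    /* The reset flushes whatever was outstanding. */
    s->active_cmds = 0;
    s->active_target = -1;
    return suspect;
}
```

A caller would check isolate_may_start() before sending each command, call
isolate_complete() on normal completion, and on a reset pass any target
returned by isolate_on_reset() along so that its command is aborted rather
than retried.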
----------------------------------
E-Mail: Doug Ledford <dledford@dialnet.net>
Date: 05-Jan-98
Time: 15:32:08
----------------------------------