[3033] in linux-scsi channel archive
Re: RAID & unhappy scsi driver
daemon@ATHENA.MIT.EDU (linas@linas.org)
Mon Jan 5 15:22:49 1998
To: dledford@dialnet.net (Doug Ledford)
Date: Mon, 5 Jan 1998 15:32:44 -0600 (CST)
Cc: linas@fc.net, linux-eata@trudi.zdv.Uni-Mainz.DE,
linux-scsi@vger.rutgers.edu, linux-raid@vger.rutgers.edu
In-Reply-To: <XFMail.980105033619.dledford@dialnet.net> from "Doug Ledford" at Jan 5, 98 03:16:30 am
From: linas@linas.org
It's been rumoured that Doug Ledford said:
> >I was unable to reboot until I went into bios and disabled the disk.
>
> This is definitely a sign of something to do with the disk and not the
> driver or controller.
Well, the disk *did* make that distinctive SCSI kerplunk noise. I last
heard this noise on a failed 12-inch multi-platter 200MB drive a decade ago.
I was surprised that the sound was the same!
> The only RAID solutions intended to address issues such as this are RAID 1
> and 5. What RAID are you running?
Both.
> It's very likely that the controller and driver were fine. More likely is
> that the particular error mode the Seagate drive went into resulted in
> complete SCSI bus wedges. Keep in mind that the SCSI bus has a shared BUSY
> signal pin. Any device can make that pin active. If it makes that pin
> active and never releases it, then nothing, and I mean *nothing*, will take
> place on that bus until the drive is removed.
The "busy" LED on the drive was full-on when I found it. Assuming the LED is
wired to the busy pin, that would match your explanation.
Although I did not watch carefully, the hangs did seem associated with
the light going on & staying on.
> Now, it very well may be a
> case of something like the drive will negotiate just fine and respond to the
> normal inquiry commands at bootup, but the first time you try to use it (or
> the first time you access a particular media location) the drive could end
> up going to Kansas.
Yep, that seemed to be the symptom.
> but, the simple fact of life is, with this type of failure, if the SCSI
> subsystem doesn't quit sending commands to that drive sometime, then it's
> going to make the machine unuseable forever.
I don't know if interrupts are blocked during reset, but it would be nice
to continue multi-tasking as much as possible in the interim. This was
not the case for me since, presumably, the bus was locked and some critical
part of the OS had been paged out.
This may be a case for maintaining a permanent initrd for mission-critical
servers. Details elude me, but at least a kernel & some network utils
should be kept there.
> At some point, the RAID code
> would have to take the target drive off line. What's worse, if the bus
> reset didn't shake the flaky drive loose so that at least the other drives
> could work again, then it could render every drive on that bus dead.
... initrd ...
> Additionally, even if the drive does shake loose on a reset, the code
> snippet above could result in all drives going dead if every time the mid
> level SCSI code sends commands back to us to be re-tried it always sends
> commands for the flaky drive first and other drives later. In that case, we
> would send our first command to the flaky drive, it would re-wedge the SCSI
> bus, and because it was the first command sent, the other drives would never
> have a chance to complete a command and reset their own DEVICE_SUCCESS
> flags, which means there would be no way to differentiate between the flaky
> drive and the others on the SCSI bus and they would *all* get marked as bad.
I'm wondering if it may be possible to keep a history of drive access patterns
shortly after a bus reset. Through automatic analysis of the failures and
lessons learned through subsequent failures & resets, it might be possible
to locate the misbehaving drive. I don't know enough about the subsystem
to understand how practical this would be; I have written automatic error-recovery
code several times, and in each case, it snowballed into a highly intertwined
mess that always seemed to have some bug that got triggered less and less often
with each patch.
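
A minimal sketch of the history idea above, in Python for brevity rather
than kernel C. Everything in it is hypothetical: the ResetHistory class,
its method names, and the min_wedges threshold are made up for
illustration and are not part of any real SCSI driver or mid-level code.
It assumes the driver can tell when a command is issued, when one
completes, and when the bus wedges and has to be reset.

```python
from collections import Counter


class ResetHistory:
    """Hypothetical bookkeeping of commands issued between bus resets."""

    def __init__(self):
        self.blame = Counter()  # target id -> wedges it was outstanding for
        self.window = []        # [target, completed] entries since last reset

    def command_issued(self, target):
        self.window.append([target, False])

    def command_completed(self, target):
        # Mark the oldest outstanding command for this target as done.
        for entry in self.window:
            if entry[0] == target and not entry[1]:
                entry[1] = True
                break

    def bus_wedged(self):
        """Call when the bus hangs and a reset is needed."""
        # Every target with an unfinished command is a suspect this round.
        outstanding = {t for t, done in self.window if not done}
        for t in outstanding:
            self.blame[t] += 1
        self.window = []

    def suspect(self, min_wedges=2):
        """Return the single most-blamed target, or None if unclear."""
        ranked = self.blame.most_common(2)
        if not ranked or ranked[0][1] < min_wedges:
            return None
        if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
            return None  # tie: history cannot separate the drives yet
        return ranked[0][0]
```

Note that this only helps if the retry order varies between resets. If
the flaky drive always gets the first command, no other drive ever
completes anything, every target collects the same blame, and suspect()
keeps returning None, which is exactly the case Doug describes.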
--linas