[3029] in linux-scsi channel archive

Re: RAID & unhappy scsi driver

daemon@ATHENA.MIT.EDU (Gadi Oxman)
Mon Jan 5 08:59:03 1998

Date: 	Mon, 5 Jan 1998 15:46:35 +0300 (IST)
From: Gadi Oxman <gadio@netvision.net.il>
To: Linas Vepstas <linas@fc.net>
cc: dledford@dialnet.net, linux-raid@vger.rutgers.edu,
        linux-scsi@vger.rutgers.edu, linux-eata@trudi.zdv.Uni-Mainz.DE
In-Reply-To: <34B0AD32.1B21E4AC@fc.net>

On Mon, 5 Jan 1998, Linas Vepstas wrote:

> I am disappointed to point out the following kernel "bug":
> 
> Recently set up RAID w/ several seagates & adaptec 2940 on 2.0.33
> kernel.
> After a few weeks, one of the drives failed.
> 
> I was unhappy to find the machine all but locked up as a result,
> unpingable, untelnetable, etc.  (although the keyboard did wake
> up the sleeping monitor.)  Apparently the aic7xxx driver entered
> into some sort of infinite loop attempting to reset the scsi disk.
> 
> I was unable to reboot until I went into bios and disabled the disk.
> 
> This kind of driver behaviour completely negates the point of
> hot-plug drive bays, severely impacts high-availability, and puts
> a big dent in the philosophy of RAID.
> 
> Anyone experience anything similar?  Anyone working to improve
> the driver?
> 
> --------
> It occurs to me that some RAID setups might use one controller per disk,
> to avoid outage due to controller failure.   But are the scsi device
> drivers robust enough to not hang/panic the kernel if a controller fails
> to respond?
> 
> --linas
> 
> 
> --
> Linas Vepstas   -- linas@linas.org.spam.stopper -- http://linas.org/

One thing to keep in mind is that even if we are unable to recover cleanly
from a failed disk scenario, the redundancy is still mostly available -- in
the worst case, the kernel will crash but on the next boot we will be able
to access most of the data using the other operational drives.
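
To illustrate how the surviving drives can still supply the data, here is a
minimal sketch that assumes a parity-based (RAID-5 style) layout. The thread
does not say which RAID level is in use, so the layout, the block contents and
the helper name xor_blocks are all hypothetical; a mirrored setup would simply
read the other copy instead.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe: one data block per drive, plus one parity block.
data = [b"disk0blk", b"disk1blk", b"disk2blk"]
parity = xor_blocks(data)

# Suppose drive 1 has failed. XORing the surviving data blocks with the
# parity block yields the missing block again.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

This only covers data that was already committed to the array, with parity
included; writes that were in flight at the time of the crash are the part
that may be lost.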

Recovering gracefully from a failed disk scenario is certainly not easy,
either for the drive's firmware or for the kernel. Even when everything goes
smoothly, we will probably still experience an unpleasant error recovery
period, since we currently aren't able to "take back" requests which we
already queued to the failed drive prior to the failure, and the low level
drivers might be trying very hard to recover.

Gadi
