[8150] in linux-scsi channel archive
Re: recovery behaviour with 1 bad + 1 good drive (aic7xxx)
daemon@ATHENA.MIT.EDU (Guest section DW)
Tue Feb 22 11:35:31 2000
Message-ID: <20000222171107.A1076@win.tue.nl>
Date: Tue, 22 Feb 2000 17:11:07 +0100
From: Guest section DW <dwguest@win.tue.nl>
To: Matthias Andree <ma@dt.e-technik.uni-dortmund.de>,
linux-scsi@vger.rutgers.edu
In-Reply-To: <m34sb1cof4.fsf@emma1.emma.line.org>; from Matthias Andree on Tue, Feb 22, 2000 at 03:17:03PM +0100
On Tue, Feb 22, 2000 at 03:17:03PM +0100, Matthias Andree wrote:
> I am wondering if the behaviour I see is correct. The system becomes almost
> unresponsive while one disk is breaking down. Any access to the
> defective drive brings the bus down for quite some time,
> effectively several minutes.
>
> One drive (sdb, containing /usr/local, which is nice to have but not direly
> needed) dies at 6:33. In the following hour, the system goes into
> endless SCSI bus reset loops which cannot help since the hardware is
> broken. Eventually, after 7 o'clock, the driver degrades the bus, it
> finally is down to 20 MB/sec, without the slightest hope that this helps
> the drive out of its headcrash.
>
> During recovery, the bus resets trash right into read and write
> operations to /dev/sda, which also abort. Shortly before 8 o'clock, the
> system has finally responded to its fstab edit, and can finally be shot
> out of the way and rebooted (no sync, no init: reboot -n -f), at the
> expense of gory fscks on the intact drives.
>
> What we see is an endless series of bus resets due to timeouts, domain
> revalidations and aborts on the bus, taking the intact drive out of
> operation as well.
Yes. A very familiar scenario.
> I'd expect that after a couple of bus resets induced by only a single drive,
> the system decides that that drive is broken and does something similar
> to echo remove-single-device 0 0 1 0 >/proc/scsi/scsi or at least lock
> that drive so that the other operations can continue properly.
>
> I am annoyed by that Adaptec junk. Be it the 1542CF, be it the
> 2940UW. Expensive and useless if something goes wrong.
I'll not express an opinion about Adaptec - my main gripe is that
they refuse to provide docs - but the blame here is entirely on
the Linux kernel code. The SCSI error handling has always been a
mess, both the old and the new code.
Usually I disable the part of the error handling code that tries
to do bus device / bus / host resets for precisely the reason
you mention: these resets will kill a well-functioning system
that has one bad SCSI device.
Andries
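
The manual detach mentioned in the quoted message can be sketched as a
small shell helper. This is only an illustration, not something from the
thread: the function names and the PROC_SCSI override are hypothetical.
On kernels of this era the /proc/scsi/scsi interface expects the command
string to begin with the word "scsi", followed by host, channel, id and
lun. The values 0 0 1 0 are taken from the quoted message and are assumed
to address the failing drive.

```shell
#!/bin/sh
# Build the control string for the legacy /proc/scsi/scsi interface.
# Arguments: HOST CHANNEL ID LUN (all non-negative integers).
# scsi_remove_cmd is a hypothetical helper name, not a kernel tool.
scsi_remove_cmd() {
    if [ "$#" -ne 4 ]; then
        echo "usage: scsi_remove_cmd HOST CHANNEL ID LUN" >&2
        return 1
    fi
    for n in "$@"; do
        case "$n" in
            # Reject empty strings and anything containing a non-digit.
            ''|*[!0-9]*) echo "not a number: $n" >&2; return 1 ;;
        esac
    done
    printf 'scsi remove-single-device %s %s %s %s\n' "$1" "$2" "$3" "$4"
}

# Write the command to the control file. PROC_SCSI is a hypothetical
# override so the helper can be dry-run against an ordinary file.
scsi_remove_device() {
    cmd=$(scsi_remove_cmd "$@") || return 1
    printf '%s\n' "$cmd" > "${PROC_SCSI:-/proc/scsi/scsi}"
}
```

A dry run such as PROC_SCSI=/tmp/scsi.txt scsi_remove_device 0 0 1 0
writes the string to a scratch file instead of the kernel interface.
Against the real file this needs root, and any filesystem on the device
should be unmounted first.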