[8150] in linux-scsi channel archive



Re: recovery behaviour with 1 bad + 1 good drive (aic7xxx)

daemon@ATHENA.MIT.EDU (Guest section DW)
Tue Feb 22 11:35:31 2000

Message-ID: <20000222171107.A1076@win.tue.nl>
Date:   Tue, 22 Feb 2000 17:11:07 +0100
From: Guest section DW <dwguest@win.tue.nl>
To: Matthias Andree <ma@dt.e-technik.uni-dortmund.de>,
	linux-scsi@vger.rutgers.edu
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <m34sb1cof4.fsf@emma1.emma.line.org>; from Matthias Andree on Tue, Feb 22, 2000 at 03:17:03PM +0100

On Tue, Feb 22, 2000 at 03:17:03PM +0100, Matthias Andree wrote:

> I am wondering if the behaviour I see is correct. The system becomes almost
> unresponsive while one disk is breaking down. Any access to the
> defective drive will bring the bus down for quite some time,
> effectively several minutes.
> 
> One drive (sdb, containing /usr/local, which is nice to have but not direly
> needed) dies at 6:33. In the following hour, the system goes into
> endless SCSI bus reset loops, which cannot help since the hardware is
> broken. Eventually, after 7 o'clock, the driver degrades the bus until it
> is finally down to 20 MB/sec, without the slightest hope that this helps
> the drive out of its head crash.
> 
> During recovery, the bus resets trash right into read and write
> operations on /dev/sda, which also abort. Shortly before 8 o'clock, the
> system has finally responded to its fstab edit, and can finally be shot
> out of the way and rebooted (no sync, no init: reboot -n -f), at the
> expense of gory fscks on the intact drives.
> 
> What we see is an endless series of bus resets due to timeouts, domain
> revalidations and aborts on the bus, taking the intact drive out of
> operation as well.

Yes. A very familiar scenario.

> I'd expect that after a couple of bus resets induced by only a single drive,
> the system decides that that drive is broken and does something similar
> to echo remove-single-device 0 0 1 0 >/proc/scsi/scsi or at least lock
> that drive so that the other operations can continue properly. 
> 
> I am annoyed by that Adaptec junk. Be it the 1542CF, be it the
> 2940UW. Expensive and useless if something goes wrong.

I'll not express an opinion about Adaptec - my main gripe is that
they refuse to provide docs - but the blame here is entirely on
the Linux kernel code. The SCSI error handling has always been a
mess, both the old and the new code.

Usually I disable the part of the error handling code that tries
to do bus device / bus / host resets for precisely the reason
you mention: these resets will kill a well-functioning system
that has one bad SCSI device.

Andries
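
The policy asked for in the quoted message (give up on a drive after a couple of resets that it alone caused, so that the other devices can carry on) can be sketched roughly as below. This is only an illustration and not kernel code: ScsiAddress, ResetPolicy, record_reset and the threshold of 2 are hypothetical names and values chosen for the example. The string it returns assumes the "scsi remove-single-device host channel id lun" form of the /proc/scsi/scsi command, and the sketch only builds that string; it never writes to /proc.

```python
from collections import Counter
from typing import NamedTuple, Optional


class ScsiAddress(NamedTuple):
    """Hypothetical host/channel/id/lun tuple identifying one device."""
    host: int
    channel: int
    target: int
    lun: int


class ResetPolicy:
    """Toy model: take a device offline once it has caused too many resets."""

    def __init__(self, max_resets: int = 2) -> None:
        self.max_resets = max_resets      # assumed threshold, not from the thread
        self.resets: Counter = Counter()  # resets attributed to each device
        self.offline: set = set()         # devices we have given up on

    def record_reset(self, culprit: ScsiAddress) -> Optional[str]:
        """Count one reset caused by `culprit`.

        Returns the removal command text when the device crosses the
        threshold, otherwise None. Nothing is written to /proc here.
        """
        if culprit in self.offline:
            return None
        self.resets[culprit] += 1
        if self.resets[culprit] < self.max_resets:
            return None
        self.offline.add(culprit)
        return "scsi remove-single-device %d %d %d %d" % culprit

    def may_issue_io(self, addr: ScsiAddress) -> bool:
        """Only devices that have been given up on are refused I/O."""
        return addr not in self.offline


if __name__ == "__main__":
    policy = ResetPolicy(max_resets=2)
    sdb = ScsiAddress(0, 0, 1, 0)   # the failing drive in the example above
    sda = ScsiAddress(0, 0, 0, 0)   # the healthy drive on the same bus
    for _ in range(3):
        command = policy.record_reset(sdb)
        if command is not None:
            print("would send:", command)
    print("sda usable:", policy.may_issue_io(sda))
    print("sdb usable:", policy.may_issue_io(sdb))
```

The point of keeping the counter per device rather than per bus is that a single bad drive is the only one locked out, while I/O to the remaining devices on the same bus is still allowed.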


