[8167] in linux-scsi channel archive



Re: recovery behaviour with 1 bad + 1 good drive (aic7xxx)

daemon@ATHENA.MIT.EDU (Guest section DW)
Tue Feb 22 21:41:01 2000

Message-ID: <20000222212936.A1097@win.tue.nl>
Date:   Tue, 22 Feb 2000 21:29:36 +0100
From: Guest section DW <dwguest@win.tue.nl>
To: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: ma@dt.e-technik.uni-dortmund.de (Matthias Andree),
	linux-scsi@vger.rutgers.edu
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <E12NJRx-0002EF-00@the-village.bc.nu>; from Alan Cox on Tue, Feb 22, 2000 at 05:50:14PM +0000

On Tue, Feb 22, 2000 at 05:50:14PM +0000, Alan Cox wrote:

> > Usually I disable the part of the error handling code that tries
> > to do bus / host resets for precisely the reason you mention:
> > these resets will kill a well-functioning system
> > that has one bad SCSI device.
> 
> Do you think a SCSI blacklist entry for 'don't bus reset a bus with
> this piece of junk on' would help ?

No. The problem is not the device, it is our code.

I have seen this with disks that got a bad block,
with a disk that had a head crash, with bad CDROMs,
with a bad tape, with a scanner.
Only in the last case was the device itself a piece of junk.

If you read scsi_error.c the philosophy is: "Something went
wrong, what can we do to get it working again?".
And increasingly powerful measures are taken.
But I do not need a bus reset or host reset when the CDROM drive
times out on a marginal CD or when some disk stops functioning.

Indeed, disk errors hardly ever occur. And when they do,
I do not want to touch that drive anymore. I do not want
to get it functioning again. The system should leave it alone
and leave messages in the log, so that I can attempt to rescue
the contents later before discarding the disk.
About a year ago I lost a disk and the log showed long
continuous beating on the same disk area.
MEDIUM ERROR - TIMEOUT - ABORT - RESET - etc etc
When I got to this machine the next morning, the disk was
too hot to touch, and did not react to anything anymore.

With ext2 we have choices like panic on error / read-only on error /
continue on error. Similarly we could add per-device scsi choices:
on error, leave device alone / on error, beat device into submission.
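
To make the proposal concrete, here is a rough sketch of what such a
per-device choice could look like. This is not code from scsi_error.c;
the names eh_policy, eh_step and eh_next_step are hypothetical, and the
order of the steps (abort, device reset, bus reset, host reset, offline)
is an assumption made for illustration. With "leave alone", the failed
command is aborted once and the device is then taken offline, so no bus
or host reset is ever issued on its behalf.

```c
#include <assert.h>

/* Hypothetical per-device policy, modeled on the ext2 on-error choices. */
enum eh_policy {
    EH_LEAVE_ALONE,   /* on error, leave device alone */
    EH_ESCALATE       /* on error, beat device into submission */
};

/* Recovery steps in increasing severity (assumed ordering). */
enum eh_step {
    EH_NONE,          /* nothing tried yet */
    EH_ABORT,         /* abort the failed command */
    EH_DEVICE_RESET,  /* reset only the failing device */
    EH_BUS_RESET,     /* reset the whole bus: hits every device on it */
    EH_HOST_RESET,    /* reset the host adapter */
    EH_OFFLINE        /* give up: mark the device offline, log, stop */
};

/*
 * Given the policy for a device and the last step that was tried
 * without success, return the next step to take.
 */
enum eh_step eh_next_step(enum eh_policy policy, enum eh_step last)
{
    assert(last >= EH_NONE && last <= EH_OFFLINE);

    if (last == EH_OFFLINE)
        return EH_OFFLINE;            /* nothing stronger is left */

    if (policy == EH_LEAVE_ALONE) {
        /* One abort of the failed command, then hands off. */
        return last == EH_NONE ? EH_ABORT : EH_OFFLINE;
    }

    /* EH_ESCALATE: walk the ladder one rung at a time. */
    return (enum eh_step)(last + 1);
}
```

Under this sketch a marginal CD in a "leave alone" drive would cost one
abort and a log message, while the other devices on the bus would never
see a reset.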

[But this is an old discussion. Maybe I am alone in the
view that bus resets are terrible. At least Eric seems to think
that the probability that something useful is achieved by a reset
is larger than the probability that the situation only gets worse.
I think that bus resets should be initiated by a human only.]

Andries


