[8167] in linux-scsi channel archive
Re: recovery behaviour with 1 bad + 1 good drive (aic7xxx)
daemon@ATHENA.MIT.EDU (Guest section DW)
Tue Feb 22 21:41:01 2000
Message-ID: <20000222212936.A1097@win.tue.nl>
Date: Tue, 22 Feb 2000 21:29:36 +0100
From: Guest section DW <dwguest@win.tue.nl>
To: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: ma@dt.e-technik.uni-dortmund.de (Matthias Andree),
linux-scsi@vger.rutgers.edu
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <E12NJRx-0002EF-00@the-village.bc.nu>; from Alan Cox on Tue, Feb 22, 2000 at 05:50:14PM +0000
On Tue, Feb 22, 2000 at 05:50:14PM +0000, Alan Cox wrote:
> > Usually I disable the part of the error handling code that tries
> > to do bus / host resets for precisely the reason you mention:
> > these resets will kill a well-functioning system
> > that has one bad SCSI device.
>
> Do you think a SCSI blacklist entry for 'don't bus reset a bus with
> this piece of junk on' would help ?
No. The problem is not the device, it is our code.
I have seen this with disks that got a bad block,
with a disk that had a head crash, with bad CDROMs,
with a bad tape, with a scanner.
Only in the last case was the device itself a piece of junk.
If you read scsi_error.c, the philosophy is: "Something went
wrong, what can we do to get it working again?"
And increasingly powerful measures are taken.
But I do not need a bus reset or host reset when the CDROM drive
times out on a marginal CD or when some disk stops functioning.
Indeed, disk errors hardly ever occur. And when they do,
I do not want to touch that drive anymore. I do not want
to get it functioning again. The system should leave it alone
and leave messages in the log, so that I can attempt to rescue
the contents later before discarding the disk.
About a year ago I lost a disk and the log showed long
continuous beating on the same disk area.
MEDIUM ERROR - TIMEOUT - ABORT - RESET - etc etc
When I got to this machine the next morning, the disk was
too hot to touch, and did not react to anything anymore.
With ext2 we have choices like panic on error / read-only on error /
continue on error. Similarly we could add per-device scsi choices:
on error, leave device alone / on error, beat device into submission.
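
To make the per-device idea concrete, here is a minimal sketch in
Python, not kernel code. Everything in it is an assumption made for
illustration: the names Step, OnError, MAX_STEP and handle_error are
hypothetical, the escalation ladder is a simplified picture of what
scsi_error.c does, and letting "leave alone" still abort the failed
command is my own guess at a sensible limit.

```python
from enum import Enum, IntEnum


class Step(IntEnum):
    """Recovery steps in order of increasing severity (simplified)."""
    RETRY = 0
    ABORT = 1
    DEVICE_RESET = 2
    BUS_RESET = 3
    HOST_RESET = 4


class OnError(Enum):
    """Hypothetical per-device policy, analogous to the ext2 error choices."""
    LEAVE_ALONE = "leave-alone"  # log the error, never reset anything
    RECOVER = "recover"          # escalate through every step


# Most severe step each policy may reach (assumed limits).
MAX_STEP = {
    OnError.LEAVE_ALONE: Step.ABORT,
    OnError.RECOVER: Step.HOST_RESET,
}


def handle_error(policy, step_succeeds):
    """Try recovery steps in order, stopping at the policy's limit.

    step_succeeds is a callable that takes a Step and returns True when
    that step brought the device back. Returns (steps_tried, outcome),
    where outcome is "recovered" or "offline".
    """
    tried = []
    for step in Step:
        if step > MAX_STEP[policy]:
            break
        tried.append(step)
        if step_succeeds(step):
            return tried, "recovered"
    return tried, "offline"


def dead_device(step):
    """A device that no recovery step can bring back."""
    return False


if __name__ == "__main__":
    for policy in OnError:
        tried, outcome = handle_error(policy, dead_device)
        print(policy.value, [s.name for s in tried], outcome)
```

With a dead device, the "leave alone" policy stops after the abort and
marks the device offline, so the other devices on the bus never see a
reset. The "recover" policy walks the whole ladder first.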
[But this is an old discussion. Maybe I am alone in the view
that bus resets are terrible. At least Eric seems to think
that the probability that a reset achieves something useful
is larger than the probability that it only makes the situation worse.
I think that bus resets should be initiated by a human only.]
Andries