[8192] in linux-scsi channel archive


Re: recovery behaviour with 1 bad + 1 good drive (aic7xxx)

daemon@ATHENA.MIT.EDU (Eric Youngdale)
Wed Feb 23 18:34:46 2000

Message-ID: <002501bf7e48$5fe743c0$940310ac@fairfax.datafocus.com>
From: "Eric Youngdale" <eric@andante.org>
To: "Matthias Andree" <ma@dt.e-technik.uni-dortmund.de>,
	"Ishikawa" <ishikawa@yk.rim.or.jp>
Cc: "Guest section DW" <dwguest@win.tue.nl>,
	<linux-scsi@vger.rutgers.edu>
Date:   Wed, 23 Feb 2000 16:53:00 -0500
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit


----- Original Message -----
From: Matthias Andree <ma@dt.e-technik.uni-dortmund.de>
To: Ishikawa <ishikawa@yk.rim.or.jp>
Cc: Guest section DW <dwguest@win.tue.nl>; <linux-scsi@vger.rutgers.edu>
Sent: Wednesday, February 23, 2000 12:30 PM
Subject: Re: recovery behaviour with 1 bad + 1 good drive (aic7xxx)


> > Anyway, it would be nice to see linux's SCSI system to become more robust
> > in terms of handling these exceptional cases.
> > These are exceptional indeed, but when they happen, we are
> > often in a mess.
>
> Still, it's a normal condition that a drive fails, be it that bad
> blocks are not transparently reassigned, be it that a CD-ROM is
> scratched.

    Sorry if someone has already said this - I am trying to pick up this
thread from the middle, and I don't quite know where this got started.

    It sounds like you were using the old error handling code.  This
infinite string of resets/aborts/whatnot that brings the system to its
knees is a common result when that code gets exercised.  Please note that
the old error handling code lives only in scsi_obsolete.c.  The new error
handling code lives only in scsi_error.c.

    The new error handling code takes a stab at recovery, and if recovery
fails, it puts the device in a state called "offline".  In this state, all
attempts to use the device will fail, and the system should once again be in
a fairly normal state (except for anyone trying to access the offline disk).
It doesn't force an unmount, however.  This is something I have been meaning
to do something about (i.e. provide a way to unmount and unload an offline
device).
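
    To make the "try recovery, then go offline" behaviour concrete, here is
a toy model in Python.  It is only a sketch and not the code in
scsi_error.c: the Device class, the recover() function and the three step
names (abort_command, reset_device, reset_bus) are hypothetical and chosen
for illustration.  The only behaviour taken from the description above is
that a failed recovery marks the device offline and that every later command
to an offline device fails.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Device:
    """Toy stand-in for a SCSI device; not a kernel structure."""
    name: str
    online: bool = True
    log: List[str] = field(default_factory=list)

    def submit(self, command: str) -> str:
        # Once offline, every command fails without touching the hardware.
        if not self.online:
            return "failed: device offline"
        return f"sent: {command}"


def recover(device: Device, steps: List[Callable[[Device], bool]]) -> bool:
    """Try each recovery step in order; go offline if none of them works."""
    for step in steps:
        device.log.append(step.__name__)
        if step(device):
            return True        # recovery worked, device stays online
    device.online = False      # recovery exhausted: stop sending commands
    return False


# Hypothetical recovery steps, modelling a device that never responds.
def abort_command(device: Device) -> bool:
    return False


def reset_device(device: Device) -> bool:
    return False


def reset_bus(device: Device) -> bool:
    return False


if __name__ == "__main__":
    cdrom = Device("cdrom0")
    recover(cdrom, [abort_command, reset_device, reset_bus])
    print(cdrom.online)
    print(cdrom.submit("READ"))
```

    Note that nothing in this model unmounts anything: a filesystem sitting
on the device would simply see every request fail, which matches the
limitation mentioned above.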

    One test that I run from time to time is to tar up a CDROM with some
bad sectors (either that, or there is dirt on the lens); in any case I get
slews of medium errors from the drive.  With the new error handling code it
eventually wades through most of the disc, but at some point things get so
bad that commands start timing out.  It then starts in on the abort/reset
sequence, which ultimately fails, so it takes the device offline.  After
this the system is back to normal.

    I *believe* that a medium error by itself should never result in the
device being taken offline - the offline state implies that the device
itself is in a wonky state and it isn't safe to send it any more commands.
As long as the device answers (even if you don't like the result), the
error handling code shouldn't do much of anything other than perhaps retry
a limited number of times.
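
    The distinction I have in mind can be written down as a small decision
function.  This is a hedged sketch in Python, not the kernel's logic: the
Outcome and Action enums, the decide() function and the MAX_RETRIES value
of 3 are all hypothetical.  The point is only that a device which answers
with a medium error gets a bounded number of retries and then a failed
command, while a device which does not answer at all is handed to recovery.

```python
from enum import Enum


class Outcome(Enum):
    GOOD = "good"
    MEDIUM_ERROR = "medium error"   # device answered, but the media is bad
    TIMEOUT = "timeout"             # device did not answer at all


class Action(Enum):
    COMPLETE = "complete"
    RETRY = "retry"
    FAIL_COMMAND = "fail command"       # device stays online
    START_RECOVERY = "start recovery"   # abort/reset; may end in offline


MAX_RETRIES = 3  # hypothetical limit, chosen only for this example


def decide(outcome: Outcome, retries_so_far: int,
           max_retries: int = MAX_RETRIES) -> Action:
    """Pick the next action for one command, given how it ended."""
    if outcome is Outcome.GOOD:
        return Action.COMPLETE
    if outcome is Outcome.MEDIUM_ERROR:
        # The device is still talking to us, so never escalate from here.
        if retries_so_far < max_retries:
            return Action.RETRY
        return Action.FAIL_COMMAND
    # No answer from the device: only this path can lead to offline.
    return Action.START_RECOVERY


if __name__ == "__main__":
    for tries in range(MAX_RETRIES + 1):
        print(tries, decide(Outcome.MEDIUM_ERROR, tries).value)
    print(decide(Outcome.TIMEOUT, 0).value)
```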

-Eric



