[8172] in linux-scsi channel archive


Re: recovery behaviour with 1 bad + 1 good drive (aic7xxx)

daemon@ATHENA.MIT.EDU (Ishikawa)
Wed Feb 23 01:00:02 2000

Message-ID: <38B36A28.645716E7@yk.rim.or.jp>
Date:   Wed, 23 Feb 2000 14:03:36 +0900
From: Ishikawa <ishikawa@yk.rim.or.jp>
MIME-Version: 1.0
To: Guest section DW <dwguest@win.tue.nl>
Cc: linux-scsi@vger.rutgers.edu
Content-Type: text/plain; charset=iso-8859-15
Content-Transfer-Encoding: 7bit

Guest section DW wrote:

> On Tue, Feb 22, 2000 at 05:50:14PM +0000, Alan Cox wrote:
>
> > > Usually I disable the part of the error handling code that tries
> > > to do bus / host resets for precisely the reason you mention:
> > > these resets will kill a well-functioning system
> > > that has one bad SCSI device.
> >
> > Do you think a SCSI blacklist entry for 'don't bus reset a bus with
> > this piece of junk on' would help ?
>
> No. The problem is not the device, it is our code.
>
> I have seen this with disks that got a bad block,
> with a disk that had a head crash, with bad CDROMs,
> with a bad tape, with a scanner.
> Only in the last case the device itself was a piece of junk.
>
>

I personally have seen it with a failing hard disk
and, as many of you will recall, with a multi-LUN Nakamichi
MBR-7 CD changer.

>
>
> [But this is an old discussion. Maybe I am alone with the point of
> view that bus resets are terrible. At least Eric seems to think
> that the probability that something useful is achieved by a reset
> is larger than the probability that the situation only gets worse.
> I think that bus resets should be initiated by a human only.]
>

I would think that a bus RESET really ought to be saved for a
real emergency, and that other devices that are not
malfunctioning should be left alone whenever possible.

For comparison, this is how the
somewhat old SunOS 4.1.4 handled this type of SCSI error.

I have seen old SunOS 4.1.4 on SPARC hardware
handle such hard disk errors, tape drive misbehavior
(probably a dirty head), and other SCSI problems in a graceful manner.
I say "graceful" in the sense that it handled the error without
impacting the overall operation of the system, if it could.
(Of course, an error on the swap device or a hosed "/" partition is
not easy to handle even for SunOS 4.1.4. But if the misbehaving
device is used only by a user process or two, why impact the
rest of the processes that could go on running?)

The SunOS log has shown many types of SCSI problems in the past:

 - A read/write error on a certain block that initially failed, but
   was retried and succeeded. It showed the absolute block number, and
   if you see such errors in a short span of time on the same block number,
   it is time to go and fetch the spare drive.
   (This would be nice to have on Linux. It is probably there already, but
   the message ought to be easy to understand for non-SCSI gurus.
   The one from SunOS 4.1.4 is rather easy to understand.)

 - A tape drive being reset (or something similar) after an excessive number
   of read verify errors. (My memory is hazy on this particular
   type of message, so the cause could be different.)

 - The SCSI bus being slowed down due to certain errors. Again, I forget
   what caused this type of message, but there were occasions
   when SunOS reported that the transfer to certain
   devices was downgraded to a slower speed after seeing certain
   types of errors. I think this DOES involve a RESET.
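
To make the first item in the list concrete, here is a minimal sketch in
Python of the bookkeeping it implies: count recovered (retried and then
successful) errors per block, and warn when the same block needs a retry
several times within a short span. This is only an illustration, not how
SunOS or Linux does it. The class name, the threshold of 3, the one-hour
window, and the wording of the message are all assumptions of mine.

```python
from collections import defaultdict, deque


class RecoveredErrorTracker:
    """Hypothetical helper: tracks recovered errors per (device, block)."""

    def __init__(self, threshold=3, window_seconds=3600.0):
        # threshold and window_seconds are illustrative defaults only.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # (device, block) -> timestamps

    def record(self, device, block, timestamp):
        """Record one recovered error; return a warning string or None."""
        events = self._events[(device, block)]
        events.append(timestamp)
        # Forget events that have fallen out of the time window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        if len(events) >= self.threshold:
            return (
                f"{device}: block {block} needed a retry {len(events)} times "
                f"within {self.window_seconds:.0f} s; consider replacing the drive"
            )
        return None
```

A caller would invoke `record()` each time a retried command finally
succeeds, and log the returned string, if any, in plain language that a
non-SCSI guru can act on.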

I understand that the SCSI driver in Free Solaris 7 for x86 is now similar
to the ones used in SPARC Solaris, and I found indeed that the
failing disk that could not be handled well in early versions of Linux,
due to excessive RESETs, could be handled gracefully on Free Solaris 7
for x86.

Anyway, it would be nice to see Linux's SCSI system become more robust
in handling these exceptional cases.
They are exceptional indeed, but when they happen, we are
often in a mess.
I would rather see the system run at a reduced SCSI bus
speed if necessary, and/or a process or two hang,
than face a non-working system that does not respond to the keyboard
in the morning.
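
The preference expressed here can be sketched as an escalation ladder in
which every automatic step affects only the failing device, and a
whole-bus reset happens only when a human asks for it. The Python below is
a rough illustration under my own assumptions. The step names, their
order, and the `operator_allows_bus_reset` flag are hypothetical and do
not describe the actual Linux or SunOS error handlers.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry the command"
    ABORT = "abort the command"
    DEVICE_RESET = "reset only the failing device"
    SLOW_DOWN = "renegotiate a slower transfer speed"
    OFFLINE = "take the failing device offline"
    BUS_RESET = "reset the whole bus"


# Steps the system may take on its own; none of them touches other devices.
AUTOMATIC_LADDER = [
    Action.RETRY,
    Action.ABORT,
    Action.DEVICE_RESET,
    Action.SLOW_DOWN,
    Action.OFFLINE,
]


def next_action(failed_attempts, operator_allows_bus_reset=False):
    """Pick the next recovery step after `failed_attempts` failed steps."""
    if failed_attempts < 0:
        raise ValueError("failed_attempts must be non-negative")
    if failed_attempts < len(AUTOMATIC_LADDER):
        return AUTOMATIC_LADDER[failed_attempts]
    # The ladder is exhausted: keep the device offline unless a human
    # has explicitly approved a bus reset.
    return Action.BUS_RESET if operator_allows_bus_reset else Action.OFFLINE
```

With this policy the worst automatic outcome is that the bad device goes
offline and the processes using it hang or fail, while the rest of the
bus keeps working.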

If you could incorporate the features I mentioned above from
my experience with SunOS, it would be great.


Again, thank you for your SCSI work, folks.

Happy Hacking,

