[722] in linux-scsi channel archive
Re: Serious problem with SCSI error handling
daemon@ATHENA.MIT.EDU (Eric Youngdale)
Thu Oct 26 07:44:50 1995
From: "Eric Youngdale" <eric@aib.com>
Date: Wed, 25 Oct 1995 19:38:36 -0400
In-Reply-To: Jurgen Botz <jbotz@orixa.mtholyoke.edu>
"Serious problem with SCSI error handling" (Oct 11, 12:32pm)
To: Jurgen Botz <jbotz@orixa.mtholyoke.edu>, Dave Andersen <angio@aros.net>
Cc: ncr53c810@mroe.cs.colorado.edu, linux-scsi@vger.rutgers.edu
>Conclusion: the SCSI code has a bug (or bugs) that throw things into
>a bad state on certain disk errors that it *should* be able to recover
>from. This does not appear to be in the NCR driver, but more likely
>in the higher level SCSI disk code, since Dave saw the same problem
>with a different controller. The problem seems to exist in kernel
>versions at least 1.2.x through 1.3.32.
I would beg to differ. The mid level code has all sorts of
checks so that it will retry commands that time out and attempt to abort or
reset if the bus appears to be hung. Unfortunately, the low level
driver needs to know what to do when this sort of request comes along.
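To make the escalation above concrete, here is a rough sketch of the decision
order: retry on timeout, then abort, then reset. This is not the actual
mid level code. Every name in it (cmd_result, recovery, next_step,
MAX_RETRIES) is hypothetical, and the retry limit is an assumption chosen
only for illustration.

```c
/* Hypothetical outcome of one attempt at a command (not kernel definitions). */
enum cmd_result { CMD_OK, CMD_TIMEOUT, CMD_ERROR };

/* Hypothetical recovery steps, roughly in order of increasing severity. */
enum recovery { REC_NONE, REC_REPORT, REC_RETRY, REC_ABORT, REC_RESET, REC_FAIL };

#define MAX_RETRIES 3 /* assumed retry limit, for illustration only */

/*
 * Pick the next recovery step after one attempt at a command.
 *   retries      retries already made for this command
 *   abort_tried  nonzero once an abort has been requested
 *   reset_tried  nonzero once a reset has been requested
 * The abort and reset steps only help if the low level driver
 * actually implements them.
 */
enum recovery next_step(enum cmd_result res, int retries,
                        int abort_tried, int reset_tried)
{
    if (res == CMD_OK)
        return REC_NONE;      /* nothing to recover from */
    if (res == CMD_ERROR)
        return REC_REPORT;    /* command returned an error: pass it upward */
    if (retries < MAX_RETRIES)
        return REC_RETRY;     /* timed out: try the command again */
    if (!abort_tried)
        return REC_ABORT;     /* still timing out: ask the driver to abort */
    if (!reset_tried)
        return REC_RESET;     /* abort did not help: ask for a reset */
    return REC_FAIL;          /* nothing left to try */
}
```

If the low level driver does not handle the abort or reset request properly,
a scheme like this falls through to the last case no matter how careful the
checks above it are.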
Whether an error can be recovered from depends upon what is
wrong. I have seen bugs in firmware cause hard crashes of a 1542 that
force me to power cycle the machine to get it up again. There are
also cases where the firmware on the drive itself crashes, and this
also is difficult to recover from.
Most people do not experience these problems - usually the
drive will remap sectors automatically for you if it can and if this feature
is enabled. In other cases, the drive reports a bad sector, and the
filesystem or whatever must deal with it somehow.
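For anyone who wants to check whether automatic remapping is enabled, the
setting lives in the drive's Read-Write Error Recovery mode page (page code
01h), where the AWRE and ARRE bits control automatic reallocation on write
and on read. The sketch below only decodes those two bits from a page that
has already been fetched with MODE SENSE. The struct and function names are
hypothetical, and it assumes the buffer starts at the page code byte rather
than at the mode parameter header.

```c
#include <stddef.h>

#define RW_ERR_RECOVERY_PAGE 0x01 /* Read-Write Error Recovery page code */
#define AWRE_BIT 0x80             /* byte 2, bit 7: auto reallocate on write */
#define ARRE_BIT 0x40             /* byte 2, bit 6: auto reallocate on read */

/* Hypothetical holder for the two remapping flags. */
struct remap_flags {
    int awre;
    int arre;
};

/*
 * Decode AWRE/ARRE from a raw mode page.
 * page must point at the page code byte (assumption: any mode parameter
 * header and block descriptors have already been skipped).
 * Returns 0 on success, -1 if the buffer is not a usable page 01h.
 */
int parse_remap_flags(const unsigned char *page, size_t len,
                      struct remap_flags *out)
{
    if (page == NULL || out == NULL || len < 3)
        return -1;
    /* The low 6 bits of byte 0 hold the page code. */
    if ((page[0] & 0x3f) != RW_ERR_RECOVERY_PAGE)
        return -1;
    out->awre = (page[2] & AWRE_BIT) != 0;
    out->arre = (page[2] & ARRE_BIT) != 0;
    return 0;
}
```

Both flags coming back as 0 would mean the drive reports bad sectors instead
of remapping them on its own.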
>I'm willing to put a fair amount of time into tracking-down/debugging
>this, but having no experience with SCSI driver code I doubt I could get
>very far by myself. Furthermore I don't know if this is a simple bug or
>a major design flaw. If any of the SCSI experts out there would like
>to work with me I would be delighted.
To begin with, I need to know the actual failure mechanism.
Are the commands timing out, or is there some error condition returned
from the commands? We can take it from there.
-Eric
--
"The woods are lovely, dark and deep. But I have promises to keep,
And lines to code before I sleep, And lines to code before I sleep."