[685] in linux-scsi channel archive
Serious problem with SCSI error handling
daemon@ATHENA.MIT.EDU (Jurgen Botz)
Wed Oct 11 23:12:44 1995
To: Dave Andersen <angio@aros.net>
cc: ncr53c810@mroe.cs.colorado.edu, linux-scsi@vger.rutgers.edu
In-reply-to: Your message of "Tue, 10 Oct 1995 15:17:03 MDT."
<199510102117.PAA06272@terra.aros.net>
Date: Wed, 11 Oct 1995 12:32:55 -0400
From: Jurgen Botz <jbotz@orixa.mtholyoke.edu>
Dave Andersen wrote:
>>[I wrote of strange SCSI errors with a set of brand new SCSI drives that
>> are rendering the drives unusable.]
>
> Drew replied that you've got bad blocks on the drives, and I think
> I'll have to disagree.
> [Similar story to mine deleted]
> (Because it wasn't just with the NCR driver,
> I suspect the problem is somewhere else in linux's scsi handling).
Uh-huh. I've done more thorough tests... I hooked the disks
up to a Mac and ran a SCSI utility called "Silver Lining" to do a
surface analysis. The drives that Linux chokes on do have some
errors, but minor ones... Silver Lining wouldn't even map out the
blocks because it could recover from the errors. In short, there
are problems, but the disks should be usable. Instead, once Linux
hits one of the bad blocks it does not recover, and furthermore it
then starts getting I/O errors on good blocks as well.
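The "errors spread to good blocks" behaviour can be checked from user space by reading the device one block at a time and recording which reads fail. The following is only a minimal sketch of such a check, not the test described above: the function name `scan_blocks`, the 512-byte block size, and the caller-supplied block count are all assumptions made for illustration. Reads also go through the kernel's buffer cache, so a repeated scan may not touch the disk again.

```python
import os

def scan_blocks(path, block_count, block_size=512):
    """Read up to block_count blocks from path, one block at a time.

    Returns (blocks_read, failures), where failures is a list of
    (block_index, errno) pairs for reads that raised OSError.
    The name and parameters are hypothetical, chosen for this sketch.
    """
    failures = []
    blocks_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        for index in range(block_count):
            try:
                # Seek explicitly so a failed read cannot leave the
                # file offset in an unknown position.
                os.lseek(fd, index * block_size, os.SEEK_SET)
                data = os.read(fd, block_size)
            except OSError as exc:
                # Record the failure and carry on with the next block.
                failures.append((index, exc.errno))
                continue
            if not data:
                # Empty read: end of the device or file.
                break
            blocks_read += 1
    finally:
        os.close(fd)
    return blocks_read, failures
```

Run against a raw disk device (as root), the failure list from a first pass can be compared with a surface scan from another tool: blocks that fail here but were reported clean elsewhere would support the suspicion that one error leaves the driver in a bad state.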
Conclusion: the SCSI code has a bug (or bugs) that throws things into
a bad state on certain disk errors that it *should* be able to recover
from. This does not appear to be in the NCR driver, but more likely
in the higher level SCSI disk code, since Dave saw the same problem
with a different controller. The problem seems to exist in kernel
versions at least 1.2.x through 1.3.32.
This is very bad for me... like Dave I am trying to set up a news
server. I really want to use Linux, as I believe that ext2fs is a big
win for a news server, and also because I love Linux, the Linux
community, and the GPL. But unless I can resolve this fairly quickly
I shall have to use FreeBSD or BSD/OS. As a System Administrator for a
fairly large site I want to move to an OS to which I have full source
for all server applications, and I had hoped that Linux was ready
to be put into this service, but now I have serious doubts. I can get
my disks replaced with error-free ones, but there's no guarantee that
they'll stay error-free and the OS /must/ be able to handle minor
errors more gracefully than what I'm seeing now.
I'm willing to put a fair amount of time into tracking down and debugging
this, but having no experience with SCSI driver code I doubt I could get
very far by myself. Furthermore I don't know if this is a simple bug or
a major design flaw. If any of the SCSI experts out there would like
to work with me I would be delighted.