[686] in linux-scsi channel archive
Re: Serious problem with SCSI error handling
daemon@ATHENA.MIT.EDU (Dave Platt)
Thu Oct 12 00:03:38 1995
Date: Wed, 11 Oct 95 13:10:44 PDT
From: dplatt@3do.com (Dave Platt)
To: jbotz@orixa.mtholyoke.edu
Cc: ncr53c810@mroe.cs.colorado.edu, linux-scsi@vger.rutgers.edu
> Uh-huh. I've done more thorough tests... I hooked the disks
> up to a Mac and ran a SCSI utility called "Silver Lining" to do a
> surface analysis. The drives that Linux chokes on do have some
> errors, but minor ones... Silver Lining wouldn't even map out the
> blocks because it could recover from the errors.
My personal opinion: any blocks which are even marginally bad should
be reassigned. At best, you're risking a degradation in drive
performance, since the drive may need to re-read a block several times
to get clean data. In the intermediate case, you're risking an even worse
degradation in drive performance, if the Linux driver has to re-issue
the read/write command several times (and the driver will probably spit
a bunch of error messages to the log and/or console when this
happens). At worst, you risk the loss of data, if the marginal
block becomes less reliable and the drive's ECC can no longer recover
data from it.
> In short, there
> are problems, but the disks should be useable. Instead once Linux
> hits one of the bad blocks it does not recover and furthermore it
> then starts getting I/O errors on good blocks as well.
>
> Conclusion: the SCSI code has a bug (or bugs) that throw things into
> a bad state on certain disk errors that it *should* be able to recover
> from.
Hmmm. The code in scsi.c does allow for retries - in the case of operations
to SCSI disks, it looks as if 5 retries are permitted (according to the
MAX_RETRIES value in sd.c).
In scsi.c, there's a bunch of code in the check_sense() subroutine which
maps the drive's sense code into a suggested recovery action. The three
interesting ones seem to be SUGGEST_RETRY and SUGGEST_REMAP (which result
in a MAYREDO action, which will reissue the command unless the retry
count is exhausted) and SUGGEST_ABORT (which kills the command).
If the drive returns a sense code of MEDIUM_ERROR, then the check_sense
routine will SUGGEST_REMAP, and this will result in the command being
retried. If the drive returns a sense code of HARDWARE_ERROR, then
check_sense will SUGGEST_ABORT.
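To make the mapping described above concrete, here is a minimal sketch of it. This is not the code from scsi.c: the SKETCH_ names, the numeric action values, and the default case are all assumptions made for illustration. Only the two sense-key values (0x03 for MEDIUM_ERROR, 0x04 for HARDWARE_ERROR) come from the SCSI standard.

```c
/* Sense keys named in the text; values are the standard SCSI sense key codes. */
enum sketch_sense_key {
    SKETCH_MEDIUM_ERROR   = 0x03,
    SKETCH_HARDWARE_ERROR = 0x04
};

/* Illustrative action codes; the numeric values are NOT the kernel's. */
enum sketch_suggestion {
    SKETCH_SUGGEST_RETRY,
    SKETCH_SUGGEST_REMAP,
    SKETCH_SUGGEST_ABORT
};

/* Map a sense key to a suggested action, following the description above. */
enum sketch_suggestion sketch_check_sense(int sense_key)
{
    switch (sense_key) {
    case SKETCH_MEDIUM_ERROR:
        return SKETCH_SUGGEST_REMAP;
    case SKETCH_HARDWARE_ERROR:
        return SKETCH_SUGGEST_ABORT;
    default:
        /* Assumption: anything not covered in the text is treated as fatal. */
        return SKETCH_SUGGEST_ABORT;
    }
}

/* RETRY and REMAP both lead to a MAYREDO action; ABORT does not. */
int sketch_is_mayredo(enum sketch_suggestion s)
{
    return s == SKETCH_SUGGEST_RETRY || s == SKETCH_SUGGEST_REMAP;
}
```

With this model, sketch_is_mayredo(sketch_check_sense(SKETCH_MEDIUM_ERROR)) is true, so a medium error is reissued, while a hardware error is not.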
If you have a bad block which cannot be re-read successfully after 5 retries,
then the retry logic will bail, and the command will fail.
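The bounded-retry behaviour can be sketched as a user-space loop. This is a rough model, not the kernel's logic: run_with_retries(), issue_fn, and the flaky_block helper are hypothetical names, and counting every issue of the command (including the first) against the MAX_RETRIES budget is an assumption about how the limit is applied.

```c
#include <stddef.h>

#define MAX_RETRIES 5   /* the limit quoted above for sd.c */

/* Hypothetical callback: returns 0 on success, non-zero on a retryable error. */
typedef int (*issue_fn)(void *ctx);

/*
 * Issue the command until it succeeds or the budget is used up.
 * Returns 0 on success, -1 if every attempt failed.
 * *attempts receives the number of times the command was issued.
 */
int run_with_retries(issue_fn issue, void *ctx, int *attempts)
{
    int n;

    for (n = 1; n <= MAX_RETRIES; n++) {
        if (issue(ctx) == 0) {
            *attempts = n;
            return 0;
        }
    }
    *attempts = MAX_RETRIES;
    return -1;
}

/* Demo helper: a "block" that fails a fixed number of times, then reads fine. */
struct flaky_block {
    int failures_left;
};

int flaky_issue(void *ctx)
{
    struct flaky_block *b = ctx;

    if (b->failures_left > 0) {
        b->failures_left--;
        return 1;
    }
    return 0;
}
```

A block that needs three re-reads succeeds on the fourth issue in this model; one that never reads cleanly uses up the budget and the command fails.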
I suggest that you enable some of the debugging logic in check_sense(), and
perhaps add some more, so that you can trace the actual sequence of sense
codes received and the SUGGEST_xxx actions recommended by check_sense().
You could also try upping the MAX_RETRIES limit in sd.c to 25 or so,
to see if this helps get past the bad spots.
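As a rough idea of what extra tracing might record, the sketch below pulls the sense key out of a fixed-format sense buffer (the low four bits of byte 2, per the SCSI standard) and formats one trace line. The function names and the message format are made up for illustration, and it uses user-space snprintf() rather than the kernel's own logging.

```c
#include <stdio.h>
#include <stdint.h>

/* The sense key lives in the low nibble of byte 2 of fixed-format sense data. */
int sketch_sense_key_of(const uint8_t *sense)
{
    return sense[2] & 0x0F;
}

/*
 * Format one trace line into out. The suggestion is passed in as text so
 * this helper stays independent of any particular set of action codes.
 * Returns the value returned by snprintf().
 */
int sketch_format_trace(char *out, size_t outlen,
                        const uint8_t *sense, const char *suggestion)
{
    return snprintf(out, outlen, "check_sense: key=0x%x suggest=%s",
                    (unsigned)sketch_sense_key_of(sense), suggestion);
}
```

Logging a line like this on every pass would show whether the failing commands see the same sense key each time or whether the key changes as retries proceed.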
There is another possibility here, which might be worth investigating.
Many SCSI disk drives can be configured to deal with recoverable errors
in either of two ways: [1] Do the necessary retries automatically, and
don't tell the host anything about them, or [2] Do the retries, return the
data, but report a completion error status which means "Hey, it took me several
tries and/or the use of ECC to get this data."
It is possible that your drives are configured (in their MODE SELECT pages)
to implement method 2. If so, it's possible that the Linux code is not
dealing gracefully with this situation. I don't _think_ this is the case...
the check_sense() routine responds to a RECOVERED_ERROR condition with
a SUGGEST_IS_OK, and this should result in a status=FINISHED and a
completion of the command.
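One way to tell which of the two methods a drive uses is to look at its Read-Write Error Recovery mode page (page code 0x01). In the SCSI-2 layout, the PER ("post error") bit is bit 2 of byte 2 of that page, and when it is set the drive reports recovered errors to the host. The sketch below only decodes a page that has already been fetched. How the page is read from the drive is left out, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_RW_ERROR_RECOVERY 0x01  /* mode page code */
#define PER_BIT                0x04  /* bit 2 of byte 2: post recovered errors */

/*
 * page points at the start of the mode page itself (page code byte first),
 * not at the mode parameter header that precedes it in a MODE SENSE reply.
 * Returns 1 if the drive reports recovered errors (method 2),
 *         0 if it hides them (method 1),
 *        -1 if the buffer is not a Read-Write Error Recovery page.
 */
int sketch_reports_recovered_errors(const uint8_t *page, size_t len)
{
    if (page == NULL || len < 3)
        return -1;
    /* The top two bits of byte 0 are not part of the page code. */
    if ((page[0] & 0x3F) != PAGE_RW_ERROR_RECOVERY)
        return -1;
    return (page[2] & PER_BIT) ? 1 : 0;
}
```

If this returns 1 for the problem drives, the RECOVERED_ERROR path is actually being exercised and is worth tracing.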
Enable some of the debug stuff in scsi_done() in scsi.c, and see what
turns up...