[8151] in linux-scsi channel archive



Re: recovery behaviour with 1 bad + 1 good drive (aic7xxx)

daemon@ATHENA.MIT.EDU (Ricky Beam)
Tue Feb 22 12:54:22 2000

Date:   Tue, 22 Feb 2000 11:20:12 -0500 (EST)
From: Ricky Beam <jfbeam@bluetopia.net>
To: Matthias Andree <ma@dt.e-technik.uni-dortmund.de>
Cc: linux-scsi@vger.rutgers.edu
In-Reply-To: <m34sb1cof4.fsf@emma1.emma.line.org>
Message-ID: <Pine.LNX.4.04.10002221101160.12259-100000@beaker>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On 22 Feb 2000, Matthias Andree wrote:
>What we see is an endless series of bus resets due to timeouts, domain
>revalidations and aborts on the bus, bringing the intact drive away from
>operation as well.

That's what you should see.  The driver has no way of knowing the drive
has actually failed and thus will continue sending it commands.

>Later, I tried to analyze the drive using sformat. I had to abandon the
>check since testing /dev/sdb was bringing /dev/sda down again. 

You'll have to check it from the BIOS.  OSes are very non-friendly
towards bad drives.  I can send you a "quick and dirty" low-level
format utility if you want.  (I wrote it a few years ago to erase
all the drives in an eval NetApp prior to shipping it back. :-))

>I find that unbearable. Is there any way to prevent that a defective
>drive brings down the entire bus, that it is degraded and its
>performance fucked beyond all limits? 

Assuming the drive is target 3 on scsi bus 0:
  echo "scsi remove-single-device 0 0 3 0" > /proc/scsi/scsi

I'm not sure what happens if you do that to a mounted drive.  I've never
tried that :-)  (And before we start another screaming session, I have
drive cages DESIGNED for hot-plugging -- and the SCSI-3 spec clearly
sets the rules for doing this.)
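
The four numbers in that command are host, channel, target id and LUN, in that order. A minimal sketch of wrapping the removal and the matching "add-single-device" request, assuming a kernel that still provides the legacy /proc/scsi/scsi interface. The function names and the SCSI_PROC variable are hypothetical helpers, not part of any existing tool:

```shell
#!/bin/sh
# Hypothetical helpers around the legacy /proc/scsi/scsi interface.
# SCSI_PROC can be overridden to try the commands against a scratch file.
SCSI_PROC="${SCSI_PROC:-/proc/scsi/scsi}"

# Build the command string: scsi <action>-single-device <host> <channel> <id> <lun>
scsi_cmd() {
    action="$1"; host="$2"; channel="$3"; id="$4"; lun="$5"
    printf 'scsi %s-single-device %s %s %s %s\n' \
        "$action" "$host" "$channel" "$id" "$lun"
}

# Detach one device from the SCSI layer (needs root on a real system).
scsi_remove() {
    scsi_cmd remove "$@" > "$SCSI_PROC"
}

# Ask the kernel to probe for the device again, e.g. after swapping the drive.
scsi_add() {
    scsi_cmd add "$@" > "$SCSI_PROC"
}

# Example: target 3 on host 0, channel 0, LUN 0
# scsi_remove 0 0 3 0
# scsi_add    0 0 3 0
```

Running "cat /proc/scsi/scsi" first lists the attached devices with their host, channel, id and LUN, which is the safest way to confirm the numbers before removing anything.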

>There is no hope of bringing unflushed buffers back to a broken disk, it
>would be no good anyhow.

There are a few (one?) ioctls for clearing all the buffers for a device.
"hdparm" uses one and often destroys hard drives because of it :-)

>It is unbearable it takes something like 75 minutes to get the system
>down to reboot. 

Magic Sysrq -- [alt]-[sysrq]-[s]: Try to sync, [alt]-[sysrq]-[b]: Reboot NOW
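
Without a keyboard on the box, the same two SysRq actions can be requested from a shell. This is a sketch under assumptions: it needs a kernel built with Magic SysRq support that also provides /proc/sysrq-trigger, which not every kernel does. SYSRQ_TRIGGER, SYSRQ_DELAY and both function names are hypothetical:

```shell
#!/bin/sh
# Hypothetical wrapper; override SYSRQ_TRIGGER to test against a scratch file.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
# Seconds to wait after requesting the sync, since the sync runs asynchronously.
SYSRQ_DELAY="${SYSRQ_DELAY:-5}"

# Send one SysRq key; only the two keys from the message above are allowed.
sysrq() {
    case "$1" in
        s|b) printf '%s\n' "$1" > "$SYSRQ_TRIGGER" ;;
        *)   echo "unsupported sysrq key: $1" >&2; return 1 ;;
    esac
}

# Try to sync, give the sync a moment, then reboot immediately.
emergency_reboot() {
    sysrq s || return 1
    sleep "$SYSRQ_DELAY"
    sysrq b
}

# Example (as root, and only when a clean shutdown is hopeless):
# emergency_reboot
```

The delay between the two writes matters for the same reason you wait for the sync message on the console before pressing [b]: the reboot does not wait for the sync to finish.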

>I am annoyed by that Adaptec junk. Be it the 1542CF, be it the
>2940UW. Expensive and useless if something goes wrong.

Expensive: Yes.  But ALL controllers will have a problem with bad/broken
drives.  The controller sent the drive a command that the drive never
completed...

>Any way to prevent this that I have not seen? No, buying RAIDs is not an
>option. 

Why, some RAID controllers are cheaper than Adaptec SCSI cards :-)

...

You have a broken drive on your bus. REMOVE IT.  It doesn't matter if you
send it commands; it can still mess with the bus.  I have an IBM drive
whose mere presence on the bus crashes my Mylex RAID controller.  (Mylex
only points to the "Approved Drive List" and says the drive isn't certified.
No shit!  They only list _9_, yes NINE, drives certified for use with
that controller.)

--Ricky

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.rutgers.edu