[8188] in linux-scsi channel archive



Re: recovery behaviour with 1 bad + 1 good drive (aic7xxx)

daemon@ATHENA.MIT.EDU (Matthias Andree)
Wed Feb 23 13:59:14 2000

Date:   Wed, 23 Feb 2000 18:38:27 +0100
From: Matthias Andree <ma@dt.e-technik.uni-dortmund.de>
To: Ricky Beam <jfbeam@bluetopia.net>
Cc: linux-scsi@vger.rutgers.edu
Message-ID: <20000223183827.E4697@krusty.e-technik.uni-dortmund.de>
Mail-Followup-To: Ricky Beam <jfbeam@bluetopia.net>,
	linux-scsi@vger.rutgers.edu
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <Pine.LNX.4.04.10002221101160.12259-100000@beaker>; from jfbeam@bluetopia.net on Tue, Feb 22, 2000 at 11:20:12AM -0500

* Ricky Beam (jfbeam@bluetopia.net) [000222 17:20]:
> On 22 Feb 2000, Matthias Andree wrote:
> >What we see is an endless series of bus resets due to timeouts, domain
> >revalidations and aborts on the bus, bringing the intact drive away from
> >operation as well.
> 
> That's what you should see.  The driver has no way of knowing the drive
> has actually failed and thus will continue sending it commands.

Sure. It will continue sending commands, and it should. Commands fail
on bad blocks. So what? That is no reason for a bus reset.

> >Later, I tried to analyze the drive using sformat. I had to abandon the
> >check since testing /dev/sdb was bringing /dev/sda down again. 
> 
> You'll have to check it from the BIOS.  OSes are very non-friendly
> towards bad drives.  I can send you a "quick and dirty" low-level
> format utility if you want.  (I wrote it a few years ago to erase
> all the drives in an eval NetApp prior to shipping it back. :-))

I have enough tools to go on; I still don't see the justification for
messing with bus resets. The bus is fine, the drive is bad. You don't
stop a train because one wagon has no lights. You lock that wagon and
let the train continue.

> Assuming the drive is target 3 on scsi bus 0:
>   echo "scsi remove-single-device 0 0 3 0" > /proc/scsi/scsi

I tried that, but the drive stuck with me :-( 
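
For reference, the four numbers in that command are host, channel,
target ID and LUN, in that order. A minimal sketch that builds the
string and only prints it; the function name scsi_remove_cmd is
hypothetical, and nothing is written to /proc/scsi/scsi unless the
output is redirected there by root:

```shell
#!/bin/sh
# Hypothetical helper: build the /proc/scsi/scsi command string for
# removing one device. Arguments: host channel id lun.
scsi_remove_cmd() {
    if [ "$#" -ne 4 ]; then
        echo "usage: scsi_remove_cmd HOST CHANNEL ID LUN" >&2
        return 1
    fi
    # Reject anything that is not a plain non-negative integer.
    for n in "$@"; do
        case "$n" in
            ''|*[!0-9]*) echo "not a number: $n" >&2; return 1 ;;
        esac
    done
    printf 'scsi remove-single-device %s %s %s %s\n' "$1" "$2" "$3" "$4"
}

# Dry run for target 3 on host 0, channel 0, LUN 0:
scsi_remove_cmd 0 0 3 0
```

To apply it for real, redirect the output: scsi_remove_cmd 0 0 3 0 >
/proc/scsi/scsi. As reported in this thread, the request has no effect
while the drive is mounted.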

> I'm not sure what happens if you do that to a mounted drive.  I've never
> tried that :-)  (And before we start another screaming session, I have
> drive cages DESIGNED for hot-plugging -- and the SCSI-3 spec clearly
> sets the rules for doing this.)

It's ignored if the drive is mounted, and there is NO, ABSOLUTELY NO
way to get it unmounted or even remounted read-only manually. If the
recovery behaviour "unmount on errors" was not set before the trouble
started, you need to reboot.
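
Assuming the filesystem is ext2 (the thread does not say) and that the
behaviour meant is ext2's errors=remount-ro policy, it could be set in
advance roughly like this. The function name, /dev/sdb1 and /data are
placeholders, not values from this system:

```shell
#!/bin/sh
# Hypothetical helper: print an /etc/fstab line that mounts an ext2
# filesystem with the remount-read-only error policy.
# Arguments: device mountpoint (both chosen by the caller).
fstab_line_remount_ro() {
    if [ "$#" -ne 2 ]; then
        echo "usage: fstab_line_remount_ro DEVICE MOUNTPOINT" >&2
        return 1
    fi
    printf '%s\t%s\text2\tdefaults,errors=remount-ro\t1\t2\n' "$1" "$2"
}

# Example values only; /dev/sdb1 and /data are assumptions.
fstab_line_remount_ro /dev/sdb1 /data

# The same policy can be stored in the superblock instead (as root):
#   tune2fs -e remount-ro /dev/sdb1
```

The sketch only prints the line; adding it to /etc/fstab is left to
the administrator.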

> >There is no hope of bringing unflushed buffers back to a broken disk, it
> >would be no good anyhow.
> 
> There's a few (one?) ioctl for clearing all the buffers for a device.
> "hdparm" uses it and often destroys hard drives because of it :-)

I don't want to mess with buffers. I want a "force nuke drive from
system": force-unmount everything, kill the processes that tried to
access the drive, and remove the drive from the system.

> >It is unbearable it takes something like 75 minutes to get the system
> >down to reboot. 
> 
> Magic Sysrq -- [alt]-[sysrq]-[s]: Try to sync, [alt]-[sysrq]-[b]: Reboot NOW

That only works at the console. I was not there, and I could not have
been even if I had tried.

> You have a broken drive on your bus. REMOVE IT.  It doesn't matter if you
> send it commands; it can still mess with the bus.  

The logic is fine. The actual drive is failing. I WAS trying to remove
it, but I could not. Editing /etc/fstab took no time, but there was no
way to avoid the reboot needed to get the processes out of memory.


