[3036] in linux-scsi channel archive

Re: RAID & unhappy scsi driver

daemon@ATHENA.MIT.EDU (linas@linas.org)
Mon Jan 5 18:06:44 1998

To: dledford@dialnet.net (Doug Ledford)
Date: 	Mon, 5 Jan 1998 18:03:09 -0600 (CST)
Cc: linas@linas.org, linux-raid@vger.rutgers.edu, linux-scsi@vger.rutgers.edu,
        linux-eata@trudi.zdv.Uni-Mainz.DE, linas@fc.net
In-Reply-To: <XFMail.980105154739.dledford@dialnet.net> from "Doug Ledford" at Jan 5, 98 03:32:07 pm
From: linas@linas.org

It's been rumoured that Doug Ledford said:

> >This may be a case for maintaining a permanent initrd for mission-critical
> >servers.  Details elude me, but at least a kernel & some network utils 
> >should be kept there.

So you are saying an initrd would *not* save me?
> 
> To be quite honest, the failure mode you saw (the drive going totally out to
> lunch) isn't a real common error mode.  It usually indicates that the error
> on the drive itself was so severe that the firmware didn't know what to do
> or it is a bug in the firmware.  As I recall, you said this was a Seagate
> disk.  I'm a little surprised that Seagate, who usually has above par
> firmware, would have this problem.  But, this falls in the category of
> catastrophic memory bus failure.  Unless you build complete system
> redundancy, you won't get around this problem. 

Well, actually ... the reason I went with s/w raid is that I was 
hoping to build a dual-cpu system, with one scsi bus attached to
two servers.  I haven't yet gotten anywhere with this.

While I have your ear, a couple of quickie questions:

-- Having two cpu's accessing the bus at the same time should be OK, 
   as long as they are not accessing the same partitions, right?
   That is, the only reason to avoid dual access is to not
   mangle a file system, right?

-- the bus-busy signal wire can be cleared with a bus reset, right?
   I figure that if the main server goes down, its power will be 
   cut with some dead-man switch, and the other CPU can take over.
   I was not anticipating power-cycling the disk enclosure.
   
   I am concerned that the failure scenario above is capable of knocking 
   out both CPU's.

> >I'm wondering if it may be possible to keep a history of drive access
> >patterns shortly after a bus reset.  Through automatic analysis of the
> >failures and lessons learned through subsequent failures & resets, it
> >might be possible to locate the misbehaving drive.  I don't know enough
> >about the subsystem to understand how practical this would be; I have
> >written automatic error-recovery code several times, and in each case,
> >it snowballed into a highly intertwined mess that always seemed to have
> >some bug that got triggered less and less often with each patch.
> 
> This sounds about right.  However, as you mentioned in another email, it
> says something about linux that we now have to worry about these types of
> things.  It hasn't been something I've worried about in the past.  However,
> this issue has given me an idea.  In my current working code, I'm keeping
> track of the currently active commands on the bus.  It would be an almost
> trivial thing to add a small section of code so that on a second bus reset
> that appears to be a repeat issue, we switch from normal command mode to
> single command at a time mode (I already maintain my own internal queues,
> one for each device, one waiting queue, and one complete queue).  I could
> simply watch the number of globally active commands after a repeat reset
> condition, hold it down to one command at a time, and wait for the reset to
> occur again (given a command limit of course).  Once we get another reset,
> the only active command on the bus (and its target) should be the culprit,
> so I could specifically tell the mid level code that the command in question
> should be entirely aborted instead of re-tried.  Hell, with a single command
> at a time, I don't even need to wait for the reset, I can get pre-emptive on
> the abort cycle.  I could then set either a counter or a timer, and after x
> commands or x seconds, if nothing has happened again, start opening the bus
> back up to full speed.  That might allow me (and the mid level code) to
> track down the defective drive and get on with life.

That would be very nice!
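
To make sure I follow the scheme, here is a rough sketch of the throttling
logic as I read it: normal queueing until a second reset, then one command
at a time until either another reset fingers the lone active command or
enough commands complete quietly to open the bus back up.  This is only a
toy model in Python, not driver code; the names BusThrottle, max_active
and quiet_commands are made up for illustration, and the "x commands"
threshold is an arbitrary assumption.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Mode(Enum):
    NORMAL = auto()   # full queue depth
    SINGLE = auto()   # one command on the bus at a time


@dataclass
class BusThrottle:
    """Toy model of the repeat-reset throttling idea (all names hypothetical)."""
    max_active: int = 8         # assumed normal limit on active commands
    quiet_commands: int = 100   # assumed "x commands" before reopening the bus
    mode: Mode = Mode.NORMAL
    resets_seen: int = 0
    quiet_count: int = 0
    active: list = field(default_factory=list)    # (target, cmd) pairs on the bus
    suspects: list = field(default_factory=list)  # targets blamed so far

    def limit(self):
        return 1 if self.mode is Mode.SINGLE else self.max_active

    def issue(self, target, cmd):
        """Put a command on the bus if the current limit allows it."""
        if len(self.active) >= self.limit():
            return False          # caller keeps it on a waiting queue
        self.active.append((target, cmd))
        return True

    def complete(self, target, cmd):
        """A command finished cleanly; count it toward reopening the bus."""
        self.active.remove((target, cmd))
        if self.mode is Mode.SINGLE:
            self.quiet_count += 1
            if self.quiet_count >= self.quiet_commands:
                self.mode = Mode.NORMAL
                self.resets_seen = 0
                self.quiet_count = 0

    def bus_reset(self):
        """Handle a bus reset.

        Returns the (target, cmd) that should be aborted outright rather
        than retried, or None if the culprit cannot be singled out yet.
        """
        culprit = None
        if self.mode is Mode.SINGLE and len(self.active) == 1:
            culprit = self.active[0]
            self.suspects.append(culprit[0])
        self.resets_seen += 1
        self.quiet_count = 0
        if self.resets_seen >= 2:     # second reset: treat as a repeat issue
            self.mode = Mode.SINGLE
        self.active.clear()           # in-flight commands are lost on reset
        return culprit
```

With this model, two resets in normal mode return None and drop the limit
to one; the next reset returns the single active (target, cmd) pair, which
is the one to abort instead of retry.  The pre-emptive abort and the timer
variant ("x seconds") are left out to keep it short.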

--linas 

