[3042] in linux-scsi channel archive


Re: RAID & unhappy scsi driver

daemon@ATHENA.MIT.EDU (Doug Ledford)
Tue Jan 6 01:59:57 1998

In-Reply-To: <199801060003.SAA21084@shadygrove.linas.org>
Date: 	Tue, 06 Jan 1998 00:13:26 -0600 (CST)
From: Doug Ledford <dledford@dialnet.net>
To: linas@linas.org
Cc: linas@fc.net, linux-eata@trudi.zdv.Uni-Mainz.DE,
        linux-scsi@vger.rutgers.edu, linux-raid@vger.rutgers.edu


On 06-Jan-98 linas@linas.org wrote:
>Well,  actually ... the reason I want with s/w raid is that I was 
>hoping to build a dual-cpu system, with one scsi bus attached to
>two servers.  I haven't yet gotten anywhere with this.

Ummm....this is bad.  Two machines attached to one SCSI bus gives a failure
mode possibility for taking out *both* machines with one bad device.  That's
the drawback to the SCSI bus.  It's a shared bus and a device on that bus
can render the whole damn thing inoperable.  More to the point, what I was
getting at with my original statement, was if you truly want a *real*
mission critical, can't be tripped up by nuclear war, couldn't stop it with
ten speeding bullets, and static electricity is your friend type computing
system, then you need to take a few hints from things like some of the Sparc
Enterprise servers.  They've built machines that can have whole CPU racks
hot swapped without ever rebooting.  The same goes for your SCSI bus.  When
people first started designing nuclear devices for modern missiles, there
were two concerns, reliability, and no false explosions.  These two goals
had two separate rules of thumb.  To increase reliability, you design
systems that are parallel and separate.  To increase safety, you design fail
safe circuits that are in series.  This created quite a mess of circuits,
but the rule of thumb about reliability applies here.  If you want things
reliable, then always concentrate on adding more independent systems that
can operate without the others intact, not the other way around.
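
To put rough numbers on that rule of thumb, here is a small Python sketch.
It assumes every part fails independently with the same probability p, which
is an idealization and not measured data, and the function names are made up
for illustration only.

```python
def redundant_failure(p, n):
    """Chance everything is down when n independent copies each fail with
    probability p and any one surviving copy keeps things running."""
    return p ** n


def shared_failure(p, n):
    """Chance everything is down when n parts share one path (such as n
    devices on a single SCSI bus) and any one failure takes it all out."""
    return 1 - (1 - p) ** n


if __name__ == "__main__":
    p = 0.01  # assumed per-part failure probability, purely illustrative
    for n in (1, 2, 4):
        print(n, redundant_failure(p, n), shared_failure(p, n))
```

Under those assumptions, adding independent copies drives the failure chance
down, while hanging more parts off one shared path drives it up.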

>
>While I have your ear, a couple of quickie questions:
>
>-- Having two cpu's accessing the bus at the same time should be OK, 
>   as long as they are not accessing the same partitions, right?
>   That is, the only reason to avoid dual access is to not
>   mangle a file system, right?

More or less.  The aic7xxx driver should handle reservation conflicts and
busy devices fairly well.  However, as Leonard pointed out, you can have
both machines access the same partition if they are read only.  I would
imagine that if you could disable read caching on a device, you could have
multiple machines access the same device read only and have one machine
access the device read/write for updates without problem as well.

>-- the bus-busy signal wire can be cleared with a bus reset, right?
>   I figure that if the main server goes down, its power will be 
>   cut with some dead-man switch, and the other CPU can take over.
>   I was not anticipating power-cycling the disk enclosure.

Unfortunately, no.  It isn't that easy.  In this case, the Busy pin on the
SCSI bus is a shared signal pin.  Its active state is defined as active
*LOW* (meaning 0V nominal, but actually, anything under about 2V should
trigger devices into thinking the bus is busy).  Any device is allowed to
pull this pin low (active) in order to signal busy.  The power to drive this
pin high (inactive) is provided by the SCSI terminators.  For certain
failure modes, power cycling won't even help.  For example, if one of your
drives develops a physical short between the busy signal pin and ground,
that pin is going low and it won't go high again until you disconnect that
device from the bus, period, end of discussion, system dead.  You could
power cycle things until you're blue in the face and it wouldn't help.  So,
knowing this information, you can see, a single SCSI bus is *NOT* advisable
for a truly failsafe operation.
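
The shared Busy line described above can be sketched as a toy model in
Python.  This is only an illustration of the "any device can pull it low"
behaviour, not driver code; the Device class and its fields are hypothetical
names, and the model assumes a reset or power cycle clears a device's logic
state but cannot clear a physical short.

```python
class Device:
    def __init__(self, name):
        self.name = name
        self.driving_bsy = False  # logic state, cleared by reset/power cycle
        self.shorted_bsy = False  # physical short to ground, survives reset

    def reset(self):
        # A bus reset or power cycle only clears the logic state.
        self.driving_bsy = False

    def pulls_bsy_low(self):
        return self.driving_bsy or self.shorted_bsy


def bus_busy(devices):
    # Terminators pull the line high; any one device pulling it low wins.
    return any(d.pulls_bsy_low() for d in devices)


if __name__ == "__main__":
    disks = [Device("disk0"), Device("disk1")]
    disks[0].driving_bsy = True   # hung device: a reset clears this
    disks[1].shorted_bsy = True   # damaged device: a reset does not
    for d in disks:
        d.reset()
    print(bus_busy(disks))  # bus state after resetting every device
    print(bus_busy([d for d in disks if not d.shorted_bsy]))  # short unplugged
```

In this model the bus stays busy after every device has been reset, and only
goes idle once the shorted device is taken off the bus.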

>   I am concerned that the failure scenario above is capable of knocking 
>   out both CPU's.

It most certainly is.

>That would be very nice!

I thought you might like that :)


----------------------------------
E-Mail: Doug Ledford <dledford@dialnet.net>
Date: 06-Jan-98
Time: 00:13:26
----------------------------------
