[1584] in SIPB-AFS-requests
SIPB cell AFS outage
daemon@ATHENA.MIT.EDU (ghudson@MIT.EDU)
Sun Nov 13 13:29:07 1994
From: ghudson@MIT.EDU
Date: Sun, 13 Nov 94 13:21:16 -0500
To: sipb-afsreq@MIT.EDU
Cc: sipb-all@MIT.EDU, wdc@MIT.EDU, probe@MIT.EDU, ops@MIT.EDU

This morning, rosebud crashed due to an as-yet-undetermined hardware
problem. This caused the SIPB cell to go into an AFS failure mode
where clients would hang on all accesses to ronald-ann's file server,
and not time out.

This is not documented behavior; my best theory is that because
ronald-ann does not have quorum, the file server times out trying to
contact a protection server. However, the AFS kernel is still
responding to RXpings, and therefore client machines do not time out.
Poor design.
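
The quorum theory above can be illustrated with a small model. This is a
minimal sketch in Python, not AFS code. It assumes the commonly described
Ubik voting rule: a database server needs votes from more than half of the
servers listed in CellServDB, and an exact tie goes to the side that
includes the lowest-addressed server. The addresses are placeholders from
the documentation range, and giving rosebud the lower address is an
assumption made only so that the model matches the theory; the message
does not say which machine has it.

```python
from ipaddress import IPv4Address


def has_quorum(db_servers, reachable):
    """Simplified Ubik-style vote count (an illustrative model only).

    db_servers: dict of server name -> IPv4 address string, one entry per
                database server listed in CellServDB.
    reachable:  set of server names that are up and able to vote.
    """
    # The lowest-addressed server acts as the tie-breaker.
    lowest = min(db_servers, key=lambda name: IPv4Address(db_servers[name]))
    votes = sum(1 for name in db_servers if name in reachable)
    if 2 * votes > len(db_servers):
        return True  # strict majority
    # Exact tie: only the side holding the lowest-addressed server wins.
    return 2 * votes == len(db_servers) and lowest in reachable


# Hypothetical addresses; only their relative order matters here.
two_server_cell = {"rosebud": "192.0.2.1", "ronald-ann": "192.0.2.2"}
one_server_cell = {"ronald-ann": "192.0.2.2"}

print(has_quorum(two_server_cell, {"ronald-ann"}))  # False
print(has_quorum(one_server_cell, {"ronald-ann"}))  # True
```

Under these assumptions a lone ronald-ann cannot win an election while
rosebud is still listed, but it trivially has quorum when it is the only
listed database server.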

Being aware of this, as soon as I got into the office (about 11:00), I
did a bos shutdown of ronald-ann to allow clients to time out.

mkgray, jhawk, jmmikel, and I then went into the machine room and
fiddled with rosebud, trying to determine what was wrong (it doesn't
boot; it gives a failure mode of 229, which is, roughly speaking, "IPL
device list null or no devices in NORMAL mode"). We were unable to
make the diagnostic disks useful.

At about 12:45, we tried restarting ronald-ann to see if it would
serve files normally if it was started up with only one server active.
Instead, after the file server went up, it went into the same failure
mode (clients would hang on ronald-ann fileserver accesses), so we
brought it down again after about ten minutes.

Following some of the experience gained from the last rosebud failure,
we did a "bos removehost" to remove rosebud from ronald-ann's
CellServDB (hi, Jeff), and restarted ronald-ann. ronald-ann is now up
and serving files normally. This restored service to most of the cell
(the outland locker appears to be the most visible lossage).
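
For reference, the recovery sequence described above might look roughly
like the following. The message does not give the exact invocations, so
this is a sketch under assumptions: the argument names follow standard bos
syntax, the -wait flag and the listhosts check are additions for
illustration, and "bos startup" is assumed to be what "restarted" means
here. Authentication and cell options are omitted. The script only prints
the commands; it does not run them.

```python
import shlex


def recovery_commands(survivor, dead):
    """Return the assumed bos command lines as argv lists (dry run only)."""
    return [
        # Stop the AFS server processes so clients can time out.
        ["bos", "shutdown", "-server", survivor, "-wait"],
        # Drop the dead database server from the survivor's CellServDB.
        ["bos", "removehost", "-server", survivor, "-host", dead],
        # Check the resulting server list (added here for illustration).
        ["bos", "listhosts", "-server", survivor],
        # Start the server processes again.
        ["bos", "startup", "-server", survivor],
    ]


for argv in recovery_commands("ronald-ann", "rosebud"):
    print(shlex.join(argv))
```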

HOWEVER, we should be absolutely certain not to let rosebud come up
fully without anyone watching. I believe that this happening was part
of what went wrong during the last rosebud failure.

Sam and some others are still trying to get diagnostics on rosebud.
We promise not to run "makevg". Someone who knows a bit more than
anyone here about making service calls to IBM should probably call a
service rep, and make sure that someone is here at the time we arrange
for the call.