[2218] in SIPB-AFS-requests
The death and rebirth of rosebud:/vicepb and rosebud:/vicepc
daemon@ATHENA.MIT.EDU (ghudson@MIT.EDU)
Sun Jan 7 10:18:44 1996
From: ghudson@MIT.EDU
Date: Sun, 7 Jan 96 10:18:31 -0500
To: sipb-afsreq@MIT.EDU
Summary: rosebud's Micropolis disk died, we're now using a replacement
disk Matt had for evaluation purposes (which we can keep), and we
didn't lose any data to speak of. The replacement disk is at SCSI ID
3 instead of 0, but otherwise rosebud looks about the same as it did
before.
Here is a chronology of what happened to the sipb cell last night:
* jhawk noted early Saturday that one of rosebud's disks was
making a loud noise.
* Some time around 9pm, rosebud lost contact with its
Micropolis 4GB disk. We experienced the usual symptoms of
this lossage, i.e. the bos server responds but the file
server does not.
* I went into the machine room, shut down rosebud, verified
that it was the Micropolis disk making all that noise, and
restarted rosebud. The Micropolis disk kept making noise
after it was restarted. At this point, I theorized that it
was a fan noise and that we needed a new enclosure. Indeed,
the air output from the enclosure seemed lower than the air
output from other enclosures' fans.
* rosebud did not come back up, having lost contact with its
Micropolis disk at some point prior to full functionality.
jhawk and Karl noted that the noise coming from the disk
sounded more like a disk noise than a fan noise.
* As I was still theorizing that it was an enclosure noise, Karl
and I stole from old-jason one of the big enclosures that looks
like a Vax, using the 3.5"-5.25" rack mount from
limekiller's floppy drive. We moved the Micropolis disk in
there, and noted that it still made noise. We judged this
as still an improvement over the old enclosure, since it
ventilated the disk better. Meanwhile, ronald-ann had been
experiencing the usual failure mode when rosebud is down, so
I brought up rosebud without the Micropolis disk and shut
down its file server.
* Matt brought over a 4GB evaluation disk (a 5400rpm Seagate)
to move the data onto. We get to keep this disk, henceforth
known as "the replacement disk." It was noted that the last
backup was done three weeks ago.
* We brought up rosebud with the replacement disk and the
Micropolis disk, which still made a lot of noise but didn't
do anything bad at that point.
* We observed that the replacement disk was a couple thousand
sectors smaller than the Micropolis disk, so we couldn't fit
both filesystems from the Micropolis disk onto the
replacement disk. We decided to start by recovering the
first filesystem, so we created a partition on the
replacement disk and dd'd the data onto it.
* The dd failed about 1GB in with an I/O error, and did so
again at the same place when we tried with
"conv=noerror,sync". jhawk came up with the idea of using
a partition on the source disk to skip over the previously
read data. Experimentation revealed that we could sometimes
read the blocks which gave the I/O error and sometimes
couldn't; at a time when we could, we copied the rest of the
data.
* We had trouble fsck'ing the filesystem on the target disk;
fsck complained that the superblock data didn't agree with
the first alternate. "fsck -b32" didn't encounter any
problems (other than minor filesystem inconsistencies),
though. We were eventually able to get fsck to work,
apparently after modifying the partition table with
chpt to do an unrelated test (I guess chpt updates the first
alternate).
* Running the salvager on the new filesystem showed three
files lost as well as much of sipb.decmips, but the
remainder of rosebud:/vicepb appeared to be restored.
* Our plan at this point was to dd the data from the second
partition onto a spare 2GB disk and vos move that back onto
the replacement disk. Unfortunately, the 2GB disk that ops
has been using was too small by about a hundred sectors.
* Karl moved some data off a 4GB disk (which we can't keep)
that he had in his office and brought it over. While we were
waiting for Karl, we turned off the Micropolis disk, which
had been growing louder. (We brought up rosebud again; at
this point we had it running a file server with no mounted
partitions.)
* When Karl returned with the 4GB disk, we found that we
couldn't start the Micropolis disk. It would make a very
loud noise when it started up, and would then spin down and
hang the SCSI bus if accessed. There was much wringing of
hands. We got the disk to boot by trying four different
orientations of the cluster and by jhawk pressing his hand
against the disk. A superstitious belief developed that the
disk would spin down if it heard us say anything, so we
conducted all communication for the next ten or fifteen
minutes by passing notes or typing on anxiety-closet.
* We created a partition on the new disk (which turned out to
be exactly the same size as the Micropolis disk, so we could
even create the partition in the same sectors) and dd'd the
data from the second Micropolis partition onto the new disk.
fsck and the salvager found no problems.
* We noticed a few volume headers on rosebud's root filesystem
in /vicepb, /vicepc, and /vicepd. It turns out that "vos
release" can apparently create volume headers in directories
where the file server doesn't believe there's a partition.
We nuked these volumes by running the Ultrix fsck on
/dev/rrz2a.
* We had everything restored, and were starting to clean up,
when rosebud's SCSI bus started hanging whenever we accessed
Karl's 4GB swap disk. There was much wringing of hands. We
replaced a bunch of things on the SCSI bus and otherwise
fiddled around, but it didn't help.
* We got Karl's swap disk to come up properly on
w20-spare-dec, so we started a file server there and vos
move'd the data back onto rosebud2:/vicepc. (The vos move
is almost complete at the current time; when it's complete,
we'll take down w20-spare-dec.)
* Karl is starting a backup shortly.
The whole process took about thirteen hours, with the cell
inaccessible during most of that time.
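
For anyone who has to repeat the dd recovery step, the idea of
resuming a copy past the data already read can be sketched roughly as
follows. This is only an illustration under stated assumptions, not
the commands run that night: the chronology describes skipping the
already-read data by defining a partition on the source disk, whereas
this sketch uses dd's skip= and seek= operands, and it runs on
ordinary scratch files rather than disk devices. The file names,
block size, and block counts are all made up for the example.

```shell
#!/bin/sh
# Sketch: resume an interrupted dd copy by skipping blocks already copied.
# All names and numbers below are hypothetical stand-ins, not real devices.
set -e
work=$(mktemp -d)
src="$work/source.img"   # stands in for the failing source partition
dst="$work/target.img"   # stands in for the partition on the new disk
bs=512                   # block size in bytes
total=64                 # total blocks in the stand-in source
done_blocks=20           # blocks copied before the (simulated) I/O error

# Build a stand-in source image with non-trivial contents.
dd if=/dev/urandom of="$src" bs=$bs count=$total 2>/dev/null

# First pass: copy only the blocks that were read before the failure.
dd if="$src" of="$dst" bs=$bs count=$done_blocks 2>/dev/null

# Second pass: skip= moves past those blocks on the input and seek= moves
# past them on the output. noerror,sync carries on after a read error and
# pads the short block with zeros; notrunc keeps the first pass's data.
dd if="$src" of="$dst" bs=$bs skip=$done_blocks seek=$done_blocks \
   conv=noerror,sync,notrunc 2>/dev/null

# Check that the two passes together reproduce the source.
if cmp -s "$src" "$dst"; then result=identical; else result=different; fi
echo "$result"
rm -rf "$work"
```

On a real failing disk, any block that stays unreadable comes out
zero-filled under conv=noerror,sync, so it is worth retrying the bad
region later, as happened here, before trusting the copy to fsck and
the salvager.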