[2218] in SIPB-AFS-requests


The death and rebirth of rosebud:/vicepb and rosebud:/vicepc

daemon@ATHENA.MIT.EDU (ghudson@MIT.EDU)
Sun Jan 7 10:18:44 1996

From: ghudson@MIT.EDU
Date: Sun, 7 Jan 96 10:18:31 -0500
To: sipb-afsreq@MIT.EDU

Summary: rosebud's Micropolis disk died, we're now using a replacement
disk Matt had for evaluation purposes (which we can keep), and we
didn't lose any data to speak of.  The replacement disk is at SCSI ID
3 instead of 0, but otherwise rosebud looks about the same as it did
before.

Here is a chronology of what happened to the sipb cell last night:

	* jhawk noted early Saturday that one of rosebud's disks was
	  making a loud noise.

	* Some time around 9pm, rosebud lost contact with its
	  Micropolis 4GB disk.  We experienced the usual symptoms of
	  this lossage, i.e. the bos server responds but the file
	  server does not.

	* I went into the machine room, shut down rosebud, verified
	  that it was the Micropolis disk making all that noise, and
	  restarted rosebud.  The Micropolis disk kept making noise
	  after it was restarted.  At this point, I theorized that it
	  was a fan noise and that we needed a new enclosure.  Indeed,
	  the air output from the enclosure seemed lower than the air
	  output from other enclosures' fans.

	* rosebud did not come back up, having lost contact with its
	  Micropolis disk at some point prior to full functionality.
	  jhawk and Karl noted that the noise coming from the disk
	  sounded more like a disk noise than a fan noise.

	* With me still theorizing it was an enclosure noise, Karl and
	  I stole from old-jason one of the big enclosures that looks
	  like a Vax, using the 3.5"-5.25" rack mount from
	  limekiller's floppy drive.  We moved the Micropolis disk in
	  there, and noted that it still made noise.  We judged this
	  as still an improvement over the old enclosure, since it
	  ventilated the disk better.  Meanwhile, ronald-ann had been
	  experiencing the usual failure mode when rosebud is down, so
	  I brought up rosebud without the Micropolis disk and shut
	  down its file server.

	* Matt brought over a 4GB evaluation disk (a 5400rpm Seagate)
	  to move the data onto.  We get to keep this disk, henceforth
	  known as "the replacement disk."  It was noted that the last
	  backup was done three weeks ago.

	* We brought up rosebud with the replacement disk and the
	  Micropolis disk, which still made a lot of noise but didn't
	  do anything bad at that point.

	* We observed that the replacement disk was a couple thousand
	  sectors smaller than the Micropolis disk, so we couldn't fit
	  both filesystems from the Micropolis disk onto the
	  replacement disk.  We decided to start by recovering the
	  first filesystem, so we created a partition on the
	  replacement disk and dd'd the data onto it.

	* The dd failed about 1GB in with an I/O error, and did so
	  again at the same place when we tried with
	  "conv=noerror,sync".  jhawk came up with the idea of using
	  a partition on the source disk to skip over the previously
	  read data.  Experimentation revealed that we could sometimes
	  read the blocks which gave the I/O error and sometimes
	  couldn't; at a time when we could, we copied the rest of the
	  data.

	* We had trouble fsck'ing the filesystem on the target disk;
	  fsck complained that the superblock data didn't agree with
	  the first alternate.  "fsck -b32" didn't encounter any
	  problems (other than minor filesystem inconsistencies),
	  though.  We were eventually able to get fsck to work,
	  apparently after modifying the partition table with
	  chpt to do an unrelated test (I guess chpt updates the first
	  alternate).

	* Running the salvager on the new filesystem showed three
	  files lost as well as much of sipb.decmips, but the
	  remainder of rosebud:/vicepb appeared to be restored.

	* Our plan at this point was to dd the data from the second
	  partition onto a spare 2GB disk and vos move that back onto
	  the replacement disk.  Unfortunately, the 2GB disk that ops
	  has been using was too small by about a hundred sectors.

	* Karl moved some data off a 4GB disk (which we can't keep)
	  that he had in his office and brought it over.  While we were
	  waiting for Karl, we turned off the Micropolis disk, which
	  had been growing louder.  (We brought up rosebud again; at
	  this point we had it running a file server with no mounted
	  partitions.)

	* When Karl returned with the 4GB disk, we found that we
	  couldn't start the Micropolis disk.  It would make a very
	  loud noise when it started up, and would then spin down and
	  hang the SCSI bus if accessed.  There was much wringing of
	  hands.  We got the disk to boot by trying four different
	  orientations of the cluster and by jhawk pressing his hand
	  against the disk.  A superstitious belief developed that the
	  disk would spin down if it heard us say anything, so we
	  conducted all communication for the next ten or fifteen
	  minutes by passing notes or typing on anxiety-closet.

	* We created a partition on the new disk (which turned out to
	  be exactly the same size as the Micropolis disk, so we could
	  even create the partition in the same sectors) and dd'd the
	  data from the second Micropolis partition onto the new disk.
	  fsck and the salvager found no problems.

	* We noticed a few volume headers on rosebud's root filesystem
	  in /vicepb, /vicepc, and /vicepd.  It turns out that "vos
	  release" can apparently create volume headers in directories
	  where the file server doesn't believe there's a partition.
	  We nuked these volumes by running the Ultrix fsck on
	  /dev/rrz2a.

	* We had everything restored, and were starting to clean up,
	  when rosebud's SCSI bus started hanging whenever we accessed
	  Karl's 4GB swap disk.  There was much wringing of hands.  We
	  replaced a bunch of things on the SCSI bus and otherwise
	  fiddled around, but it didn't help.

	* We got Karl's swap disk to come up properly on
	  w20-spare-dec, so we started a file server there and vos
	  move'd the data back onto rosebud2:/vicepc.  (The vos move
	  is almost complete at the current time; when it's complete,
	  we'll take down w20-spare-dec.)

	* Karl is starting a backup shortly.

The whole process took about thirteen hours, with the cell
inaccessible during most of that time.
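
For anyone facing a similar flaky-disk copy, here is a rough sketch of
the skip-and-retry idea from the dd step above.  It is written in Python
rather than dd, and it is not what was actually run that night: the
function name copy_with_retries, the 512-byte block size, and the retry
count are all illustrative assumptions.  It resumes from a given block
(in place of the partition trick used to skip already-read data),
retries blocks that fail with an I/O error, and zero-fills any block
that never reads, much as "conv=noerror,sync" would.

```python
import os
import time


def copy_with_retries(src_path, dst_path, block_size=512, start_block=0,
                      retries=5, delay=0.0):
    """Copy src to dst block by block, starting at start_block.

    A block that raises an I/O error is retried up to `retries` times.
    If it never reads, zeros are written in its place and its index is
    recorded.  Returns the list of zero-filled block indices.
    """
    bad_blocks = []
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        # Seeking to the end gives the size for regular files and,
        # on most Unix systems, for block devices as well.
        size = os.lseek(src, 0, os.SEEK_END)
        block = start_block
        while block * block_size < size:
            offset = block * block_size
            data = None
            for _ in range(retries):
                try:
                    data = os.pread(src, block_size, offset)
                    break
                except OSError:
                    # The disk may answer on a later attempt.
                    time.sleep(delay)
            if data is None:
                # Give up on this block: pad with zeros and move on.
                data = b"\0" * min(block_size, size - offset)
                bad_blocks.append(block)
            os.pwrite(dst, data, offset)
            block += 1
    finally:
        os.close(src)
        os.close(dst)
    return bad_blocks
```

Because source and destination offsets are kept identical, a copy that
stopped at some block can be picked up again by passing that block as
start_block, and the returned list says which blocks to be suspicious
of when fsck and the salvager run afterwards.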


