[304] in Tooltime

scopus test database status

daemon@ATHENA.MIT.EDU (Steven Wade Neiterman)
Mon Oct 28 18:41:09 1996

Date: Tue, 29 Oct 1996 06:43:06 +0100
To: tooltime@MIT.EDU
From: Steven Wade Neiterman <wade@MIT.EDU>

Just in case you would like to learn to repair server machines.   There
will be a quiz on the /u1, /u2, /u3 and /u4 drives at the next tooltime
meeting.

..Steve

-------------------------------------------------------------------------

Date: Mon, 28 Oct 1996 18:17:30 -0500
From: Ted McCabe <ted@MIT.EDU>
To: wade@MIT.EDU
Subject: Re: help-them lost another disk

Short status:
Work on help-them is still in progress.  It looks like the motherboard
might need replacing, but some tests still need to be done tomorrow.

/etc/group was garbled, perhaps by user error, and was preventing a
successful reboot when the hardware problems were worked around.  I
retyped the file by hand, copied from the stock install - which worked
well enough - to get the machine up.

Long status (for the curious):
I talked with Wade and we decided that the best course for getting
help-them back online was to swap /u4 in as /u1 and put a new 1G as
/u4, then restore the tar file that was 0'd on /u3 to get a consistent
snapshot of the databases.  We chose this because the most recent tape
backup of /u1 was probably made after the disk started to fail, but the
tape backup of the tar file on /u3 should be good.
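
The reasoning above amounts to choosing the newest backup taken before
the disk started to fail.  A minimal sketch in Python, assuming each
backup is a (label, date) pair; the dates and the function name are
hypothetical and do not come from the actual backup logs:

```python
from datetime import date

def latest_safe_backup(backups, failure_start):
    """Return the newest (label, date) taken before failure_start, or None."""
    safe = [b for b in backups if b[1] < failure_start]
    return max(safe, key=lambda b: b[1]) if safe else None

# Hypothetical dates, for illustration only.
backups = [
    ("tape backup of tar file on /u3", date(1996, 10, 20)),
    ("tape backup of /u1", date(1996, 10, 27)),
]
failure_start = date(1996, 10, 25)  # assumed; the real date is unknown

chosen = latest_safe_backup(backups, failure_start)
```

With these made-up dates the /u1 tape is rejected because it postdates
the assumed start of the failure.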

After I did the disk swap I noticed I was still getting SCSI transport
errors during fsck.  After some difficulty interpreting the info I
learned that the errors were associated with /u2.  The errors were
non-fatal and intermittent.  I swapped in a different 1G for /u2 and
the errors went away.

Subsequent testing found that the original /u2 disk worked fine on the
second SCSI chain, but not on the first chain.  I suspect that
something on the first chain has started to corrupt the disks and has
made the /u2 disk sensitive to the first chain.

As another problem, I found that the file /etc/group was garbled,
looking like a /.klogin file.  Note: the internal disk is on the first
scsi chain, but this problem could have been the result of some user
error.  I typed this in by hand, but

The current state of help-them is that the /u2 disk is now on the
second SCSI chain and the new 1G /u4 is on the first (I altered
/etc/vfstab to reflect this).  The old /u4 is now /u1, where /u1 was on
the first SCSI chain.

Future testing will involve swapping the cables out with /u2 back in
its original spot.  Errors occurring then would point the fault at the
motherboard.  I've sent e-mail to determine if help-them is on
maintenance; if so, we can swap the motherboard with one of
our spares.
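
The planned test can be read as a single decision.  A rough sketch in
Python; the function name is hypothetical, and the "cables" outcome is
an assumption, since the message only states what errors would mean:

```python
def interpret_cable_swap(errors_seen: bool) -> str:
    """Interpret the test: new cables, original /u2 disk back on the first chain."""
    if errors_seen:
        # Stated in the message: errors at this point implicate the motherboard.
        return "motherboard"
    # Assumption, not stated in the message: a clean run suggests the old cables.
    return "cables"
```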

Once I'm sure the hardware is stable, I'll do the restore of the
database snapshot.

   --Ted


