[1131] in SIPB-AFS-requests
[yandros@Athena.MIT.EDU: AFS mess timeline]
daemon@ATHENA.MIT.EDU (Matt Braun)
Fri Aug 27 16:14:56 1993
To: sipb-afsreq@mit.edu
Date: Fri, 27 Aug 93 16:14:25 EDT
From: Matt Braun <mhbraun@mit.edu>
Just for the record...
------- Forwarded Message
Received: from ATHENA-AS-WELL.MIT.EDU by po7.MIT.EDU (5.61/4.7) id AA05439; Sun, 22 Aug 93 23:18:19 EDT
Received: from OLIVER.MIT.EDU by Athena.MIT.EDU with SMTP
id AA10274; Sun, 22 Aug 93 23:18:05 EDT
From: yandros@Athena.MIT.EDU
Received: by oliver.MIT.EDU (AIX 3.2/UCB 5.64/4.7) id AA17774; Sun, 22 Aug 1993 23:17:59 -0400
Date: Sun, 22 Aug 1993 23:17:59 -0400
Message-Id: <9308230317.AA17774@oliver.MIT.EDU>
To: mhbraun@Athena.MIT.EDU, warlord@Athena.MIT.EDU, mhbraun@Athena.MIT.EDU,
mkgray@Athena.MIT.EDU, jweiss@Athena.MIT.EDU, tlyu@Athena.MIT.EDU,
yandros@Athena.MIT.EDU
Subject: AFS mess timeline
X-Orgs: MIT SIPB VWA LCS L2k
After (too much) waiting, here it is:
(~yandros/project/sipb/afs-mess.timeline)
Outages:
--------
- 36 minutes
- We don't know why the outage happened
- We were led to believe that it would `Just Work'
- We were led to believe that the failure mode would be a 5-minute
timeout at most
Problems:
---------
- Communications failure
- People doing things they didn't know how to do
- No one familiar enough with hardware (RS/6000)
- CellServDB changed without anyone knowing (actually, bos listhost lies)
- AFS lost (and we don't know why)
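Since `bos listhost' could not be trusted here, one way to catch a silently changed CellServDB is to compare the files themselves. A minimal sketch, assuming the usual CellServDB layout (a `>cellname' line followed by one `IP #hostname' line per database server). The cell name sipb.mit.edu and the sample file contents are hypothetical; only the two IP addresses come from the timeline.

```python
# Sketch: compare the server lists in two CellServDB files for one cell.
# The cell name "sipb.mit.edu" and the sample contents are assumptions for
# illustration; only the two IP addresses come from the timeline.

def parse_cellservdb(text):
    """Return {cell_name: set of server IPs} from CellServDB-format text."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                continue  # malformed cell line, skip it
            current = fields[0]
            cells[current] = set()
        elif current is not None:
            cells[current].add(line.split()[0])
    return cells


def diff_servers(cell, db_a, db_b):
    """Return (servers only in db_a, servers only in db_b) for one cell."""
    a = parse_cellservdb(db_a).get(cell, set())
    b = parse_cellservdb(db_b).get(cell, set())
    return sorted(a - b), sorted(b - a)


# Hypothetical copies, mirroring a state where one server lists both
# machines and the other lists only itself.
ROSEBUD_DB = """>sipb.mit.edu   #hypothetical cell name
18.70.0.210     #rosebud
18.70.0.219     #ronald-ann
"""
RONALD_ANN_DB = """>sipb.mit.edu   #hypothetical cell name
18.70.0.219     #ronald-ann
"""

only_rosebud, only_ronald_ann = diff_servers(
    "sipb.mit.edu", ROSEBUD_DB, RONALD_ANN_DB)
print("only in rosebud's copy:", only_rosebud)
print("only in ronald-ann's copy:", only_ronald_ann)
```

Any non-empty list means the two servers disagree about who is in the cell, which is the condition that kept recurring below.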
Timeline:
---------
=>Thursday, 9 AM<=
- BA showed up at the SIPB office to run diagnostics
- Marc edited both machines' CellServDB's, but did not bos remove
- Chad took down rosebud.
- BA ran diagnostics
- BA brought rosebud back up. (NO ADMIN PRESENT) It had:
Consistent CellServDB's (only ronald-ann)
DB processes running, but was not in the cell
- after that, nobody knows.
=>Friday, 9 AM<=
- Chaos in w20-5xx; BA shows up to replace disk, Greg locked out of
office, combo given to Greg remotely, BA gets to rosebud
- Status:
Rosebud's CellServDB is *UNKNOWN*
Ronald-ann CellServDB has *BOTH* machines
- Matt halted rosebud
- clients wedged (for more than the 5-minute maximum AFS timeout)
- Jeff unhappy
- Matt came to w20 and rebooted ronald-ann
- Clients *still* wedged
- ~60 seconds pass
- Matt reboots ronald-ann again. It does NOT have rosebud in CellServDB.
- Clients unwedged
- Ronald-ann is up, the cell is readwrite
- BA finishes installing new disk
- a boot is tried immediately; it fails
- attempt to install AIX 3.1 from AIX 3.2 install media FAILS
- attempt to install AIX 3.1 from AIX 3.1 install media FAILS
- Matt, unsure about AIX 3.2/AFS 3.2 combination, powers down rosebud
until Richard can be consulted
- Richard confirms use of AIX 3.2 with afsdev AFS 3.2 server binaries
- Tom succeeds in installing AIX 3.2 and AFS 3.2 on rosebud (afsdev
binaries, mkserv afs)
- Tom accidentally `makevg's rosebud instead of `importvg' *AFS data
lost*
- Jeff arrives and determines the makevg/importvg error; says that the
data is lost and will have to be restored from backups.
- Jeff hits a higher interrupt (Karen :-)
- Marc recreates volume groups on rosebud and makes the AFS volumes
- Status: no AFS processes on rosebud
- rosebud's CellServDB is created with *BOTH* machines
- Marc announces outage, finished setting up rosebud as an AFS server
except for bos config (no database files). He then bos shutdown'd
ronald-ann, changed the CellServDB to include both machines, and brought
ronald-ann back up.
- Marc started rosebud's bosserver, created the ptserver and vlserver
- Marc `bos listhost's both machines which show *the same* machines in
the cell.
- Quorum elections happen. Rosebud (18.70.0.210) wins. Ronald-ann
(18.70.0.219) dumps databases to rosebud.
- Status:
databases are consistent (both from ronald-ann)
- Marc started fsserver
- Marc begins restoring volumes from tape
- Calvin takes over restoring volumes from tape
- Calvin finds that some of the restores fail `in fantastic ways'.
Volumes were being restored incorrectly, so he shut down the AFS
processes on rosebud and sent mail.
-
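The quorum election result on Friday (rosebud at 18.70.0.210 winning over ronald-ann at 18.70.0.219) is what you would expect if the database servers favor the lowest IP address when choosing a sync site, which is how Ubik elections are usually described. A minimal sketch of that tie-break only, assuming that rule; the function name is hypothetical and this is not the real election protocol (no voting, no timeouts).

```python
import ipaddress

# Sketch of the "lowest IP address is preferred" tie-break only.
# Assumption: this is the rule in effect; real elections also involve
# votes from a majority of servers and timing, none of which is modeled.

def preferred_sync_site(servers):
    """Pick the server with the numerically lowest IPv4 address.

    servers maps a host name to its dotted-quad address.
    """
    return min(servers, key=lambda name: ipaddress.IPv4Address(servers[name]))


servers = {"rosebud": "18.70.0.210", "ronald-ann": "18.70.0.219"}
winner = preferred_sync_site(servers)
print(winner)
```

The comparison is numeric, not string-based, so 18.70.0.9 would beat 18.70.0.210.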
=>Saturday (not in stereo :-)<=
- Derek arrives and wants to find out what's wrong
- Derek bos restarted rosebud's AFS processes
- Started restoring volumes. He had problems with getting duplicated
Volume ID's when attempting to restore multiple volumes on the same
command line. He eventually restored all the volumes from tape.
- Derek tried to recreate the VLDB (both machines)
shut down vlservers
moved old database files out of the way
restarted vlservers
syncvldb's ronald-ann
syncvldb's ronald-ann again
syncvldb's rosebud (twice? I dunno -y)
vos examine'ing a volume restored from tape failed
vos listvol succeeded
- Derek checked the CellServDB manually:
rosebud: *both* machines
ronald-ann: *only ronald-ann*
- Derek rebuilt the VLDB:
changed the CellServDB's to both include both machines
stop'd the vlservers
removed the database files
restarted vlserver (on ronald-ann? -y)
syncvldb
restarted ptserver
- Derek then checked the restored volumes; the VLDB seemed fine
- Derek sent mail (I think -y) and left
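For the record, the VLDB rebuild sequence above can be written down as a dry-run plan. This is a sketch, not what Derek actually typed: the exact command lines are assumptions (the timeline gives only the step names), it uses `bos shutdown' where the second attempt says "stop'd", and moving the database files is left as a manual step on each server. Nothing is executed; the function only returns strings.

```python
# Sketch: build (but do not run) the command sequence for a VLDB rebuild.
# The command lines are assumptions reconstructed from the step names in
# the timeline; moving the old database files is left as a manual note.

def vldb_rebuild_plan(db_servers, file_servers):
    """Return the rebuild steps, in order, as a list of strings."""
    plan = []
    # 1. Stop the vlserver on every database server first.
    for host in db_servers:
        plan.append(f"bos shutdown -server {host} -instance vlserver")
    # 2. Get the old database files out of the way (done by hand).
    for host in db_servers:
        plan.append(f"# on {host}: move the old VLDB files out of the way")
    # 3. Restart the vlservers so they come up with an empty database.
    for host in db_servers:
        plan.append(f"bos start -server {host} -instance vlserver")
    # 4. Repopulate the VLDB from what each file server actually holds.
    for host in file_servers:
        plan.append(f"vos syncvldb -server {host}")
    return plan


hosts = ["rosebud", "ronald-ann"]
plan = vldb_rebuild_plan(db_servers=hosts, file_servers=hosts)
for step in plan:
    print(step)
```

Checking that both CellServDB's agree before step 3 would have avoided the first failed attempt.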
There are some questions still, and some inaccuracies, probably.
There are a couple points near the end that I think Derek can fill in
and then place in his homedir with the other writeups. If anyone else
should get this, feel free to send it; I'd forgotten where we decided
it should go.
chad
(who apologizes for being lame for so long)
------- End of Forwarded Message