[1131] in SIPB-AFS-requests

[yandros@Athena.MIT.EDU: AFS mess timeline]

daemon@ATHENA.MIT.EDU (Matt Braun)
Fri Aug 27 16:14:56 1993

To: sipb-afsreq@mit.edu
Date: Fri, 27 Aug 93 16:14:25 EDT
From: Matt Braun <mhbraun@mit.edu>


Just for the record...


------- Forwarded Message

Received: from ATHENA-AS-WELL.MIT.EDU by po7.MIT.EDU (5.61/4.7) id AA05439; Sun, 22 Aug 93 23:18:19 EDT
Received: from OLIVER.MIT.EDU by Athena.MIT.EDU with SMTP
	id AA10274; Sun, 22 Aug 93 23:18:05 EDT
From: yandros@Athena.MIT.EDU
Received: by oliver.MIT.EDU (AIX 3.2/UCB 5.64/4.7) id AA17774; Sun, 22 Aug 1993 23:17:59 -0400
Date: Sun, 22 Aug 1993 23:17:59 -0400
Message-Id: <9308230317.AA17774@oliver.MIT.EDU>
To: mhbraun@Athena.MIT.EDU, warlord@Athena.MIT.EDU, mhbraun@Athena.MIT.EDU,
        mkgray@Athena.MIT.EDU, jweiss@Athena.MIT.EDU, tlyu@Athena.MIT.EDU,
        yandros@Athena.MIT.EDU
Subject: AFS mess timeline
X-Orgs: MIT SIPB VWA LCS L2k


After (too much) waiting, here it is:
(~yandros/project/sipb/afs-mess.timeline)

    Outages:
    --------
  
    - 36 minutes
    - We don't know why the outage happened
    - We were led to believe that it would `Just Work'
    - We were led to believe that the failure mode would be a 5-minute
      timeout at most
  
    Problems:
    ---------
  
    - Communications failure
    - People doing things they didn't know how to do
    - No one familiar enough with hardware (RS/6000)
    - CellServDB changed without anyone knowing (actually, bos listhost lies)
    - AFS lost (and we don't know why)
  
    Timeline:
    ---------
  
    =>Thursday, 9 AM<=
  - BA showed up at the SIPB office to run diagnostics
  - Marc edited both machines' CellServDB's, but did not bos remove
  - Chad took down rosebud.
  - BA ran diagnostics
  - BA brought rosebud back up.  (NO ADMIN PRESENT) It had:
      Consistent CellServDB's (only ronald-ann)
      DB processes running, but was not in the cell
  - after that, nobody knows.
  
    =>Friday, 9 AM<=
  - Chaos in w20-5xx; BA shows up to replace disk, Greg locked out of
    office, combo given to Greg remotely, BA gets to rosebud
  - Status:
      Rosebud's CellServDB is *UNKNOWN*
      Ronald-ann CellServDB has *BOTH* machines
  - Matt halted rosebud
  - clients wedged (for more than the 5-minute maximum AFS timeout)
  - Jeff unhappy
  - Matt came to w20 and rebooted ronald-ann
  - Clients *still* wedged
  - ~60 seconds pass
  - Matt reboots ronald-ann again.  It does NOT have rosebud in CellServDB.
  - Clients unwedged
  - Ronald-ann is up, the cell is readwrite
  - BA finishes installing new disk
  - a boot is tried immediately; it fails
  - attempt to install AIX 3.1 from AIX 3.2 install media FAILS
  - attempt to install AIX 3.1 from AIX 3.1 install media FAILS
  - Matt, unsure about AIX 3.2/AFS 3.2 combination, powers down rosebud
    until Richard can be consulted
  - Richard confirms use of AIX 3.2 with afsdev AFS 3.2 server binaries
  - Tom succeeds in installing AIX 3.2 and AFS 3.2 on rosebud (afsdev
    binaries, mkserv afs)
  - Tom accidentally `makevg's rosebud instead of `importvg'  *AFS data
    lost*
  - Jeff arrives and determines the makevg/importvg error; says that the
    data is lost and will have to be restored from backups.
  - Jeff hits a higher interrupt (Karen :-)
  - Marc recreates volume groups on rosebud and makes the AFS volumes
  - Status: no AFS processes on rosebud
  - rosebud's CellServDB is created with *BOTH* machines
  - Marc announces outage, finished setting up rosebud as an AFS server
    except for bos config (no database files).  He then bos shutdown'd
    ronald-ann, changed the CellServDB to include both machines, and brought
    ronald-ann back up.
  - Marc started rosebud's bosserver, created the ptserver and vlserver
  - Marc `bos listhost's both machines which show *the same* machines in
    the cell.
  - Quorum elections happen.  Rosebud (18.70.0.210) wins. Ronald-ann
    (18.70.0.219) dumps databases to rosebud.
  - Status:
      databases are consistent (both from ronald-ann)
  - Marc started fsserver
  - Marc begins restoring volumes from tape
  - Calvin takes over restoring volumes from tape
  - Calvin finds that some of the restores fail `in fantastic ways'.
    Volumes were being restored incorrectly, so he shut down the AFS
    processes on rosebud and sent mail.
  
    =>Saturday (not in stereo :-)<=
  - Derek arrives and wants to find out what's wrong
  - Derek bos restarted rosebud's AFS processes
  - Started restoring volumes.  He had problems with getting duplicated
    Volume ID's when attempting to restore multiple volumes on the same
    command line.  He eventually restored all the volumes from tape.
  - Derek tried to recreate the VLDB (both machines)
      shut down vlservers
      moved old database files out of the way
      restarted vlservers
      syncvldb's ronald-ann
      syncvldb's ronald-ann again
      syncvldb's rosebud (twice?  I dunno -y)
      vos examine'ing a volume restored from tape failed
      vos listvol succeeded
  - Derek checked the CellServDB manually:
      rosebud:  	*both* machines
      ronald-ann: *only ronald-ann*
  - Derek rebuilt the VLDB:
      changed the CellServDB's to both include both machines
      stop'd the vlservers
      removed the database files
      restarted vlserver (on ronald-ann? -y)
      syncvldb
      restarted ptserver
  - Derek then checked the restored volumes; the VLDB seemed fine
  - Derek sent mail (I think -y) and left
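
Mismatched CellServDB's come up several times in the timeline above,
and bos listhost could not be trusted to report them.  A check along
these lines, run against the files themselves, might flag the problem
earlier.  This is only a sketch: it assumes each server's CellServDB
text has already been fetched by some other means, the function names
are made up for illustration, and `example.cell' is a placeholder
cell name.

```python
def parse_cellservdb(text):
    """Return {cell_name: set of server addresses} from CellServDB-style text."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Cell header line: ">cellname   #optional comment"
            current = line[1:].split("#", 1)[0].strip()
            cells[current] = set()
        elif current is not None:
            # Server line: "address   #hostname"
            addr = line.split("#", 1)[0].strip()
            if addr:
                cells[current].add(addr)
    return cells


def cellservdb_mismatches(copies):
    """copies maps machine name -> that machine's CellServDB text.

    Returns {cell: {machine: sorted address list}} for every cell whose
    server list is not identical on all machines.
    """
    parsed = {machine: parse_cellservdb(text) for machine, text in copies.items()}
    all_cells = set()
    for cells in parsed.values():
        all_cells.update(cells)

    mismatches = {}
    for cell in all_cells:
        views = {m: frozenset(cells.get(cell, ())) for m, cells in parsed.items()}
        if len(set(views.values())) > 1:
            mismatches[cell] = {m: sorted(v) for m, v in views.items()}
    return mismatches


if __name__ == "__main__":
    # Saturday's state: rosebud lists both machines, ronald-ann only itself.
    # "example.cell" is a placeholder, not the real cell name.
    both = ">example.cell\n18.70.0.210 #rosebud\n18.70.0.219 #ronald-ann\n"
    only = ">example.cell\n18.70.0.219 #ronald-ann\n"
    print(cellservdb_mismatches({"rosebud": both, "ronald-ann": only}))
```

An empty result means every machine lists the same servers for every
cell; anything else names the machines that disagree.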

There are some questions still, and some inaccuracies, probably.
There are a couple points near the end that I think Derek can fill in
and then place in his homedir with the other writeups.  If anyone else
should get this, feel free to send it; I'd forgotten where we decided
it should go.

chad
(who apologizes for being lame for so long)

------- End of Forwarded Message

