[2262] in SIPB-AFS-requests


Status of the cell, and possible plans to revert to 3.2

daemon@ATHENA.MIT.EDU (ghudson@MIT.EDU)
Tue Jan 30 01:38:29 1996

From: ghudson@MIT.EDU
Date: Tue, 30 Jan 96 01:38:10 -0500
To: sipb-afsreq@MIT.EDU

We've acquired an Athena fileserver-configuration Sparc 5 (96MB RAM,
two SCSI busses) and put it in the machine room.  Its CPU is on top of
picayune, its disk (the 4GB Micropolis disk previously on hodge) is on
top of picayune's disks, and its monitor and keyboard are on top of
the nearby cart.

We've run the machine through the usual AFS install, and are in the
process of tightening up its security (see the files in
/afs/sipb/service).

In the meantime, before we can put the new AFS server into production,
we have to resolve a serious stability problem that has come up.
rosebud's bosserver, volserver, and ptserver have been dying
repeatedly: three times since the initial installation, the third time
before the salvage from the second failure had finished.  After the
third failure, we decided to reboot rosebud.

The problem usually shows up first as CPS errors.  To recover when it
happens:

	* kill -QUIT the fileserver process, and wait for it to die
	  (seems to be 10-20 seconds).

	* kill the vlserver process.

	* If this happens to ronald-ann, reboot it, since we haven't
	  tried that yet.
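
As a rough sketch only, the first two recovery steps above could be
scripted along these lines.  The function names, the 60-second limit,
and the dry-run switch are made up for illustration; the PIDs are
expected to be looked up by hand with ps and passed in.

```shell
#!/bin/sh
# Sketch only.  DRY_RUN=1 (the default here) prints the commands
# instead of running them; set DRY_RUN=0 to act for real.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Wait up to $2 seconds for process $1 to exit.
# kill -0 sends no signal; it only tests whether the process exists.
wait_for_exit() {
    pid=$1
    limit=$2
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
        waited=`expr $waited + 1`
    done
    return 0
}

# recover FILESERVER_PID VLSERVER_PID
recover() {
    fs_pid=$1
    vl_pid=$2
    run kill -QUIT "$fs_pid"
    if [ "$DRY_RUN" != 1 ]; then
        # Observed shutdown time is 10-20 seconds; 60 is a generous guess.
        if ! wait_for_exit "$fs_pid" 60; then
            echo "fileserver $fs_pid is still running" >&2
            return 1
        fi
    fi
    run kill "$vl_pid"
}
```

Calling "recover 111 222" in the default dry-run mode just prints the
two kill commands, which is a cheap way to check the PIDs before
committing to anything.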

Both servers are running process accounting now, so the next time
either of them fails we can check how the bosserver exited.
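
For reference, the accounting records can be read with lastcomm(1),
which prints a column of flag letters for each command.  The helper
below is a hypothetical convenience for decoding that column by hand;
it assumes the usual BSD flag meanings (S, F, D, X) and does not parse
lastcomm output itself.

```shell
#!/bin/sh
# Sketch: decode the flag letters lastcomm(1) prints next to a command.
# explain_flags is a made-up helper name, not part of any AFS tool.
explain_flags() {
    case "$1" in *S*) echo "ran as super-user" ;; esac
    case "$1" in *F*) echo "forked without exec" ;; esac
    case "$1" in *D*) echo "dumped core" ;; esac
    case "$1" in *X*) echo "killed by a signal" ;; esac
}

# Typical use on a server with accounting turned on (not run here):
#   lastcomm bosserver
# then pass the flags column of the newest entry to the helper:
#   explain_flags DX
```

A D or X flag on the bosserver entry would suggest it crashed or was
killed rather than exiting on its own.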

Please comment if you object to the following plan for dealing with
the problem:

	* If rosebud continues to lose in this manner, we will back
	  out the cell to 3.2 tomorrow night, or during the day if it
	  gets really bad.

	* If ronald-ann loses in this manner, we will reboot it the
	  first time, and revert the cell to 3.2 if it continues to
	  lose (tomorrow night, or during the day if it gets really
	  bad).

	* If we learn something significant from the process
	  accounting information, or if we can find relevant changes
	  between /mit/opssrc/afs.33a and afs33a.no.8bit, we can act
	  on that knowledge rather than reverting the cell.

Following is my plan for reverting the cell to 3.2, if necessary:

	* Shutting down the cell
	* Killing the bosserver on both machines
	* Moving the bin.32 and db.32 directories into place, as well
	  as /etc/fsck.32, on both machines
	* Restarting the cell
	* Running vos syncvldb to sync the vldb with any changes made
	  since the upgrade
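
To make the ordering concrete, here is a sketch that prints the revert
steps as a checklist; it runs nothing itself.  Several details are
assumptions rather than part of the plan: that bin.32 and db.32 sit
next to the live directories under /usr/afs, that /etc/fsck.32
replaces /etc/fsck, the ".33" name for saving the current copies, and
the exact bos and vos invocations, which should be checked against the
3.2 documentation before use.

```shell
#!/bin/sh
# Prints a command checklist for the revert; it executes none of it.
AFS_DIR=${AFS_DIR:-/usr/afs}   # assumed home of bin/ and db/

# Print the two moves that put the .32 copy of $1 into place.
# The ".33" save name is hypothetical.
swap_cmds() {
    echo "  mv $1 $1.33"
    echo "  mv $1.32 $1"
}

# print_plan HOST...
# Each phase is listed for every host before the next phase starts,
# since the whole cell has to be down before anything is swapped.
print_plan() {
    for h in "$@"; do echo "bos shutdown $h -wait"; done
    for h in "$@"; do echo "on $h: kill the bosserver (pid from ps)"; done
    for h in "$@"; do
        echo "on $h:"
        swap_cmds "$AFS_DIR/bin"
        swap_cmds "$AFS_DIR/db"
        swap_cmds /etc/fsck
    done
    for h in "$@"; do echo "on $h: $AFS_DIR/bin/bosserver &"; done
    for h in "$@"; do echo "vos syncvldb -server $h"; done
}

print_plan rosebud ronald-ann
```

Printing rather than executing is deliberate: the moves have to be
done locally on each server, and a checklist is easier to review than
a script that acts on its own.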

I don't believe this is any more dangerous than starting the servers
with empty vldbs, and it will cause a much shorter outage.

If we do revert the cell to 3.2, we will have to wing it as far as
putting the new AFS server into production.

