[2254] in SIPB-AFS-requests
Backup and upgrade post-mortem
daemon@ATHENA.MIT.EDU (ghudson@MIT.EDU)
Mon Jan 29 04:31:54 1996
From: ghudson@MIT.EDU
Date: Mon, 29 Jan 96 04:31:37 -0500
To: sipb-afsreq@MIT.EDU
Cc: afsdev@MIT.EDU
This was not a very successful AFS day, but everything is stable for
now. People on afsdev should skip to the part after the backup
problem (beginning with "At about 3:15am").
At about 5pm, I started a backup of the AFS cell. (I lost the logs
for the cloning because I wasn't careful and /tmp is a symlink to
/var/tmp on Athena DECstations.) The backup was interrupted at about
8:00pm when the X server on opus crashed. (I was not using screen.)
About 1.8GB of data had been written to sipb.full_3.1.
From 8-10pm, I assembled a list of volumes that had been modified on
or since January 22 (the date of the last backup). These totalled 4GB,
about 0.7GB of which had already been backed up on sipb.full_3.1. I
backed up the other 3.3GB onto a tape which is now in a case marked
"recovery tape for 1/28/96 SIPB AFS backup failure," using a volume
set called "recover". (The tape itself does not have a label on it,
on the theory that it can be reused after a week or so.) The backup
took until about 2am, and apparently died with an error at the end for
no obvious reason:
backup> j
Job 0 dump (recover.recover): 133612120 bytes transferred from volume project.netbsd.dev.backup; Processed 96/104 volumes
[Time passes.]
backup> backup: Failed to check dump status
backup: server or network not responding ; Will schedule a retry
backup: Failed to check dump status
backup: No dump task with specified ID ; Hard failure; aborting dump
backup: error in aborting dump recover.recover
j
backup> j
backup> quit
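The volume selection described above (everything modified on or since
the last backup date, minus whatever already made it onto the
interrupted dump) can be sketched roughly as follows. This is only an
illustration: the volume names, dates, and sizes are made up, and the
helper functions are hypothetical rather than part of any AFS tool.

```python
from datetime import date

# Hypothetical records: (volume name, last-modified date, size in bytes).
# None of these names or sizes come from the real cell.
VOLUMES = [
    ("user.alice", date(1996, 1, 25), 300_000_000),
    ("user.bob", date(1996, 1, 10), 500_000_000),
    ("project.example", date(1996, 1, 22), 700_000_000),
    ("contrib.example", date(1996, 1, 28), 200_000_000),
]


def volumes_to_recover(volumes, last_backup, already_dumped):
    """Return volumes modified on or since last_backup that are not
    already present on the interrupted dump."""
    return [
        (name, modified, size)
        for name, modified, size in volumes
        if modified >= last_backup and name not in already_dumped
    ]


def total_bytes(volumes):
    """Sum the sizes of the given volume records."""
    return sum(size for _, _, size in volumes)


if __name__ == "__main__":
    # Assume (for illustration) that user.alice was already dumped.
    todo = volumes_to_recover(VOLUMES, date(1996, 1, 22), {"user.alice"})
    for name, modified, size in todo:
        print(name, modified.isoformat(), size)
    print("total bytes:", total_bytes(todo))
```

The comparison is inclusive so that volumes modified on the day of the
last backup are picked up again, matching "on or since January 22".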
At about 3:15am, I performed the AFS upgrade, according to plan
(except that I remembered to restart the bosservers before restarting
the cell). Volume attaching seemed to take place very quickly (about
twelve seconds per partition). We started getting a lot of log
messages in FileLog of the form:
Mon Jan 29 03:21:06 1996 Warning: GetHostCPS failed (5376) for 18.71.0.55; will retry
(with different hosts). Our theory is that these errors come from the
fileservers being unable to contact the ptservers for IP-based ACL
information. Within several minutes, clients began hanging, and we
discovered that the ptservers and bosservers on both server machines
were no longer running. There were fileservers and vlservers running
on both machines, but unsurprisingly, the fileservers weren't serving
anything (presumably for lack of ptservers to talk to).
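To see how widespread the GetHostCPS warnings are, one could tally
them per client address. The sketch below assumes the FileLog lines
look exactly like the one quoted above; the regular expression and
function name are my own, and the second sample line uses a made-up
address purely for illustration.

```python
import re
from collections import Counter

# Matches lines of the form quoted in this message, e.g.
# "Mon Jan 29 03:21:06 1996 Warning: GetHostCPS failed (5376) for 18.71.0.55; will retry"
# The pattern is an assumption based on that single example.
GETHOSTCPS_RE = re.compile(
    r"^(?P<when>\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}) "
    r"Warning: GetHostCPS failed \((?P<code>-?\d+)\) "
    r"for (?P<host>\d+\.\d+\.\d+\.\d+); will retry"
)


def count_gethostcps_failures(lines):
    """Count GetHostCPS warnings per client IP address."""
    counts = Counter()
    for line in lines:
        match = GETHOSTCPS_RE.match(line.strip())
        if match:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        # Taken from the FileLog excerpt above.
        "Mon Jan 29 03:21:06 1996 Warning: GetHostCPS failed (5376) for 18.71.0.55; will retry",
        # Hypothetical line with a documentation-only address.
        "Mon Jan 29 03:21:09 1996 Warning: GetHostCPS failed (5376) for 192.0.2.1; will retry",
        "Mon Jan 29 03:21:10 1996 some unrelated log line",
    ]
    for host, n in count_gethostcps_failures(sample).most_common():
        print(host, n)
```

In practice the lines would be read from the FileLog file on each
server rather than from an inline list.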
After consulting with Matt Braun, we killed the remaining AFS
processes on rosebud and restarted the bosserver. While rosebud was
salvaging, ronald-ann's fileserver started serving again
(unsurprising; it had a ptserver to talk to). After rosebud finished
salvaging, we killed the AFS processes on ronald-ann and restarted the
bosserver.
The salvage logs for both servers came up clean (a lot shorter than
with 3.2, and salvaging didn't seem to take as long), and we did not
get any GetHostCPS log messages when we restarted. Everything appears
to be running normally now.