[802] in SIPB-AFS-requests


Word of Caution: Damage Recovery

daemon@ATHENA.MIT.EDU (Richard Basch)
Tue Sep 29 06:15:17 1992

Date: Tue, 29 Sep 92 06:14:44 -0400
To: sipb-afsreq@MIT.EDU
From: "Richard Basch" <basch@MIT.EDU>


Pursuant to an incident with the SIPB cell where some data was lost, I
feel that I should probably post these words of caution.

Whenever a volume error occurs, such as:

	Possible communication failure.
	vos: Could not end transaction on volume ##########

or (with very few exceptions) any other error, you should consider the
filesystem to be corrupt.

The FIRST course of action should be to salvage the volume (or its
parent, if it is a clone) or the partition.  You should NEVER attempt to
correct the problem with "vos zap" or any other "vos" command that
affects the reference counts on a volume.  One possible cause of the
corruption is a bad vnode that other operations have tripped over, in
which case the other reference counts in the volume may also be
incorrect.  Courses of action other than salvaging can often result in
significant data loss.
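As a minimal sketch of the "salvage first" advice: the server, partition,
and volume names below are hypothetical placeholders, and the script only
prints the commands (via "echo") rather than running them.  Drop the
"echo" to run them for real, as an administrator of the cell.

```shell
server=fs1.example.com   # hypothetical fileserver name
part=/vicepa             # hypothetical partition
vol=user.example         # hypothetical damaged volume
                         # (for a clone, salvage its read/write parent)

# Salvage a single volume (least disruptive to the rest of the partition):
echo bos salvage -server "$server" -partition "$part" -volume "$vol" -showlog

# If the damage may not be confined to one volume, salvage the partition:
echo bos salvage -server "$server" -partition "$part" -showlog
```

Salvaging one volume keeps the other volumes on the partition in
service, which is usually preferable when you know which volume tripped.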

In addition, here is another procedure that I stumbled across, for a
particular failure mode... this was sent to "op" a while ago.

-R

[5037] daemon@ATHENA.MIT.EDU (probe@MIT.EDU)  Ops_Projects  08/27/92 23:14 (41 lines)
Subject: More on bad volumes.
From: probe@MIT.EDU
Date: Thu, 27 Aug 92 23:13:41 -0400
To: op@MIT.EDU

There is a common failure mode dealing with .backup volumes.
If you try salvaging a volume and it says:
        No vice inodes associated with volume
or
some other message indicating that there is NO data in the volume and that
only a volume header remains (sometimes it will say it is deleting the volume
header), the volume may still be corrupt.

To fix this problem (once the salvager has been run), log into the AFS
server as root and remove the volume header that it claims was bad.

Then, do "vos remove <server> <part> <backup-volid or volume.backup>"
The vos operation will say that the volume doesn't exist on the server,
but the point of the operation is to invalidate the backup volume in the
VLDB.

After that, you can do "vos backup <volume> -v".
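The steps above can be sketched as follows.  All names are hypothetical
placeholders (including the backup volume ID and the header file name,
which on typical fileservers is V<volid>.vol in the partition directory),
and the script only prints the commands via "echo" rather than running
them.

```shell
server=fs1.example.com    # hypothetical fileserver
part=/vicepa              # hypothetical partition
vol=user.example          # hypothetical volume whose .backup is damaged
backup_id=536870916       # hypothetical backup volume ID

# 1. On the fileserver, after running the salvager, remove the volume
#    header the salvager claimed was bad:
echo rm "$part/V0$backup_id.vol"

# 2. Invalidate the backup volume's entry in the VLDB; vos will report
#    that the volume does not exist on the server, which is expected:
echo vos remove "$server" "$part" "$vol.backup"

# 3. Re-create the backup clone, with -v so the output can be checked:
echo vos backup "$vol" -v
```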

Another indication that there may be a problem is if you do:
        vos backup <volume> -v
and only one line appears (of the following form)
        Created backup volume <volid>... done
There should be TWO lines, not one...
        Creating a new backup clone.
        Created backup volume ...
or
        Re-cloning volume ...
        Created backup volume ...
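A small check for the symptom described above: the function below reads
the output of "vos backup <volume> -v" on stdin and flags runs that
printed only the final "Created backup volume" line without a preceding
"Creating"/"Re-cloning" line.  The sample outputs piped in are
hypothetical.

```shell
check_backup_output() {
    # Reads "vos backup ... -v" output on stdin.
    # Two or more lines is the healthy pattern; one line is suspect.
    if [ "$(wc -l)" -ge 2 ]; then
        echo ok
    else
        echo suspect
    fi
}

printf 'Creating a new backup clone.\nCreated backup volume 536870916... done\n' \
    | check_backup_output     # prints "ok"
printf 'Created backup volume 536870916... done\n' \
    | check_backup_output     # prints "suspect"
```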

If you omit the -v option, you cannot distinguish success from failure.
There were a couple of bad volumes, and unfortunately "vos" was not
catching this failure mode (and some volumes that had supposedly been
fixed after cloning warnings were left in a state where the cloning
system did not notice the errors).  The bug with "vos" has been fixed
in "afsdev".
