[186] in Athena_Backup_System
disaster (crippled) mode operation
daemon@ATHENA.MIT.EDU (Jonathon Weiss)
Thu Feb 1 13:13:33 1996
From: Jonathon Weiss <jweiss@MIT.EDU>
To: athena-backup@MIT.EDU
Date: Thu, 01 Feb 1996 13:13:21 EST
Below is the text I have written for the ABS overview doc, describing
disaster mode operation. It will appear after the section on disaster
recovery. Please send me comments when you have a chance.
Jonathon
\section{Disaster Mode Operation}
Although there are plans for disaster recovery, there is always the
chance that some backup or restore operation will need to be done
before disaster recovery is complete. To deal with this
possibility, the Athena Backup System will have a degraded mode of
operation that can be used even if the master is down. The degraded
mode of operation will use two programs: the tape slave and a test
driver.

In the degraded mode of operation the tape slave will operate fairly
normally. It will talk to the test driver as though the driver were
the master. If a backup needs to be done, the tape slave will get the
necessary information to do the backup from the driver, and report the
results back to it. Likewise, if a restore needs to be done, the tape
slave will get the information needed to do the restore from the
driver, and report when the restore is complete.
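
As a rough illustration of this arrangement, the following Python
sketch shows a tape slave that performs whatever job it is handed and
reports the result, without knowing whether the caller is the master
or a hand-operated driver. Every class, method, and field name here is
hypothetical; the actual ABS RPC interface is not described in this
section, and plain method calls stand in for the RPC layer.

```python
class TapeSlave:
    """Hypothetical stand-in for the tape slave.

    It performs whatever job it is handed and reports the result to the
    caller, without knowing whether the caller is the master or a driver.
    """

    def __init__(self):
        # In-memory stand-in for a tape: a list of (volume, data) entries.
        self.tape = []

    def backup(self, volume, data):
        self.tape.append((volume, data))
        position = len(self.tape) - 1
        return {"op": "backup", "volume": volume, "position": position, "ok": True}

    def restore(self, position):
        if not 0 <= position < len(self.tape):
            return {"op": "restore", "position": position, "ok": False}
        volume, data = self.tape[position]
        return {"op": "restore", "volume": volume, "data": data, "ok": True}


class ManualDriver:
    """Hypothetical test driver: plays the master's role with operator input."""

    def __init__(self, slave):
        self.slave = slave
        self.reports = []  # results reported back by the slave

    def request_backup(self, volume, data):
        report = self.slave.backup(volume, data)
        self.reports.append(report)
        return report

    def request_restore(self, position):
        report = self.slave.restore(position)
        self.reports.append(report)
        return report
```

In this sketch the slave's code is identical in normal and degraded
operation; only the object that calls it changes.
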

During degraded operation, the test driver needs to imitate the
master, at least as far as the tape slave is concerned. Since the
test driver will be designed to test the tape slave, it will already
know how to make the appropriate RPC calls. The difficult part will
be collecting all of the information necessary for the tape slave to
perform the desired task. The person using the test driver
will be required to provide most of the information by hand, although
the test driver may be able to determine some things dynamically, such
as the current location of a volume. For instance, if a restore is
necessary, the operator will have to look through an ASCII dump of the
database, or something similar, to find out which tape holds the data
in question and where on that tape it is located.
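
The manual lookup described above could be eased by a small helper
such as the Python sketch below. The format of the ASCII dump is not
specified here, so the layout assumed in the code (one
whitespace-separated record per line: volume, dump date, tape label,
file number) is purely an assumption, as are the function name and the
sample data.

```python
def find_dumps(dump_text, volume):
    """Return (dump_date, tape_label, file_number) tuples for one volume.

    Assumes (hypothetically) one whitespace-separated record per line:
        volume  dump_date  tape_label  file_number
    Blank lines and lines starting with '#' are ignored.
    """
    matches = []
    for line in dump_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 4:
            continue  # skip records that do not fit the assumed layout
        vol, dump_date, tape_label, file_number = fields
        if vol == volume:
            matches.append((dump_date, tape_label, int(file_number)))
    # ISO dates sort correctly as strings, so the newest dump comes first.
    return sorted(matches, reverse=True)


# Made-up sample data in the assumed layout.
SAMPLE_DUMP = """\
# volume dump_date tape_label file_number
user.alice 1996-01-15 A00012 3
user.bob 1996-01-15 A00012 4
user.alice 1996-01-29 A00031 7
"""

if __name__ == "__main__":
    for dump_date, tape_label, file_number in find_dumps(SAMPLE_DUMP, "user.alice"):
        print(dump_date, tape_label, file_number)
```

The operator would then pass the tape label and file number of the
chosen dump to the test driver by hand.
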

We must accept the fact that when we are running in disaster mode, and
when we are loading tapes made in disaster mode into the database, we
are more vulnerable to accidental data loss than we are normally. To
minimize this risk, the test driver should not allow any tape
to be written unless it was just labeled. However, this is not
sufficient; we must still label the tape uniquely. Generating a
unique internal label is fairly easy. The internal tape ID will
contain a timestamp, precise to milliseconds, and possibly an
indicator that the tape was written in disaster mode. Since tapes
that are already labeled cannot be relabeled, this should prevent the
overwriting of any existing backups. Choosing a unique external label
is much more difficult, since we have no absolute way to determine
what external labels are in use. When labeling a tape in disaster
mode, the operator will be asked for an external label to use, and it
will not be checked to determine whether it is already in use. If
this tape is entered into the database after the system is restored,
and it has an external ID that conflicts with one already in the
system, the operator will be asked to change the external label on the
new tape.
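
The labeling rules above can be sketched in Python as follows. This is
only an illustration under stated assumptions: the "D-" prefix as the
disaster-mode indicator, the label layout, and all class and method
names are hypothetical, and "just labeled" is interpreted here as
"labeled by this driver during the current session".

```python
import time


class Tape:
    """Hypothetical in-memory stand-in for a physical tape."""

    def __init__(self):
        self.internal_label = None
        self.external_label = None
        self.files = []


class DisasterDriver:
    """Hypothetical slice of the test driver that enforces the labeling rules."""

    def __init__(self):
        # Internal labels created by this driver in the current session.
        self._labeled_here = set()

    @staticmethod
    def make_internal_label(now_ms=None):
        """Millisecond timestamp plus an assumed 'D-' disaster-mode marker."""
        if now_ms is None:
            now_ms = time.time_ns() // 1_000_000
        return f"D-{now_ms}"

    def label(self, tape, external_label, now_ms=None):
        # Tapes that are already labeled cannot be relabeled.
        if tape.internal_label is not None:
            raise ValueError("tape already labeled; refusing to relabel")
        tape.internal_label = self.make_internal_label(now_ms)
        # The external label is taken from the operator without any
        # uniqueness check, as described for disaster mode.
        tape.external_label = external_label
        self._labeled_here.add(tape.internal_label)

    def write(self, tape, data):
        # Only tapes labeled in this session may be written.
        if tape.internal_label not in self._labeled_here:
            raise PermissionError("tape was not labeled in this session")
        tape.files.append(data)
```

Because relabeling is refused and writing requires a fresh label, a
tape holding an existing backup can never reach `write` in this sketch.
Note that two tapes labeled within the same millisecond would receive
the same internal ID here.
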

For backups made in disaster mode to be really useful, there
must be a way to load them into the database once it is operating
again. To facilitate this, there will be a tool for loading
the output of a tape scan into the database. For convenience, when a
backup is done in disaster mode, the test driver will create a log of
what was put on the tape. This log will have the same format as the
output of a tape scan, so only one utility will be needed to deal with
both of them.
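
A minimal Python sketch of such a shared loader is given below. The
record layout (internal label, external label, file number, volume),
the dictionary used as a stand-in for the database, and all names are
assumptions made for illustration; the real scan format is not defined
in this section. The sketch also shows the external label conflict
described earlier being detected at load time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanRecord:
    internal_label: str
    external_label: str
    file_number: int
    volume: str


class ExternalLabelConflict(Exception):
    """The external label is already used by a different tape."""


def parse_scan(text):
    """Parse tape-scan output or a disaster-mode log (same assumed format)."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        internal, external, file_no, volume = line.split()
        records.append(ScanRecord(internal, external, int(file_no), volume))
    return records


def load_records(db, records):
    """Load records into db, a dict keyed by external label.

    Each value has the form {"internal": str, "files": {file_number: volume}}.
    """
    for rec in records:
        entry = db.get(rec.external_label)
        if entry is not None and entry["internal"] != rec.internal_label:
            # The operator must choose a new external label for this tape.
            raise ExternalLabelConflict(rec.external_label)
        if entry is None:
            entry = {"internal": rec.internal_label, "files": {}}
            db[rec.external_label] = entry
        entry["files"][rec.file_number] = rec.volume
```

Since the log and the scan output share one format, `parse_scan` and
`load_records` serve both cases, and a tape whose log has been lost
can still be loaded by scanning it.
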