[4] in Athena_Backup_System
Requirements document
From: jweiss@MIT.EDU
Date: Thu, 30 Jun 1994 16:35:14 -0400
To: athena-backup-mtg@menelaus.MIT.EDU
Athena Backup Requirements
Introduction
This document explains the requirements for a new Athena backup
system. The current system requires too much manual intervention and
a significant manpower investment to keep running. Perhaps most
importantly, the current system will require an approximately linear
increase in effort in order to add more backed-up on-line disk
storage. Parts of this document have been taken directly from Jeff
Schiller's notes.
This proposal articulates the requirements for a replacement system
that will scale better than the current system. The new system is also
designed to make use of staff with varying levels of skill and knowledge
in the course of operation. In other words, the people who will do most
of the day-to-day labor will not need to be either UNIX or AFS experts
to do their job.
Design Goals and Requirements
This proposal assumes that we have at least 4 distinct staff
roles involved in performing backups. These distinctions are:
1) System Operator
The System Operator(s) will run the daily backup. The system will
ask them to mount and store tapes by a unique label that appears
on the tape. System Operators will not need to be concerned about tape
errors (unless the tape physically gets entangled in the tape drive);
if a tape has too many errors, the system will automatically call for
a different tape and re-write the data from the first tape to it.
2) Retrieval Operator
The Retrieval Operator will be responsible for initiating the
technical commands to begin a retrieval request. The Retrieval
Operator will not need to know on which tape the required data is
stored. The system will determine this and request the appropriate
tape to be mounted.
3) Exception Handler
The Exception Handler will not operate the tape drives or otherwise
mount and store tapes. The Exception Handler will be a programming staff
member who receives reports from the backup system (perhaps via
electronic mail, zephyr, and/or syslog) that indicate how well the
system is operating. Specifically, the Exception Handler is informed
of all tape errors (perhaps summarized or abbreviated to include only
errors which are "significant").
It will be the responsibility of the Exception Handler to take the
appropriate corrective action when the system reports that a particular
tape is receiving too many errors. For example, the Exception Handler
may have to replace tapes or call in hardware service on the tape
drives.
The Exception Handler will be responsible for ensuring that the backup
system is operating properly. This includes ensuring that day-to-day
backups are indeed happening and are backing up the correct data.
4) The Wizard
Ideally this person will have nothing to do! However I include this
position with the recognition that at least at first, problems will
occur that require expert attention.
Although it is possible, and likely, that different individuals will
perform each of the above 4 roles, this is of course not necessary. The
tasks represent a logical grouping of responsibilities.
Vision
To better understand what this proposal is attempting to express, here
is the vision of how a daily backup will be performed.
The System Operator will log in to each tape dump server (either
remotely or locally) and initiate the backup by typing a single
command for each server (or drive). At this point the backup system
will determine what data needs to be backed up. After a period of
computation the System Operator will be prompted, by unique tape
label, for the first backup tape. The operator will then insert the
tape and tell the system he has done so. If the tape is not available,
the operator will be able to tell the system that; this information
should be passed on to the Exception Handler. The system will reject
incorrect tapes (see the section on Tape Labels).
When each tape is completed, the backup system will keep track of the
number of tape errors that occurred, and optionally make this
information available to the System Operator for informational
purposes. Note that if an error occurs, the system will retry at
least once before giving up on an entity. After each tape is written
the operator will be given the choice of starting another tape, or
leaving the drive idle.
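To make this flow concrete, here is a minimal sketch (in Python, purely
for illustration) of what the operator-facing loop might look like. The
hooks plan_backup, write_tape, and notify_exception_handler are
hypothetical stand-ins for backup-system internals, not part of this
proposal.

    # Illustrative only: plan_backup, write_tape and notify_exception_handler
    # are hypothetical hooks into the backup system, supplied by the caller.

    def run_daily_backup(drive, plan_backup, write_tape, notify_exception_handler):
        """One operator session on a single tape drive."""
        for tape_label, data in plan_backup():      # the system decides what to dump
            answer = input("Please mount tape %s (or type 'missing'): " % tape_label)
            if answer.strip().lower() == "missing":
                # Tape unavailable: note it for the Exception Handler and move on.
                notify_exception_handler("tape %s was not available" % tape_label)
                continue
            errors = write_tape(drive, tape_label, data)
            if errors:
                # Retry at least once before giving up on the entity.
                errors = write_tape(drive, tape_label, data)
                if errors:
                    notify_exception_handler("tape %s: %d errors" % (tape_label, errors))
            print("Tape %s finished (%d soft errors)." % (tape_label, errors))
            if input("Start another tape? [y/n] ").strip().lower() != "y":
                break                                # leave the drive idle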
A Word On Tape Labels
The current AFS (Transarc) Backup System reuses tape labels. A given
tape's label is a function of what backup it represents. Given that we
run the same backup script multiple times, we wind up with many tapes
that are magnetically labeled with the same tape volume id. It is up to
the person running backups to determine which tape should be used for a
backup. Put another way, it is up to the person running the backup to
decide when a tape is "old" and may safely be reused.
In this proposal every tape will have a unique magnetic label, which
will also be duplicated on the physical tape label. This label may be
re-used only if the first tape with a given label has been destroyed.
In addition there will be another tape ID that is magnetically written
on the tape, and stored in the system. This ID will be unique across
all tapes in a system, regardless of pool, tape type, or anything
else. This ID will NEVER be reused. The idea behind this dual id
system is that the former will make filing of the tapes easy, and the
latter will guarantee that the correct tape is in the drive. However,
no one will ever have to look at the second ID. The backup system
will maintain a database of all labeled tapes along with when they
were last written, and with what they were written.
The backup system itself will choose which tape will be overwritten
during a given backup and ask for it to be mounted. At tape mount
time the tape will be read to determine its label, and only if the label
matches the tape requested will it be written. A new tape that does
not contain any label (i.e., a blank tape) will also be accepted, provided
that the database knows that the tape has never been written on before.
There may be multiple groups or "pools" of tapes. In different pools,
there may be tapes with the same ID (this is the first ID discussed
above). This will allow support for multiple tape pools, which may be
useful in supporting anything from physically distant backup servers
to different types of tape media. However, such features may not be
supported in early versions of the system.
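Purely as an illustration of the dual-ID idea, the tape database record
and the mount-time check might look something like the following Python
sketch; all field and function names here are invented, not requirements.

    import itertools

    _next_unique_id = itertools.count(1)         # never reused, across all pools

    def new_tape(pool, filing_label):
        """Register a brand-new (blank) tape with the backup database."""
        return {
            "pool": pool,                        # e.g. separate pools per site or media type
            "filing_label": filing_label,        # printed on the tape; reusable only after the old tape is destroyed
            "unique_id": next(_next_unique_id),  # written magnetically; NEVER reused
            "last_written": None,
            "contents": [],
        }

    def ok_to_write(requested, id_read_from_drive):
        """Accept the mounted tape only if its magnetic ID matches the request,
        or if it is blank and the database says it was never written."""
        if id_read_from_drive is None:           # blank tape in the drive
            return requested["last_written"] is None
        return id_read_from_drive == requested["unique_id"]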
Major System Modules
The system will be broken down into several major modules. Each module
implements a separable function within the system. Some of these modules
can be replaced and updated as the AFS system itself is evolved (or is
replaced with DFS).
One of the system dependent functions will be the definition of a
"backed up object." A backed up object is an object that is backed up
as a whole. It is also an object which has a modification date which
can be checked to see if it is more recent than the current backed-up
version. In AFS the logical candidate for the backed up object is the AFS
volume.
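As a rough sketch, assuming an object-oriented implementation (which
this document does not require), the "backed up object" interface might
look like the following; the names are illustrative only.

    from abc import ABC, abstractmethod

    class BackedUpObject(ABC):
        """Something that is dumped as a whole and carries a modification date."""

        @abstractmethod
        def name(self) -> str: ...

        @abstractmethod
        def modification_time(self) -> float: ...   # seconds since the epoch

        def needs_backup(self, last_backup_time: float) -> bool:
            # Dump the entity again only if it changed since the last dump.
            return self.modification_time() > last_backup_time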
Interfaces
System operator: Since the system operator's role will be a simple one, it
will not be necessary to provide a complex interface. Therefore, the
interface will be a text-based, command-line one. One instance
of this interface will be capable of managing all tape drives on a
given tape dump server. This will allow the operator to walk up to
the (non-X) console of a tape dump server, and be able to manage all
of the drives on that machine. This interface will have to run as a
process on the tape dump server, although it will be possible for
the system operator to log into the tape dump server remotely to run
the process.
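A possible shape for that single per-server operator process, sketched
in Python with invented command names (start, status, quit); this is
only meant to illustrate the "one instance manages all drives" idea.

    def operator_shell(drives):
        """drives: dict mapping a drive name to a callable that runs one backup."""
        status = {name: "idle" for name in drives}
        while True:
            cmd = input("backup> ").split()
            if not cmd:
                continue
            if cmd[0] == "status":
                for name, state in status.items():
                    print("%-8s %s" % (name, state))
            elif cmd[0] == "start" and len(cmd) == 2 and cmd[1] in drives:
                status[cmd[1]] = "dumping"
                drives[cmd[1]]()                 # a real system would run this in the background
                status[cmd[1]] = "idle"
            elif cmd[0] == "quit":
                break
            else:
                print("commands: start <drive>, status, quit")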
Retrieval operator: The retrieval operator will have to make more
complicated requests (both informational and restoral) of the backup
system than the system operator will. Therefore, I expect that the
retrieval operator will need a slightly more sophisticated interface.
As such, the retrieval operator will have a curses-style interface.
This interface may be run on any client, and will use Kerberos to
authenticate the retrieval operator to the backup system. I foresee
this interface being very similar to the Moira client for the Moira
database system.
The retrieval operator will also need an interface that allows the
processing of many commands in one sitting. (For example, this will
be used if a mass restore of some sort is necessary.)
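One way the batch facility might look, again only as an illustration:
requests are read from a file, one entity per line, and failures are
collected rather than aborting the run. The restore_entity hook is a
hypothetical entry point into the backup system.

    def batch_restore(request_file, restore_entity):
        """Process many restore requests in one sitting, one entity per line."""
        failures = []
        with open(request_file) as f:
            for line in f:
                entity = line.strip()
                if not entity or entity.startswith("#"):
                    continue                     # skip blank lines and comments
                try:
                    restore_entity(entity)
                except Exception as err:         # collect failures, keep going
                    failures.append((entity, str(err)))
        return failures                          # reported back to the operator at the end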
Exception handler: The exception handler will interact with the
system on a similar level to the retrieval operator. However, the
former will need to be able to issue more queries than the latter.
Therefore, the exception handler will use the same interface as the
retrieval operator, except that more queries will be allowed.
Wizard: The wizard will write their own interface when they consider
emacs an insufficient interface to the system's database.
Queries
There are a number of queries that the retrieval operator and exception
handler will need to perform. The following queries are considered
mandatory (a rough sketch of how they might map onto a query interface
appears after the list):
Request a restore by entity
    Option to restore to a different entity name
Change the backup frequency for an entity type
Change the backup frequency for a particular entity
One-time change to when an entity can be backed up
Change an entity's type
Mark a tape out of service
    completely (as in "the tape has been used as a streamer" :-)
    don't write anything new on it
Add a tape to the pool
Set min & max tape pool count
Set conditions on when a tape can be reused (by entity type??): number
    of backups since, elapsed time...
Set/modify access rules (system ACLs)
Add an entity
Remove an entity
What tapes have this entity on them
Describe an entity (when it was last backed up ...)
Describe a tape
Number of errors on a particular drive over time
Number of errors on a particular tape over time
Next n entities to be dumped
Next n tapes to be used.
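Here is the rough sketch referred to above: one hypothetical way the
mandatory queries could map onto a client library for the backup
database. The method names and signatures are invented for illustration
and are not themselves requirements.

    class BackupQueries:
        # Restores
        def restore_entity(self, entity, restore_as=None): ...
        # Scheduling
        def set_frequency_for_type(self, entity_type, days_between_dumps): ...
        def set_frequency_for_entity(self, entity, days_between_dumps): ...
        def defer_backup(self, entity, until): ...            # one-time change
        def change_entity_type(self, entity, new_type): ...
        # Tape management
        def retire_tape(self, tape, destroyed=False): ...     # out of service
        def add_tape(self, pool, filing_label): ...
        def set_pool_limits(self, pool, minimum, maximum): ...
        def set_reuse_policy(self, entity_type, backups_since=None, min_age=None): ...
        # Access control
        def set_acl(self, acl_name, members): ...
        # Entities
        def add_entity(self, entity, entity_type): ...
        def remove_entity(self, entity): ...
        # Informational queries
        def tapes_holding(self, entity): ...
        def describe_entity(self, entity): ...
        def describe_tape(self, tape): ...
        def drive_error_history(self, drive): ...
        def tape_error_history(self, tape): ...
        def next_entities(self, n): ...
        def next_tapes(self, n): ...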
Optional Queries:
There are more queries that would be helpful to have, but are not
considered requirements.
Request a restore by disk
Request a restore by server
These queries would make recovery from a disk crash easier. However,
this is not a failure mode we see very often. As long as there is a
way to submit restore requests in batch, it is acceptable to require the
retrieval operator to generate the list of entities (e.g., with
"vos listvldb -server casio").
Make archive
It would be nice to be able to tell the system to make an archive of
the world. Making such an archive would otherwise have to be done by
hand. This would be a good candidate for the use of a separate tape
pool, if such pools become supported.
Describe state of the system
What tape drives are there, what tapes are loaded, what the drives are
doing to the tapes, etc.
Updating entity list
During the course of normal operation in our environment, new volumes
(entities) are created (eg. when a new user registers). It is
imperative that these volumes end up in the backup system
automatically. One possible way to do this is to have Moira add the
entity to the backup db when the entity is created, since the creation
is something that Moira handles as it is necessary, anyway. However,
Moira occasionally fails to do something because of some outage, or
for another reason. Since no one is likely to notice an entity
missing from the backup database, it is important that there is a
fool-proof way of insuring that all entities are in the backup db.
Therefore, there needs to be a way to accomplish this from the backup
system itself.
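As an illustration of what such a fool-proof reconciliation pass could
look like, here is a minimal sketch; the list_volumes and backup_db
hooks are invented for this sketch and not part of the requirement.

    def reconcile(list_volumes, backup_db, report):
        """Add any volume the backup database does not know about."""
        known = set(backup_db.all_entities())
        for volume in list_volumes():            # e.g. parsed "vos listvldb" output
            if volume not in known:
                backup_db.add_entity(volume)
                report("added missing entity %s to the backup db" % volume)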
Data Verification
Whenever backups are made it is desirable to ensure that the data was
written to tape correctly. However, it is impractical to re-read
every bit of data that the system writes, given the volume of data we
expect it to handle. The system will, however, spot test some of the
tapes.
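A sketch of one possible spot-testing strategy, assuming a checksum is
recorded at write time (that assumption is mine, not a stated
requirement); the field and function names are illustrative only.

    import random

    def spot_test(tapes_written, read_back_checksum, fraction=0.05):
        """Re-read a small random sample of the tapes written in a run."""
        sample = random.sample(tapes_written,
                               max(1, int(len(tapes_written) * fraction)))
        bad = []
        for tape in sample:
            if read_back_checksum(tape) != tape["checksum_at_write"]:
                bad.append(tape)                 # reported to the Exception Handler
        return bad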
Incremental Backup
There is no obvious way to make an incremental backup of an AFS
volume. However, at some point incrementals may be better defined for
AFS, or the backup system may be modified to work on some other type
of filesystem. Therefore, the ability to make incremental backups
will be designed into the backup system. However, that part of the
backup system may not be operational for the first release.
AFS Cloning
In addition to the concept of backups as tape dumps, AFS has the
concept of nightly cloning. Since this nightly cloning will affect
the volumes that are the entities being dumped to tape, there is the
possibility of a collision. In order to avoid such a collision, the
backup system will control the nightly cloning and coordinate it with
the other backups.
ACLs
There will be two ACLs in the system: one that allows the queries
necessary to perform retrieval operations, and one that allows the
queries necessary for error handling (the latter will probably
contain the former in some fashion). It will be unnecessary to
provide an ACL for the system operator, since this will be controlled
by who has access to and login permission on the tape dump servers.
There will, however, be a Kerberos-authenticated list of the tape dump
servers.
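A minimal sketch of the two-ACL arrangement; the principal names are
made up, and the containment of one ACL in the other is the only point
being illustrated.

    RETRIEVAL_ACL = {"retriever@ATHENA.MIT.EDU"}             # made-up principal names
    ERROR_ACL = {"handler@ATHENA.MIT.EDU"} | RETRIEVAL_ACL   # contains the former

    def may_query(principal, query_kind):
        if query_kind == "retrieval":
            return principal in RETRIEVAL_ACL or principal in ERROR_ACL
        if query_kind == "error_handling":
            return principal in ERROR_ACL
        return False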
Control (or what gets dumped when)
The system needs to have some way of specifying when (and how often)
things should be dumped, for each type of dump (i.e., full,
incremental). These rules should allow consideration of the following
issues:
entity type
number of entities being dumped from a given server
subnet
special considerations for specific entities
Optionally (based on the time required for implementation), there may be
an option to prevent certain types of backups from occurring on
certain entities during certain times of day.
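Purely as an illustration of what such rules might look like, here is
one possible format, sketched in Python; the entity types, entity name,
and concrete numbers are invented for this sketch and are not proposed
values.

    SCHEDULE_RULES = {                        # keyed by entity type
        "user":    {"full_every_days": 28, "incremental_every_days": 1},
        "project": {"full_every_days": 14, "incremental_every_days": 1},
    }

    PER_ENTITY_OVERRIDES = {                  # special considerations for specific entities
        "user.example": {"full_every_days": 7},
    }

    MAX_ENTITIES_PER_SERVER_PER_NIGHT = 200   # limit the load on any one server

    BLACKOUT_HOURS = {                        # the optional time-of-day restriction
        "project": range(9, 17),
    }

    def full_dump_interval(entity, entity_type):
        """Per-entity overrides take precedence over the type-wide rule."""
        rule = dict(SCHEDULE_RULES[entity_type])
        rule.update(PER_ENTITY_OVERRIDES.get(entity, {}))
        return rule["full_every_days"]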