[152] in Athena_Backup_System
system requirements
daemon@ATHENA.MIT.EDU (Diane Delgado)
Mon Dec 18 15:01:46 1995
To: athena-backup@MIT.EDU
Date: Mon, 18 Dec 1995 15:01:37 EST
From: Diane Delgado <delgado@MIT.EDU>
We all need to ask ourselves, especially Brian and Dave: if we
were to receive a backup system whose features were only those
listed in the requirements list, would we be happy with it?
(Assume you know nothing else about the system and have only the
list to go by.) My feeling is that the list is a good start for
us to begin writing this down, but there's a lot more which needs
to be stated.
<1) The system must work in a crippled fashion
We need to state specifically that it should support restore
operations without the master available, e.g., via use of a
simple test client with a test master. We also need to write
down what type of handwork is acceptable for the operations
staff to perform when running in crippled mode (i.e., where are
they getting the tape-to-volume information from, etc.).
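For illustration only, here is the sort of crippled-mode lookup I
have in mind. It assumes a flat-file copy of the tape-to-volume
map (one "tape-label volume-name" pair per line, written out
periodically by the master); the file name and format are
invented for the example:

    /* cripple_lookup.c - find which tape holds a volume when the
     * master db is unavailable.  Assumes a periodically dumped
     * flat file of "tape-label volume-name" pairs (hypothetical
     * format). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char tape[64], vol[64];
        FILE *fp;
        int found = 0;

        if (argc != 3) {
            fprintf(stderr, "usage: %s mapfile volume\n", argv[0]);
            exit(1);
        }
        if ((fp = fopen(argv[1], "r")) == NULL) {
            perror(argv[1]);
            exit(1);
        }
        while (fscanf(fp, "%63s %63s", tape, vol) == 2) {
            if (strcmp(vol, argv[2]) == 0) {
                printf("volume %s is on tape %s\n", vol, tape);
                found = 1;
            }
        }
        fclose(fp);
        return found ? 0 : 1;
    }

The point is not this particular program, but that the
requirement should name a concrete source for the tape-to-volume
mapping (a flat file, a printout, whatever) which exists when the
master does not.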
<5) The system shall require no more action than the current system
This will need more discussion to produce very specific
requirements, since the initial setup time for the system will
likely be greater because of its greater flexibility; daily
operations, however, should not require a lot of work.
I'd like to see a list of common operations (e.g. backup/restore)
accompanied by what is acceptable for each.
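For instance, something like the following (the entries and
limits here are placeholders to show the form, not proposals):

    operation                 acceptable operator action
    ---------                 --------------------------
    nightly incremental dump  mount tapes when prompted; nothing else
    full dump                 load tape set, start job, check status
    single-volume restore     one command, given volume name and date
    full-server restore       one command per partition, plus mounts

Each common operation would then have an explicit ceiling on the
handwork it may require, which is something we can actually test
against.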
<7) Is there a req for the backup cycle time
This is largely an operational policy which development should not
be held accountable for. It is dependent upon the level of the
dumps, the amount of data in the cell, the number of tape drives
available to the backup system, and the number of hours per day
the backup system is allowed to run.
I'd rather that we not generate requirements which are
dependent on vague and changing parameters such as "the size of the
cell" since the cell may grow a lot or the hardware configuration
may change between the time we write this and the time the system
enters production.
I propose that we consider some requirement whose constraints
are invariant, such as "dumps x gigabytes in time t using a
device of type z". I haven't got a good feel for what "t" or
"x" are.
We should also specify that the test be executed under reasonable
conditions, such as servers being available and not unduly overloaded.
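To make the form of such a requirement concrete (the numbers here
are invented, not measured): a device that sustains roughly
1 MB/sec writes about 3.5 gigabytes per hour, so a requirement of
the form "dump 35 gigabytes in 10 hours on one such device" is
testable no matter how big the cell happens to be that year. Two
drives running in parallel should halve t, which is the sort of
scaling claim we can actually verify.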
<8) req for performance
There should be one - see #7. (7) can also be enhanced to do
some scalability testing to measure performance as the number of
devices increases, and also to take performance measurements
with multiple slaves on a single machine.
Performance should include master db operations, as well
as backup and restore.
A measurable performance requirement will also help the
administrators get a feel for when it's time to do some
routine maintenance on the ABS, such as a db index rebuild,
since they will then have a baseline for comparison.
Before we agree to anything regarding performance, we need
to determine what is achievable.
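As a sketch of how cheap a baseline measurement could be (the
program and its usage are invented for illustration), a small
wrapper that times a dump command and reports throughput would
be enough to start collecting numbers:

    /* timer.c - run a command, report elapsed time and MB/sec.
     * usage: timer megabytes command [args...]   (hypothetical) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        time_t start, stop;
        double mb, secs;
        int status;
        pid_t pid;

        if (argc < 3) {
            fprintf(stderr, "usage: %s megabytes cmd [args]\n",
                    argv[0]);
            exit(1);
        }
        mb = atof(argv[1]);
        time(&start);
        if ((pid = fork()) == 0) {
            execvp(argv[2], &argv[2]);
            perror(argv[2]);
            _exit(127);
        }
        waitpid(pid, &status, 0);
        time(&stop);
        secs = difftime(stop, start);
        printf("%.0f sec elapsed, %.2f MB/sec\n",
               secs, secs > 0.0 ? mb / secs : 0.0);
        return 0;
    }

Run the same volume set through it as devices and slaves are
added, and the scalability numbers for (7) and (8) fall out of
the comparison.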
<9) How big a universe should it deal with
Yes, we should support the size of the current cell. I am not
certain I favor the idea of trying to claim or show that it
will deal with 5 times that amount. This is mostly a practicality
issue - how will we create a cell with 1,000 gig if we don't
have the hardware resources? Yes, I do understand that the system
isn't much good if it won't accommodate growth, but how do
we demonstrate or prove that it will?
<11) The system shall have the following documents: Programmer ref, Op guide,
Admin guide, etc.
We should specify what kind of information is to be contained
within these documents.
For example, we probably want to specify that the Admin guide
should contain the sections which Brian and Dave have already
outlined.
<13) The system will deal with media errors on write.
This needs to be very specific as to the course of action to be
taken, since I really can't see what else we can do except
report an error. Do you want it to ask for a new tape and
restart the dump automatically?
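To make the question concrete, the choice is roughly between the
two policies sketched below. Every name in this sketch is
invented; it is not a design:

    /* media_error.c - the two possible write-error policies.
     * All functions here are illustrative stubs. */
    #include <stdio.h>

    enum { OK, MEDIA_ERROR };

    static int write_volume(const char *tape, const char *vol)
    {
        printf("writing %s to tape %s\n", vol, tape);
        return MEDIA_ERROR;            /* pretend the write failed */
    }
    static void mark_tape_bad(const char *tape)
    {
        printf("tape %s marked bad, not reused\n", tape);
    }
    static const char *request_new_tape(void)
    {
        printf("operator: please mount a fresh tape\n");
        return "FRESH-0001";
    }

    int main(void)
    {
        const char *tape = "ABS-0042";

        if (write_volume(tape, "user.delgado") == MEDIA_ERROR) {
            mark_tape_bad(tape);
    #ifdef AUTO_RESTART
            /* policy A: demand a new tape, redo this volume */
            tape = request_new_tape();
            write_volume(tape, "user.delgado");
    #else
            /* policy B: report the error, fail this volume */
            fprintf(stderr, "dump of user.delgado failed\n");
    #endif
        }
        return 0;
    }

Whichever branch we require, the requirement should say so
explicitly rather than just "deal with media errors".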
ADD the following (many of these are things which were incorporated
into the design based on what team members said was needed over the
past year):
- the system supports the notion of media pools
- the system shall disallow writing on tapes which contain valid data
- the external tape label is unique within the entire ABS (we
decided this last week, which is different from what the original
requirements state)
- volumes are not allowed to span tapes.
- slave must log critical errors to syslog and console
- slave must log mount requests to console
- error/message routing destinations and delivery methods
must be configurable
- system must support the ability to retry volumes which failed to dump
- system must not fail the entire dump if only one volume fails
to dump; the minimum number of failed volumes which causes the
dump to fail shall be configurable
- system must support notion of authorization levels for
administrators and operators
- system must use Kerberos authentication to identify its clients.
- slave shall save its job results to disk if it cannot contact
the master.
- slave shall retry "important" calls indefinitely. These calls
include: job_done, validate_media, job_attention. (A sketch of
this behavior follows this list.)
- system must provide the ability to mark known slave devices as
"unavailable" for backup system purposes
- system must support ability to reuse tapes
- system must support a reuse ceiling on tapes which is configurable
on a media pool basis.
- system must support ability to easily update its db with information
for new AFS partitions and volumes when they are added to a server.
- system shall not rely on moira for any aspect of its operation.
- volumes which are members of the same dumpset must all be the same
filesystem type.
- The system will not abort if it detects any database record (data)
inconsistencies. Under such conditions, the system shall fail
the operation which manifested the problem, and log an error.
- Detected database inconsistencies shall be logged with sufficient
information so that the administrator can correct the problem.
- The system shall maintain state information regarding slave-related
operations which are executing in the system. This state shall
persist across restarts of the master component.
- this list is not exhaustive; I am sure there are many more.
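To pin down what the two slave items above mean by "save its job
results to disk" and "retry indefinitely", here is a rough
illustration; every name in it (abs_job_done, SPOOL_DIR, the
retry interval) is invented:

    /* spool_retry.c - slave spools a job result to disk, then
     * retries delivery to the master until it succeeds. */
    #include <stdio.h>
    #include <unistd.h>

    #define SPOOL_DIR   "/var/spool/abs"    /* hypothetical */
    #define RETRY_SECS  60

    /* Stub for the real RPC (job_done etc.); pretend the master
     * is unreachable for the first two attempts. */
    static int abs_job_done(const char *result)
    {
        static int attempts;
        return (++attempts < 3) ? -1 : 0;
    }

    static void spool_result(const char *jobid, const char *result)
    {
        char path[256];
        FILE *fp;

        sprintf(path, "%s/%s.result", SPOOL_DIR, jobid);
        if ((fp = fopen(path, "w")) != NULL) {
            fputs(result, fp);
            fclose(fp);         /* survives a slave restart */
        }
    }

    int main(void)
    {
        const char *result = "job 42: dumped 17 volumes, 0 failed";

        spool_result("42", result);          /* disk copy first */
        while (abs_job_done(result) != 0) {  /* then retry forever */
            fprintf(stderr, "master unreachable, retry in %d sec\n",
                    RETRY_SECS);
            sleep(RETRY_SECS);
        }
        return 0;
    }

The requirement should also state the spool location, what
happens to spooled results when the master comes back, and
whether the retry interval is configurable.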