[152] in Athena_Backup_System


system requirements

daemon@ATHENA.MIT.EDU (Diane Delgado)
Mon Dec 18 15:01:46 1995

To: athena-backup@MIT.EDU
Date: Mon, 18 Dec 1995 15:01:37 EST
From: Diane Delgado <delgado@MIT.EDU>

We all need to ask ourselves, especially Brian and Dave: if we
were to receive a backup system whose features were only those listed
in the requirements list, would we be happy with it?  (Assume you
know nothing else about the system and have only the list to
go by.)  My feeling is that the list is a good start for us
to begin writing this down, but there's a lot more which needs
to be stated.




1) The system must work in a crippled fashion

      need to state specifically that it should support restore operations
      without the master available, via use of a simple test client
      with a test master.  We also need to write down what type
      of handwork is acceptable for the operations staff to perform
      when running in crippled mode (i.e., where they are getting the
      tape-to-volume information from, etc.)



<5) The system shall require no more action than the current system


    This will need more discussion to produce very specific requirements,
    since the initial setup time for the system will likely be greater
    because of its greater flexibility; daily operations, however, should
    not require a lot of work.  I'd like to see a list of common
    operations (e.g., backup/restore), each accompanied by what level of
    effort is acceptable.

    

<7) Is there a requirement for the backup cycle time?
    
    This is largely an operational policy for which development should not
    be held accountable.  It is dependent upon the level of the
    dumps, the amount of data in the cell, the number of tape drives
    available to the backup system, and the number of hours per day
    the backup system is allowed to run.

    I'd rather that we not generate requirements which are
    dependent on vague and changing parameters such as "the size of the 
    cell" since the cell may grow a lot or the hardware configuration
    may change between the time we write this and the time the system 
    enters production.

    I propose that we consider some requirement whose constraints
    are invariant such as "dumps x gigabytes in time t" using a device
    of type "z". I haven't got a good feel for what "t" or "x" are. 
    We should also specify that the test be executed under reasonable
    conditions, such as servers being available and not unduly overloaded.
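
    To make that concrete, the kind of invariant check I have in mind
    could be sketched roughly as follows (Python used purely for
    illustration; the function names and the numbers in the example are
    placeholders, not proposed values for "x" or "t"):

```python
# Illustrative sketch of an invariant performance requirement:
# "dumps x gigabytes in time t using a device of type z".
# measure_dump_rate and meets_requirement are hypothetical names.

import time

def measure_dump_rate(dump_fn, gigabytes):
    """Run dump_fn (which dumps `gigabytes` GB) and return GB/hour."""
    start = time.time()
    dump_fn()
    hours = (time.time() - start) / 3600.0
    return gigabytes / hours

def meets_requirement(rate_gb_per_hour, required_gb, allowed_hours):
    # Pass if, at the measured rate, the required amount of data
    # would fit within the allowed window.
    return rate_gb_per_hour * allowed_hours >= required_gb
```

    The point is that the pass/fail criterion depends only on the fixed
    numbers we choose, not on the size of the cell at test time.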
    



<8) req for performance

    There should be one - see #7.  (7) can also be enhanced to
    do some scalability testing to measure performance as
    the number of devices increases, and also performance
    measurements with multiple slaves on a single machine.
    Performance should include master db operations, as well
    as backup and restore.

    A measurable performance requirement will also help the
    administrators get a feel for when it's time to do some
    routine maintenance on the ABS, such as a db index rebuild,
    since they will then have a baseline for comparison.

    Before we agree to anything regarding performance, we need
    to determine what is achievable.


<9) How big a universe should it deal with

    Yes, we should support the size of the current cell.  I am not
    certain I favor the idea of trying to claim or show that it
    will deal with five times that amount.  This is mostly a practicality
    issue: how will we create a cell with 1,000 gigabytes if we don't
    have the hardware resources?  Yes, I do understand that the system
    isn't much good if it won't accommodate growth, but how do
    we demonstrate or prove that it will?


<11) The system shall have the following documents: Programmer ref, Op guide,
Admin guide, etc.
  
     We should specify what kind of information is to be contained
     within these documents.

     For example, we probably want to specify that the Admin guide
     should contain the sections which Brian and Dave have already
     outlined.


<13) The system will deal with media errors on write.

    This needs to be very specific as to the course of action to be taken,
    since I really can't see what else we can do except report an error.
    Do you want it to ask for a new tape and restart the dump
    automatically?
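
    If we do decide we want automatic recovery, one possible course of
    action could be sketched like this (illustrative Python only;
    MediaWriteError, request_new_tape, write_dump, and the tape limit
    are made-up names and values, not part of any existing interface):

```python
# One possible interpretation of "deal with media errors on write":
# on a write error, report it, request a fresh tape, and restart the
# dump, giving up after a bounded number of consecutive bad tapes.

class MediaWriteError(Exception):
    pass

def dump_with_media_retry(write_dump, request_new_tape, max_tapes=3):
    """Retry the dump on a fresh tape after a write error, up to a limit."""
    for attempt in range(max_tapes):
        tape = request_new_tape()
        try:
            write_dump(tape)
            return tape            # dump completed on this tape
        except MediaWriteError:
            pass                   # report error, mark tape bad, try next
    raise RuntimeError("dump failed on %d consecutive tapes" % max_tapes)
```

    Whether the operator is prompted for each new tape, and how many
    retries are allowed, would themselves need to be requirements.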



ADD the following (many of these are things which were incorporated
into the design based on what team members said was needed over the
past year):

  - the system supports the notion of media pools

  - the system shall disallow writing on tapes which contain valid data

  - the external tape label is unique within the entire abs (we decided
    this last week which is different from what the original requirements
    state)

  - volumes are not allowed to span tapes.

  - slave must log critical errors to syslog and console

  - slave must log mount requests to console

  - error/message routing destinations and delivery methods
     must be configurable

  - system must support the ability to retry volumes which failed to dump

  - system must not fail the entire dump if only one volume fails to dump;
    the minimum number of failed volumes which causes the dump to fail
    shall be configurable
  
  - system must support notion of authorization levels for
     administrators and operators

  - system must use kerberos authentication to identify its clients.

  - slave shall save its job results to disk if it cannot contact
    the master.

  - slave shall retry "important" calls indefinitely.  These calls
    include: job_done, validate_media, job_attention.

  - system must provide the ability to mark known slave devices as
    "unavailable" for backup system purposes
 
  - system must support ability to reuse tapes

  - system must support a reuse ceiling on tapes which is configurable
    on a media pool basis.

  - system must support ability to easily update its db with information
    for new afs partitions and volumes when they are added to a server.

  - system shall not rely on moira for any aspect of its operation.

  - volumes which are members of the same dumpset must all be the same 
    filesystem type.

  - The system will not abort if it detects any database record (data)
    inconsistencies.  Under such conditions, the system shall fail
    the operation which manifested the problem, and log an error.

  - Detected database inconsistencies shall be logged with sufficient
    information so that the administrator can correct the problem.

  - The system shall maintain state information regarding slave-related
    operations which are executing in the system.  This state shall
    persist across restarts of the master component.


  - this list is not exhaustive; I am sure there are many more.
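
  As an illustration of the retry and failure-threshold items above,
  the intended per-run behavior might be sketched as follows (Python
  for illustration only; run_dump, dump_volume, and the default
  threshold and retry count are all placeholders):

```python
# Sketch of: retry volumes which failed to dump, do not fail the whole
# dump for one bad volume, and make the abort threshold configurable.

def run_dump(volumes, dump_volume, max_failures=5, retries=1):
    """Dump each volume, retrying failures; abort only past the threshold."""
    failed = []
    for vol in volumes:
        ok = False
        for _ in range(1 + retries):
            try:
                dump_volume(vol)
                ok = True
                break
            except Exception:
                continue           # retry this volume
        if not ok:
            failed.append(vol)
            if len(failed) >= max_failures:
                raise RuntimeError("too many failed volumes: %s" % failed)
    return failed   # volumes that never dumped: reported, not fatal
```

  The key property is that the run continues past individual volume
  failures and only aborts once the configurable threshold is reached.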
