[255] in Athena_Backup_System


Revised requirements

daemon@ATHENA.MIT.EDU (Bill Cattey)
Mon Jun 10 20:28:33 1996

Date: Mon, 10 Jun 1996 20:28:21 -0400 (EDT)
From: Bill Cattey <wdc@MIT.EDU>
To: athena-backup@MIT.EDU

I've made a pass through the requirements to incorporate response to
issues raised in the recent review with Jeff, Tom, Paul, and Matt.  This
note tells what changes I made, and what questions I have about the
changes I made.

-> 0. I indented things a little more consistently so sections with two digit
requirement numbers would line up properly.  Use diff -w to ignore this.

-> 1. Under DESIGN CRITERIA:

       j) set/modify access control.

becomes

       j) Allow authorized administrators to add to / remove from
               the list of authorized administrators and the
               list of authorized operators.

Remember when we discussed this and some people asked "what do you
mean by access control?"  Well, others asked.  This wording states what
the access control really is.  Did I get it right?
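
As a rough illustration of the revised wording of (j), here is a minimal sketch in Python. The class `AccessLists`, its method names, and the list names are hypothetical and chosen for illustration only; nothing here reflects the actual ABS design.

```python
class AccessLists:
    """Two lists: authorized administrators and authorized operators."""

    def __init__(self, administrators, operators=()):
        self.lists = {
            "administrators": set(administrators),
            "operators": set(operators),
        }

    def _require_admin(self, actor):
        # Only someone already on the administrator list may change a list.
        if actor not in self.lists["administrators"]:
            raise PermissionError(f"{actor} is not an authorized administrator")

    def add(self, actor, list_name, user):
        """Add user to the named list; actor must be an administrator."""
        self._require_admin(actor)
        self.lists[list_name].add(user)

    def remove(self, actor, list_name, user):
        """Remove user from the named list; actor must be an administrator."""
        self._require_admin(actor)
        self.lists[list_name].discard(user)
```

The sketch does not decide whether the last administrator may be removed; the requirement as worded is silent on that point.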

-> 2. new section ENVIRONMENTAL IMPACT

 ENVIRONMENTAL IMPACT
 
 1)  The system will no more seriously impact the operation of the file servers
     than the existing backup system.
 
As I mentioned in the ABS meeting of 6 June, Matt expressed concern over
the impact on the cell from how we synchronized our database with the
list of volumes on a server.  Does this requirement, as I have stated
it, seem like a requirement we can meet?

-> 3. under DATABASE with regards to data inconsistencies:

 5) The system will not abort if it detects any data base record (data)
    inconsistencies.  Under such conditions, the system shall fail
    the operation which manifested the problem, and log an error.

becomes

 5)  The system will not crash if it detects any data base record (data)
     inconsistencies.  Under such conditions, the system shall abort
     the operation which manifested the problem, and log an error.
     Care will be taken to properly categorize fatal and non-fatal errors.

Readers were misinterpreting the requirement as stated and worried that
we would write a system that mindlessly said OOPS and kept going until it
hurt itself.  I believe this wording captures the intent in a less
ambiguous way.
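
One possible reading of the revised requirement 5, as a minimal Python sketch: the operation that hits inconsistent data is aborted and an error is logged, while the system as a whole keeps running, and fatal errors are kept in a separate category. The names `InconsistencyError`, `FatalError`, `run_operation`, and `dump_volume` are hypothetical, and the consistency check is invented for the example.

```python
import logging

log = logging.getLogger("abs.sketch")


class InconsistencyError(Exception):
    """Non-fatal: a database record failed a consistency check."""


class FatalError(Exception):
    """Fatal: the system cannot safely continue."""


def run_operation(operation, *args):
    """Run one operation; abort it (not the system) on inconsistent data.

    Returns True on success, False if the operation was aborted.
    FatalError is deliberately not caught, so it propagates to the caller.
    """
    try:
        operation(*args)
        return True
    except InconsistencyError as exc:
        log.error("operation %s aborted: %s", operation.__name__, exc)
        return False


def dump_volume(record):
    # Hypothetical check: a record must name the volume it describes.
    if not record.get("volume"):
        raise InconsistencyError("record has no volume name")
```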

-> 4. under DATABASE with regards to recovery:

  8) In the event of a database disk failure, the Procedures documented
    Oracle Systems Administrator's guide will be followed for system
    recovery.
  NOTE:  It is a configuration requirement that Oracle operate in
    "Archive Log" Mode for this recovery to restore up to the
    the last committed transaction, rather than the last full
    backup of the Oracle Database.
  
becomes

  8)  In the event of a database disk failure, the Procedures documented
     in the Database Systems Administrator's guide will be followed for
     system recovery.
  NOTE:  It is a configuration requirement that the database operate in
     "Archive Log" Mode for recovery to restore up to the last
     committed transaction, rather than the last full backup of
     the Database.
  
It was pointed out that the requirements should not stipulate a particular
database.  Even though the team regards Oracle as the best database for
the job, the requirements can be stated more generically and this
strengthens the requirements.  Recognizing that the Oracle methods are
the best, I have (subject to the team's approval) added this:

 10) Before putting the system into production, the recovery procedures
     documented in the Database Systems Administrator's guide will be
     tested to confirm they work.  If the procedures do not work, additional
     recovery procedures will be developed, documented, and tested.

This will be a matter of testing where others never tested before.  I am
not going to allow this to turn into an exercise in being forced to
write new procedures where existing ones are sufficient.

-> 5. under SYSTEM with regards to downtime.

 4)  The system shall be back in degraded service within one hour from the
     time that operations notices it is down.  (See CRIPPLED MODE below.)
  
 5)  The system shall be back in full service within 48 hours from the
     time that operations notices it is down.
  
These downtime requirements were not stated explicitly, but they seemed
to be close to what the customers were saying they wanted and what I felt
we could provide.  The one hour to crippled mode came from asking Matt how
long he felt was acceptable (he said two hours), then sanity checking it
with Jonathon (who felt we could probably do it within an hour), and then
my sense that Roger and Susan would feel good about one hour.

The main system downtime number came from two directions:  Matt
suggested we express it in terms of a dump cycle.  It would be good, he
felt, if we didn't fall behind if we went to crippled mode.  Both Jeff
and Matt pointed out that if the system was successful, expectations
would rise.  From all that, I got the idea that dump cycles would be a
week or less, and that 48 hours felt like something I could sell people.
(At various times Jeff seemed to sense I was trying to sell him on 48
hours or less of downtime INSTEAD of an abstraction layer, and he would
have none of it.)

-> 6. CRIPPLED MODE is now read / write

 2)  Crippled mode does not offer full functionality.  Restore only is
     acceptable.

becomes

 2)  The system shall be able to perform dumps of limited functionality
     when the database or Master is offline through the use of a Crippled
     Mode user interface client.  At minimum, the Crippled mode dump
     command will prompt for a server name, and dump all partitions and
     volumes from the named server.

This incorporates the stated requirements as the reviewers saw them, and
seems eminently doable from our standpoint.  I think I have this right.

-> 7. CRIPPLED MODE wordsmithing the volume/tape table requirement:

 5)  The correlation for the tape to volume information and tape to
     partition information will be provided by an ASCII report that is
     produced as part of periodic maintenance procedures.

becomes

 5)  The correlation information of the tape to volume and tape to
     partition required for restores will be provided by an ASCII 
     report that is produced as part of periodic maintenance
     procedures.

I felt this wording was a little easier to understand.

-> 8.  CRIPPLED MODE -- how many slaves?

 6)  A Crippled Mode interface client need only handle a single media
     slave.
  
 7)  It is acceptable that Crippled Mode limit the number of slaves
     which can run simultaneously on a single host.

becomes

 6)  A Crippled Mode interface client should be able to drive as many
     media slaves as the normal mode Master.
  
This wording is clearer and simpler, and I believe it was our implicit
consensus that this was possible.

-> 9. CRIPPLED MODE -- a requirement if we do better than the minimum on dumps.

 7)  In the event that dump service of finer granularity than a whole
     server (i.e. a partition or collection of volumes) is provided, the
     mechanism developed for determining which volumes to dump will take
     care to notice volumes created on a file server subsequent to the
     crash of the Master.
  
The minimum dump requirement was stated, but another requirement was
stated if we go beyond the minimum:  "Don't miss newly created users!".
I see no problem here.
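
To make the "don't miss newly created users" point concrete, here is a small Python sketch. It assumes two inputs that the requirements do not specify in detail: the volume names known from the last report produced before the Master went down, and the volume names the file server reports now. The function name `volumes_to_dump` is hypothetical.

```python
def volumes_to_dump(known_volumes, volumes_on_server):
    """Return (to_dump, newly_created) for one file server.

    known_volumes: names from the last report made before the Master went down.
    volumes_on_server: names the file server reports right now.
    """
    current = set(volumes_on_server)
    newly_created = current - set(known_volumes)
    # Dump from the live list, so nothing created after the crash is skipped.
    return sorted(current), sorted(newly_created)
```

The design choice in this sketch is to treat the server's live list as authoritative and use the stale list only to flag what is new.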

-> 10. CRIPPLED MODE -- The tape must be blank.

 8)  Crippled mode dumps will NEVER write to a tape with a pre-existing
     ABS backup label.  Crippled mode dumps will look for a label
     and will only continue if no label is found.
  
Jeff says crippled mode should only write onto blank tapes.  He doesn't
even want a utility to make tapes pretend they're blank.  We can provide
this without a problem.
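
The label check in requirement 8 could be sketched as follows in Python. The tape access itself is left abstract: `read_label` and `write_dump` are hypothetical callables supplied by the caller, and `LabeledTapeError` is a name made up for this example.

```python
class LabeledTapeError(Exception):
    """Raised when a tape already carries a label."""


def crippled_mode_write(read_label, write_dump):
    """Write a dump only if the tape has no pre-existing label.

    read_label: callable returning the tape's label, or None for a blank tape.
    write_dump: callable that performs the actual write.
    """
    label = read_label()
    if label is not None:
        # Never overwrite; there is intentionally no "force" option.
        raise LabeledTapeError(f"refusing to overwrite labeled tape: {label!r}")
    write_dump()
```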

-> 11. CRIPPLED MODE -- We should keep track of what we dump.

 9)  Crippled mode dumps will record status information in a format suitable
     to update the Master database with accurate records of volume dump dates
     when it comes back online.

 10) A utility will be written and tested which will update the master database
     with the logging output of the crippled mode dump.
  
In the meeting there was no firm statement of what to do with the dump
status of volumes dumped during crippled mode.  I think all parties
agree that SOMETHING should go back into the database when it comes back
up.  In my ignorance about how we keep status, I guessed that if we kept
a list of volumes dumped with their dump date, that would be sufficient.
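
If a list of volumes with dump dates does turn out to be sufficient, requirements 9 and 10 might be sketched like this in Python. The record format (volume name, a tab, an ISO 8601 timestamp) is an assumption made for this example, as are the function names, and a plain dictionary stands in for the Master database.

```python
from datetime import datetime


def format_record(volume, dumped_at):
    """One log line per dumped volume: name, tab, ISO 8601 timestamp."""
    return f"{volume}\t{dumped_at.isoformat()}"


def apply_log(lines, dump_dates):
    """Fold crippled-mode log lines into a {volume: last_dump} mapping.

    Only moves a volume's date forward, so replaying a log is harmless.
    """
    for line in lines:
        volume, stamp = line.rstrip("\n").split("\t")
        dumped_at = datetime.fromisoformat(stamp)
        if volume not in dump_dates or dumped_at > dump_dates[volume]:
            dump_dates[volume] = dumped_at
    return dump_dates
```

Partition level dates, if they are needed, would require an extra field in each record; the sketch tracks volumes only.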
 

Diane, Brian:  Could you correct me on what status should be retained?
Diane: Could you correct me if I've stated something that isn't done
with the present system?  I.e., do we keep partition level dates too?

-> 12.  CRIPPLED MODE error message requirement moved:

 11) Crippled Mode should provide a means to funnel all error messages
     and event notifications from all components through a single channel.
 
-> 13. CRIPPLED MODE performance:

 12) Crippled Mode performance should be no worse than the performance of the
     dump system being replaced by ABS.

An explicit requirement was not stated, but the thinking seemed to be
that if crippled mode was no worse than what we have now, AND NO BETTER,
it would be ok.

