[255] in Athena_Backup_System
Revised requirements
daemon@ATHENA.MIT.EDU (Bill Cattey)
Mon Jun 10 20:28:33 1996
Date: Mon, 10 Jun 1996 20:28:21 -0400 (EDT)
From: Bill Cattey <wdc@MIT.EDU>
To: athena-backup@MIT.EDU
I've made a pass through the requirements to incorporate response to
issues raised in the recent review with Jeff, Tom, Paul, and Matt. This
note tells what changes I made, and what questions I have about the
changes I made.
-> 0. I indented things a little more consistently so sections with two digit
requirement numbers would line up properly. Use diff -w to ignore this.
-> 1. Under DESIGN CRITERIA:
j) set/modify access control.
becomes
j) Allow authorized administrators to add to / remove from
the list of authorized administrators and the
list of authorized operators.
Remember when we discussed this and some people asked "what do you
mean by access control?" Well, others asked. This wording states what
the access control really is. Did I get it right?
-> 2. new section ENVIRONMENTAL IMPACT
ENVIRONMENTAL IMPACT
1) The system will no more seriously impact the operation of the file servers
than the existing backup system.
As I mentioned in the ABS meeting of 6 June, Matt expressed concern over
the impact on the cell from how we synchronized our database with the
list of volumes on a server. Does this requirement, as I have stated
it, seem like a requirement we can meet?
-> 3. under DATABASE with regards to data inconsistencies:
5) The system will not abort if it detects any data base record (data)
inconsistencies. Under such conditions, the system shall fail
the operation which manifested the problem, and log an error.
becomes
5) The system will not crash if it detects any data base record (data)
inconsistencies. Under such conditions, the system shall abort
the operation which manifested the problem, and log an error.
Care will be taken to properly categorize fatal and non fatal errors.
Readers were misinterpreting the requirement as stated and got worried
that we would write a system that mindlessly said OOPS and kept going
until it hurt itself. I believe this wording captures the intent in a
less ambiguous way.
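A minimal sketch of the abort-and-log behavior described in the revised requirement 5, assuming a Python implementation. The exception classes, the logger name, and the lookup_dump_date operation are all hypothetical and exist only for illustration; the requirements do not name any of them.

```python
import logging

log = logging.getLogger("abs.database")  # hypothetical logger name


class InconsistentRecord(Exception):
    """Non-fatal: one record is bad, so only the current operation is aborted."""


class FatalDatabaseError(Exception):
    """Fatal: the system cannot safely continue."""


def run_operation(operation, *args):
    """Run one operation and return (ok, result).

    A non-fatal inconsistency aborts this operation and is logged.
    A fatal error is logged and re-raised so the caller can shut down cleanly.
    """
    try:
        return True, operation(*args)
    except InconsistentRecord as exc:
        log.error("operation %s aborted: %s", operation.__name__, exc)
        return False, None
    except FatalDatabaseError as exc:
        log.critical("fatal error in %s: %s", operation.__name__, exc)
        raise


def lookup_dump_date(records, volume):
    """Hypothetical operation: fetch a volume's dump date from a dict of records."""
    record = records.get(volume)
    if record is None or "dump_date" not in record:
        raise InconsistentRecord(f"no usable record for volume {volume!r}")
    return record["dump_date"]
```

The point of the two exception classes is the categorization the requirement asks for: the caller keeps running after a non-fatal error and stops only on a fatal one.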
-> 4. under DATABASE with regards to recovery:
8) In the event of a database disk failure, the procedures documented
in the Oracle Systems Administrator's guide will be followed for
system recovery.
NOTE: It is a configuration requirement that Oracle operate in
"Archive Log" Mode for this recovery to restore up to the
last committed transaction, rather than the last full
backup of the Oracle Database.
becomes
8) In the event of a database disk failure, the procedures documented
in the Database Systems Administrator's guide will be followed for
system recovery.
NOTE: It is a configuration requirement that the database operate in
"Archive Log" Mode for recovery to restore up to the last
committed transaction, rather than the last full backup of
the Database.
It was pointed out that the requirements should not stipulate a particular
database. Even though the team regards Oracle as the best database for
the job, the requirements can be stated more generically and this
strengthens the requirements. Recognizing that the Oracle methods are
the best, I have (subject to the team's approval) added this:
10) Before putting the system into production, the recovery procedures
documented in the Database Systems Administrator's guide will be
tested to confirm they work. If the procedures do not work, additional
recovery procedures will be developed, documented, and tested.
This will be a matter of testing where others never tested before. I am
not going to allow this to turn into an exercise in being forced to
write new procedures where existing ones are sufficient.
-> 5. under SYSTEM with regards to downtime:
4) The system shall be back in degraded service within one hour from the
time that operations notices it is down. (See CRIPPLED MODE below.)
5) The system shall be back in full service within 48 hours from the
time that operations notices it is down.
These downtime requirements were not stated explicitly, but they seemed
to be close to what the customers were saying they wanted and what I felt
we could provide. The 1 hour to crippled mode came from asking Matt how
long he felt was acceptable (and he said two hours) and then sanity
checking it with Jonathon (who felt we could probably do it within an
hour) and then my feeling Roger and Susan would feel good about one hour.
The main system downtime number came from two directions: Matt
suggested we express it in terms of a dump cycle. It would be good, he
felt, if we didn't fall behind if we went to crippled mode. Both Jeff
and Matt pointed out that if the system was successful, expectations
would rise. From all that, I got the idea that dump cycles would be a
week or less, and that 48 hours felt like something I could sell
people. (At various times Jeff seemed to sense I was trying to sell him
on 48 hours or less of downtime INSTEAD of an abstraction layer, and he
would have none of it.)
-> 6. CRIPPLED MODE is now read / write
2) Crippled mode does not offer full functionality. Restore only is
acceptable.
becomes
2) The system shall be able to perform dumps of limited functionality
when the database or Master is offline through the use of a Crippled
Mode user interface client. At minimum, the Crippled mode dump
command will prompt for a server name, and dump all partitions and
volumes from the named server.
This incorporates the stated requirements as the reviewers saw them, and
seems eminently doable from our standpoint. I think I have this right.
-> 7. CRIPPLED MODE wordsmithing the volume/tape table requirement:
5) The correlation for the tape to volume information and tape to
partition information will be provided by an ASCII report that is
produced as part of periodic maintenance procedures.
becomes
5) The correlation information of the tape to volume and tape to
partition required for restores will be provided by an ASCII
report that is produced as part of periodic maintenance
procedures.
I felt this wording was a little easier to understand.
-> 8. CRIPPLED MODE -- how many slaves?
6) A Crippled Mode interface client need only handle a single media
slave.
7) It is acceptable that Crippled Mode limit the number of slaves
which can run simultaneously on a single host.
becomes
6) A Crippled Mode interface client should be able to drive as many
media slaves as the normal mode Master.
This says it more clearly and simply, and I believe it was our implicit
consensus that this was possible.
-> 9. CRIPPLED MODE -- a requirement if we do better than the minimum on dumps.
7) In the event that dump service of finer granularity than a whole
server (i.e. a partition or collection of volumes) is provided, the
mechanism developed for determining which volumes to dump will take
care to notice volumes created on a file server subsequent to the
crash of the Master.
The minimum dump requirement was stated, but another requirement was
stated if we go beyond the minimum: "Don't miss newly created users!".
I see no problem here.
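One way to read requirement 7 is as a set comparison: anything on the server now that was not in the last list known before the Master crashed is new and must be dumped along with whatever was asked for. The sketch below assumes Python and plain lists of volume names; the function name and its three parameters are hypothetical, and how the current listing is obtained from the file server is left out.

```python
def volumes_to_dump(requested, known_before_crash, on_server_now):
    """Pick the volumes for a finer-than-whole-server crippled mode dump.

    requested:          volumes the operator asked to dump
    known_before_crash: volumes listed in the last report made before the crash
    on_server_now:      volumes currently present on the file server
    """
    on_server_now = set(on_server_now)
    # Volumes created after the crash appear on the server but not in the old list.
    new_since_crash = on_server_now - set(known_before_crash)
    # Dump what was requested (if it still exists) plus everything new.
    selected = (set(requested) & on_server_now) | new_since_crash
    return sorted(selected)


if __name__ == "__main__":
    print(volumes_to_dump(
        requested=["user.a", "user.b"],
        known_before_crash=["user.a", "user.b", "user.c"],
        on_server_now=["user.a", "user.c", "user.new"],
    ))
```

In this example user.b has disappeared from the server and user.new was created after the crash, so the result is user.a and user.new.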
-> 10. CRIPPLED MODE -- The tape must be blank.
8) Crippled mode dumps will NEVER write to a tape with a pre-existing
ABS backup label. Crippled mode dumps will look for a label
and will only continue if no label is found.
Jeff says crippled mode should only write onto blank tapes. He doesn't
even want a utility to make tapes pretend they're blank. We can provide
this without a problem.
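The label check in requirement 8 could be sketched as below, assuming Python and a file-like object standing in for the tape device. The ABS label format is not given in these requirements, so the ABS_LABEL_MAGIC marker is purely hypothetical, as are the function and exception names.

```python
import io

ABS_LABEL_MAGIC = b"ABS-LABEL"  # hypothetical marker; the real label format is not specified


class LabeledTapeError(Exception):
    """Raised when a crippled mode dump is pointed at a labeled tape."""


def crippled_mode_write(tape, payload):
    """Write payload to tape only if no ABS label is found at the start."""
    tape.seek(0)
    first_block = tape.read(len(ABS_LABEL_MAGIC))
    if first_block == ABS_LABEL_MAGIC:
        raise LabeledTapeError("tape carries an ABS backup label; refusing to write")
    tape.seek(0)
    tape.write(payload)
    return len(payload)


if __name__ == "__main__":
    # io.BytesIO stands in for a tape device in this sketch.
    blank = io.BytesIO()
    print(crippled_mode_write(blank, b"dump-data"))
```

Note that there is deliberately no override flag: the only way through is for the label check to find nothing.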
-> 11. CRIPPLED MODE -- We should keep track of what we dump.
9) Crippled mode dumps will record status information in a format suitable
to update the Master database with accurate records of volume dump dates
when it comes back online.
10) A utility will be written and tested which will update the master database
with the logging output of the crippled mode dump.
In the meeting there was no firm statement of what to do with the dump
status of volumes dumped during crippled mode. I think all parties
agree that SOMETHING should go back into the database when it comes back
up. In my ignorance about how we keep status, I guessed that if we kept
a list of volumes dumped with their dump date, that would be sufficient.
Diane, Brian: Could you correct me on what status should be retained?
Diane: Could you correct me if I've stated something that isn't done
with the present system? I.e., do we keep partition level dates too?
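To make the guess above concrete, here is a minimal sketch of requirements 9 and 10, assuming Python, a tab-separated log of volume name and dump date, and a plain dict standing in for the Master database. All of these are assumptions: the record format, the ISO-style date strings, and both function names are hypothetical, and the question of what status really needs to be kept is still open.

```python
import io


def record_dump(log_file, volume, dump_date):
    """Append one crippled mode dump record: volume name, tab, dump date."""
    log_file.write(f"{volume}\t{dump_date}\n")


def apply_crippled_log(log_file, master_dates):
    """Update master_dates (volume -> last dump date) from a crippled mode log.

    Dates are assumed to be ISO-style strings, which sort chronologically,
    so a record is applied only if it is newer than what the Master has.
    Returns the number of records applied.
    """
    updated = 0
    for line in log_file:
        line = line.rstrip("\n")
        if not line:
            continue
        volume, dump_date = line.split("\t")
        if dump_date > master_dates.get(volume, ""):
            master_dates[volume] = dump_date
            updated += 1
    return updated


if __name__ == "__main__":
    # io.StringIO stands in for the crippled mode log file.
    log_file = io.StringIO()
    record_dump(log_file, "user.wdc", "1996-06-10T20:28:21")
    log_file.seek(0)
    master = {}
    print(apply_crippled_log(log_file, master), master)
```

If partition level dates also need to be kept, the record would need another field; that is not shown here.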
-> 12. CRIPPLED MODE error message requirement moved:
11) Crippled Mode should provide a means to funnel all error messages
and event notifications from all components through a single channel.
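As an illustration of what "a single channel" could mean, the sketch below uses Python's standard logging module: every component logs under a common parent logger, and the one handler attached to that parent receives everything. The logger names and the make_channel helper are hypothetical; the requirement does not say how the channel is to be built.

```python
import io
import logging


def make_channel(stream):
    """Attach one handler to the parent logger; all components funnel through it."""
    parent = logging.getLogger("abs.crippled")  # hypothetical parent logger name
    parent.setLevel(logging.INFO)
    parent.handlers.clear()   # keep exactly one channel
    parent.propagate = False  # do not leak messages to the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    parent.addHandler(handler)
    return parent


if __name__ == "__main__":
    # io.StringIO stands in for whatever the operators actually watch.
    channel = io.StringIO()
    make_channel(channel)
    # Components use child loggers; their records propagate to the parent's handler.
    logging.getLogger("abs.crippled.client").info("dump started")
    logging.getLogger("abs.crippled.slave1").error("tape write failed")
    print(channel.getvalue(), end="")
```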
-> 13. CRIPPLED MODE performance:
12) Crippled Mode performance should be no worse than the performance of the
dump system being replaced by ABS.
An explicit requirement was not stated, but there seemed to be thinking
along the lines that if crippled mode was no worse than what we had now,
AND NO BETTER, that it would be OK.