[62] in Athena_Backup_System
error handling
daemon@ATHENA.MIT.EDU (Jonathon Weiss)
Fri Jan 13 20:19:05 1995
From: Jonathon Weiss <jweiss@MIT.EDU>
To: athena-backup@MIT.EDU
Date: Fri, 13 Jan 1995 20:18:52 EST
Here is a first draft of what the tape slave should do in all of the
error cases we came up with today. Yeah, it gets kinda terse at the
end, and there are some things that will need to change, but I wanted
to get you something to think about before I left. Feel free to write
down or email me comments, but remember that I won't read them for a
week.
Jonathon
slave->tape-drive errors
device not present (i.e. /dev/tapedrive doesn't exist):
This particular error requires a system administrator to examine and
correct the configuration of the tape-slave (or possibly the drives
configured in the backup system). Therefore, it should be
forwarded to the master along with notification that the job request
is being ignored, and should be rescheduled elsewhere.
drive time-out:
This can occur either when trying to start a job, or in the middle of
one. It is symptomatic of a hardware failure. As such, it should be
reported back to the master, along with notification that the current
job is being aborted, and should be rescheduled elsewhere.
device busy:
We have no idea what is busying the drive (perhaps something besides
ABS using the tape-drive, perhaps some confusion in the kernel) or how
long it will be doing so. The error should be reported to the master
along with notification that the current job is being ignored, and
should be rescheduled elsewhere.
no tape:
The operator needs to deal with this, and he should be reminded of it
periodically. How is the slave communicating with the operator?
offline:
See "no tape"
wrong tape
See "no tape"
media error:
A single media error may not be a problem. An error struct should be
generated. If it caused a volume to fail, that volume should be put
on the failed volume list to be retried. When the dumpset finishes
the error should be reported to the master for reference. If 'retry'
errors occur, the tape slave should report them to the master and
notify the master that it is aborting the dump.
end of tape
The drive should eject the tape, and go into the "no tape" state.
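The tape-drive cases above boil down to a small lookup from error to
action. Here is a rough sketch of that table, nothing more. The error
keys, the Action names, and tape_drive_action are all made up for
illustration and are not part of ABS. The 'retry' case under media
error (abort the dump) is left out to keep it short.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical action names; the values paraphrase the notes above.
    REPORT_AND_RESCHEDULE = "report to master; job should be rescheduled elsewhere"
    WAIT_FOR_OPERATOR = "remind the operator periodically"
    LOG_AND_REPORT_AT_END = "generate error struct; report when dumpset finishes"
    EJECT_THEN_WAIT = "eject the tape, then go into the no-tape state"


# Hypothetical error keys, one per tape-drive case in the notes.
TAPE_DRIVE_POLICY = {
    "device_not_present": Action.REPORT_AND_RESCHEDULE,
    "drive_timeout": Action.REPORT_AND_RESCHEDULE,
    "device_busy": Action.REPORT_AND_RESCHEDULE,
    "no_tape": Action.WAIT_FOR_OPERATOR,
    "offline": Action.WAIT_FOR_OPERATOR,      # see "no tape"
    "wrong_tape": Action.WAIT_FOR_OPERATOR,   # see "no tape"
    "media_error": Action.LOG_AND_REPORT_AT_END,
    "end_of_tape": Action.EJECT_THEN_WAIT,
}


def tape_drive_action(error):
    """Return the planned action for a tape-drive error key."""
    return TAPE_DRIVE_POLICY[error]
```

Keeping the policy in one table should make it easy to change a
single case later without touching the code that acts on it.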
slave->server-to-be-backed-up errors
volume isn't here:
Generate an error struct to be returned when the dumpset finishes.
Remove the volume from the list of dumped volumes and add it to a list
of failed volumes that are not to be re-attempted. (Since it's not
likely to show back up).
(volume has moved?)
Treat the same as volume not here (for simplicity).
volume corrupt
Generate an error struct to be returned when the dumpset finishes.
Remove the volume from the list of dumped volumes and add it to a list
of failed volumes that are not to be re-attempted. (Since it's not
likely to get un-corrupt without intervention).
can't get authentication
Report the error to the master, and notify it that the slave is
aborting the dumpset.
permission denied
The slave should re-authenticate, and retry the volume. If permission
is denied again, generate an error to return to the master, and move
the volume to the list of volumes not to retry.
volume busy
Generate error, put on list of volumes to retry
connection breaks
Generate error, put on list of volumes to retry
can't open connection
Generate error, put on list of volumes to retry
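The server-side cases above sort each failed volume onto one of two
lists (retry, or don't retry), with two special cases for
authentication. Here is a rough sketch of that bookkeeping, assuming
plain Python lists stand in for whatever the slave really keeps. The
names DumpsetState, handle_volume_error, the error keys, and the
returned strings are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical error keys, grouped the way the notes group them.
RETRY_LATER = {"volume_busy", "connection_breaks", "cant_open_connection"}
DO_NOT_RETRY = {"volume_not_here", "volume_moved", "volume_corrupt"}
OTHER = {"permission_denied", "cant_get_authentication"}


@dataclass
class DumpsetState:
    dump_list: list = field(default_factory=list)  # volumes in this dumpset
    retry: list = field(default_factory=list)      # failed, to be retried
    no_retry: list = field(default_factory=list)   # failed, not re-attempted
    errors: list = field(default_factory=list)     # (volume, error) records
    aborted: bool = False


def handle_volume_error(state, volume, error, reauth_tried=False):
    """Record one volume failure and return what the slave does next."""
    if error not in RETRY_LATER | DO_NOT_RETRY | OTHER:
        raise ValueError("unknown error: %s" % error)
    # First permission failure: re-authenticate and retry, no error yet.
    if error == "permission_denied" and not reauth_tried:
        return "reauthenticate_and_retry"
    state.errors.append((volume, error))
    if error == "cant_get_authentication":
        state.aborted = True
        return "abort_dumpset"
    if error in RETRY_LATER:
        state.retry.append(volume)
        return "retry_later"
    # Missing, moved, or corrupt volumes also come off the dump list.
    if error in DO_NOT_RETRY and volume in state.dump_list:
        state.dump_list.remove(volume)
    state.no_retry.append(volume)
    return "do_not_retry"
```

The caller is assumed to pass reauth_tried=True on the second
permission failure for the same volume.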
slave->master errors
can't get connection
machine down
machine up, but master process down
These are all network or hardware problems. Sleep and retry, ad
infinitum.
master is ignoring us:
Not much we can do; sleep and retry, ad infinitum.
can't get auth:
Not much we can do; sleep and retry, ad infinitum.
auth error
Not much we can do; sleep and retry, ad infinitum.
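Every slave->master case above ends in the same loop: sleep, then try
again, forever. A rough sketch of that loop follows. The function
name, the connect callable, and the 60-second default are assumptions
(the notes don't give an interval), and failures are assumed to show
up as OSError.

```python
import time


def connect_with_retry(connect, delay_seconds=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping between attempts.

    connect is a hypothetical callable that returns a connection or
    raises OSError. There is deliberately no retry limit.
    """
    while True:
        try:
            return connect()
        except OSError:
            # Master unreachable or refusing us: wait and try again.
            sleep(delay_seconds)
```

The sleep function is a parameter only so the loop can be exercised
without actually waiting.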
slave->local-os errors
gets a signal:
Ignore most signals. On SIGUSR1, shut down after the current dumpset
finishes and is reported to the master. On SIGUSR2, ignore a previous
SIGUSR1. On SIGHUP, save state and shut down after the current volume.
On SIGTERM, shut down at the earliest possible time that won't leave
things like the vldb in a confused state. In the SIGHUP and SIGUSR1
cases, inform the master that you are going down.
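One way to picture the signal rules above is as a small "pending
shutdown" record that the dump loop checks between volumes and
dumpsets. This is only a sketch: ShutdownPlan and its field names are
invented, and the rule that a more urgent request replaces a less
urgent one is an assumption, since the notes don't say what happens
when several signals arrive.

```python
import signal

# Hypothetical shutdown levels, least to most urgent.
URGENCY = {None: 0, "after_dumpset": 1, "after_volume": 2,
           "earliest_safe_point": 3}
REQUESTS = {
    signal.SIGUSR1: "after_dumpset",
    signal.SIGHUP: "after_volume",
    signal.SIGTERM: "earliest_safe_point",
}


class ShutdownPlan:
    def __init__(self):
        self.pending = None  # no shutdown requested yet

    def on_signal(self, signum, frame=None):
        if signum == signal.SIGUSR2:
            # SIGUSR2 only cancels a previous SIGUSR1.
            if self.pending == "after_dumpset":
                self.pending = None
            return
        request = REQUESTS.get(signum)
        if request is None:
            return  # every other signal is ignored
        if URGENCY[request] > URGENCY[self.pending]:
            self.pending = request

    @property
    def inform_master(self):
        # Only the SIGUSR1 and SIGHUP cases tell the master.
        return self.pending in ("after_dumpset", "after_volume")

    @property
    def save_state(self):
        return self.pending == "after_volume"


def install(plan):
    """Register plan.on_signal for the four signals named above."""
    for signum in (signal.SIGUSR1, signal.SIGUSR2,
                   signal.SIGHUP, signal.SIGTERM):
        signal.signal(signum, plan.on_signal)
```

The handler only records the request. The actual shutdown would
happen in the main loop at a point where things like the vldb are
consistent.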
permission denied
on tape-device or log area
Inform master of error and abort dump
out of a resource
disk
memory
processes
Inform master of error and abort dump
starting up
after machine shutdown
Depending upon whether the tape slave process was shut down before the
machine was, this will fall into either the process shutdown or the
process crash state below.
after machine crash
This can be looked at as a special case of the process crash below
after process crash
Inform the master that the slave is back, return logged errors (but
not successes), and abort the dump.
after process shutdown
Restore state from logs, and inform the master that the slave is back
and, if it was in the middle of a job, that it is continuing.