[62] in Athena_Backup_System


error handling

daemon@ATHENA.MIT.EDU (Jonathon Weiss)
Fri Jan 13 20:19:05 1995

From: Jonathon Weiss <jweiss@MIT.EDU>
To: athena-backup@MIT.EDU
Date: Fri, 13 Jan 1995 20:18:52 EST


Here is a first draft of what the tape slave should do in all of the
error cases we came up with today.  Yeah, it gets kinda terse at the
end, and there are some things that will need to change, but I wanted
to get you something to think about before I left.  Feel free to write
down or email me comments, but remember that I won't read them for a
week.

	Jonathon


slave->tape-drive errors

	device not present (i.e. /dev/tapedrive doesn't exist):
This particular error requires a system administrator to examine and
correct the configuration of the tape-slave (or possibly the drives
configured in the backup system).  Therefore, it should be
forwarded to the master along with notification that the job request
is being ignored, and should be rescheduled elsewhere.

	drive time-out:
This can occur either when trying to start a job, or in the middle of
one.  It is symptomatic of a hardware failure.  As such it should be
reported back to the master, along with notification that the current
job is being aborted, and should be rescheduled elsewhere.

	device busy:
We have no idea what is busying the drive (perhaps something besides
ABS using the tape-drive, perhaps some confusion in the kernel) or how
long it will be doing so.  The error should be reported to the master
along with notification that the current job is being ignored, and
should be rescheduled elsewhere.

	no tape:
The operator needs to deal with this, and should be reminded of it
periodically.  How is the slave communicating with the operator?

	offline:
See "no tape"

	wrong tape
See "no tape"

	media error:
A single media error may not be a problem.  An error struct should be
generated.  If it caused a volume to fail, that volume should be put
on the failed volume list to be retried.  When the dumpset finishes
the error should be reported to the master for reference.  If 'retry'
errors occur, the tape slave should report them to the master and
notify the master that it is aborting the dump.

	end of tape
The drive should eject the tape, and go into the "no tape" state.
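
A rough sketch of the tape-drive cases above as a lookup table, in
Python just for illustration.  The error names, the Action values and
action_for() are all made up here, not anything in ABS.  The 'retry'
variant of media error (report and abort the dump) is left out, since
what counts as a 'retry' error isn't pinned down yet.

```python
from enum import Enum, auto


class Action(Enum):
    # Hypothetical action names; nothing here is an existing ABS interface.
    REPORT_AND_IGNORE_JOB = auto()   # tell master, job gets rescheduled elsewhere
    REPORT_AND_ABORT_JOB = auto()    # tell master, abort the running job
    WAIT_FOR_OPERATOR = auto()       # remind the operator periodically
    RECORD_AND_CONTINUE = auto()     # build an error struct, report at dumpset end
    EJECT_THEN_WAIT = auto()         # eject the tape, then behave like "no tape"


# Assumed error names, one per case in the list above.
TAPE_ERROR_ACTIONS = {
    "device_not_present": Action.REPORT_AND_IGNORE_JOB,
    "drive_timeout": Action.REPORT_AND_ABORT_JOB,
    "device_busy": Action.REPORT_AND_IGNORE_JOB,
    "no_tape": Action.WAIT_FOR_OPERATOR,
    "offline": Action.WAIT_FOR_OPERATOR,
    "wrong_tape": Action.WAIT_FOR_OPERATOR,
    "media_error": Action.RECORD_AND_CONTINUE,
    "end_of_tape": Action.EJECT_THEN_WAIT,
}


def action_for(error_name):
    """Look up the planned response; unknown names raise KeyError."""
    return TAPE_ERROR_ACTIONS[error_name]
```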



slave->server-to-be-backed-up errors
	volume isn't here:
Generate an error struct to be returned when the dumpset finishes.
Remove the volume from the list of dumped volumes and add it to a list
of failed volumes that are not to be re-attempted (since it's not
likely to show back up).

	(volume has moved?)
Treat the same as volume not here (for simplicity).

	volume corrupt
Generate an error struct to be returned when the dumpset finishes.
Remove the volume from the list of dumped volumes and add it to a list
of failed volumes that are not to be re-attempted (since it's not
likely to get un-corrupt without intervention).

	can't get authentication
Report the error to the master, and notify it that the slave is
aborting the dumpset.

	permission denied
The slave should re-authenticate, and retry the volume.  If permission
is denied again, generate an error to return to the master, and move
the volume to the list of volumes not to retry.

	volume busy
Generate error, put on list of volumes to retry

	connection breaks 
Generate error, put on list of volumes to retry

	can't open connection
Generate error, put on list of volumes to retry
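
Here is a minimal sketch of the retry / no-retry bookkeeping for the
per-volume cases above.  DumpsetState, record_volume_error() and the
error names are hypothetical, and "pending" is my guess at what the
list the volume gets removed from looks like.  The authentication and
permission-denied cases aren't covered, since they need more than a
list move.

```python
from dataclasses import dataclass, field

# Assumed error names for the cases above.
NO_RETRY_ERRORS = {"volume_not_here", "volume_moved", "volume_corrupt"}
RETRY_ERRORS = {"volume_busy", "connection_broken", "cannot_open_connection"}


@dataclass
class DumpsetState:
    pending: list                                  # volumes still to be dumped
    retry: list = field(default_factory=list)      # failed, try again later
    no_retry: list = field(default_factory=list)   # failed, don't re-attempt
    errors: list = field(default_factory=list)     # returned when dumpset ends


def record_volume_error(state, volume, error):
    """File a failed volume on the retry or no-retry list."""
    if error in NO_RETRY_ERRORS:
        target = state.no_retry
    elif error in RETRY_ERRORS:
        target = state.retry
    else:
        raise ValueError(f"unhandled volume error: {error}")
    state.errors.append((volume, error))
    if volume in state.pending:
        state.pending.remove(volume)
    target.append(volume)
```

For example, a corrupt volume ends up on no_retry while a busy one
ends up on retry, and both leave an entry in errors for the report at
the end of the dumpset.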



slave->master errors
	can't get connection
		machine down
		machine up, but master process down

These are all network or hardware problems.  Sleep and retry, ad
infinitum.

	master is ignoring us:
Not much we can do; sleep and retry, ad infinitum.

	can't get auth:
Not much we can do; sleep and retry, ad infinitum.

	auth error
Not much we can do; sleep and retry, ad infinitum.
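
"Sleep and retry, ad infinitum" could look roughly like this.  The
MasterUnavailable exception, connect_with_retry() and the 60 second
default are assumptions for the sketch; the real delay (and whether it
should back off) hasn't been decided.

```python
import time


class MasterUnavailable(Exception):
    """Hypothetical: raised for any of the slave->master failures above."""


def connect_with_retry(connect, delay_seconds=60, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping between attempts.

    There is deliberately no attempt limit: the slave has nothing
    better to do until the master answers.
    """
    while True:
        try:
            return connect()
        except MasterUnavailable:
            sleep(delay_seconds)
```

The sleep argument is only there so the loop can be exercised without
actually waiting.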



slave->local-os errors

	gets a signal:
Ignore most signals.  On SIGUSR1, shut down after the current dumpset
finishes and is reported to the master.  On SIGUSR2, ignore a previous
SIGUSR1.  On SIGHUP, save state and shut down after the current
volume.  On SIGTERM, shut down at the earliest possible time that
won't leave things like the vldb in a confused state.  In the SIGHUP
and SIGUSR1 cases, inform the master that you are going down.
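
A sketch of the signal bookkeeping, assuming a Unix host.  The
ShutdownState class and its state names are made up.  One thing the
text above doesn't settle is what happens when several signals arrive;
this sketch assumes a more urgent request is never downgraded by a
later, less urgent one.

```python
import signal

# Hypothetical shutdown levels, least urgent first.
URGENCY = {None: 0, "after_dumpset": 1, "after_volume": 2, "asap": 3}


class ShutdownState:
    def __init__(self):
        self.pending = None   # one of the URGENCY keys

    @property
    def notify_master(self):
        # Only the SIGUSR1 and SIGHUP cases tell the master we are going down.
        return self.pending in ("after_dumpset", "after_volume")

    @property
    def save_state(self):
        # Only the SIGHUP case saves state before shutting down.
        return self.pending == "after_volume"

    def _request(self, level):
        if URGENCY[level] > URGENCY[self.pending]:
            self.pending = level

    def handle(self, signum, frame=None):
        if signum == signal.SIGUSR1:
            self._request("after_dumpset")
        elif signum == signal.SIGUSR2:
            # Cancels an earlier SIGUSR1 and nothing else.
            if self.pending == "after_dumpset":
                self.pending = None
        elif signum == signal.SIGHUP:
            self._request("after_volume")
        elif signum == signal.SIGTERM:
            self._request("asap")
        # Any other signal is ignored.


def install(state):
    """Register state.handle for the four signals discussed above."""
    for sig in (signal.SIGUSR1, signal.SIGUSR2, signal.SIGHUP, signal.SIGTERM):
        signal.signal(sig, state.handle)
```

The main loop would then check state.pending between volumes and
between dumpsets rather than doing any real work inside the handler.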

	permission denied
		on tape-device or log area
Inform master of error and abort dump

	out of a resource
		disk
		memory
		processes
Inform master of error and abort dump

	starting up
		after machine shutdown
Depending upon whether the tape slave process was shut down before the
machine was, this will either fall in the process shutdown or process
crash state below.

		after machine crash
This can be looked at as a special case of the process crash below

		after process crash
Inform the master that the slave is back, return logged errors (but
not successes), and abort the dump.

		after process shutdown
Restore state from logs, inform the master that the slave is back, and
continue if in the middle of a job.
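
The four startup cases boil down to two, which can be sketched as
below.  startup_plan() and the step names are hypothetical, and how
the slave tells a clean shutdown from a crash (a marker in the log
area, say) is an open question that this sketch just takes as an
input.

```python
def startup_plan(clean_shutdown, job_in_progress):
    """Return the ordered steps to run when the tape slave starts.

    clean_shutdown: True if the process was shut down cleanly before
        it (or the machine) went away, False after any kind of crash.
    job_in_progress: True if the saved state shows an unfinished job.
    """
    if clean_shutdown:
        steps = ["restore_state_from_logs", "inform_master_slave_is_back"]
        if job_in_progress:
            steps.append("continue_job")
        return steps
    # Process crash, which also covers a machine crash.
    return [
        "inform_master_slave_is_back",
        "return_logged_errors",
        "abort_dump",
    ]
```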

