[15696] in Athena Bugs
Zwrite collision?
daemon@ATHENA.MIT.EDU (Ted McCabe)
Tue Nov 25 12:32:46 1997
To: ops@MIT.EDU
Cc: bugs@MIT.EDU
Date: Tue, 25 Nov 1997 12:32:42 EST
From: Ted McCabe <ted@MIT.EDU>
Yesterday, one of our backups was delayed starting by 15 hours. It
appears that the culprit was the zwrite command:
/usr/athena/bin/zwrite -s $host -O auto -d -n -c afsadm -i backups -m "Starting Dump of $vset /$archive on port $argv[2]"
(in this case, $host = chain, $vset = 158B, $archive = full and $argv[2]=1)
From the afsadm zephyr log (/mit/afsadm/afsadm.0 at the time of this
writing):
---------------
Instance: backups Time: 11:02:05 Date: Mon Nov 24 1997 Host: chain Opcode: auto
From: chain <root>
Starting Dump of alldev /full on port 0
<interim zephyrs from other hosts deleted>
Instance: backups Time: 02:12:06 Date: Tue Nov 25 1997 Host: chain Opcode: auto
From: chain <root>
Done with alldev.full on port 30
Tape Log File:
02:11:14 Task 30002: End of pass 1: Volumes remaining = 0
02:11:19 Task 30002: alldev.full (DumpId 880387339): Finished. 208 volumes dumped
Errors File:
Instance: backups Time: 11:02:05 Date: Mon Nov 24 1997 Host: chain Opcode: auto
From: chain <root>
Starting Dump of 158B /full on port 1
----------------
Notice that the second zephyr with 11:02:05 Mon timestamp actually isn't
received until after 2:12:06 Tue (also verified by eyewitness
accounts).
Other logs indicate that the backups continued normally within seconds
of the 2:12:06 zephyr indicating that the script was waiting on the
zwrite to finish.
Checking other zephyr logs indicates that it's not uncommon for
zephyrs to be sent from the same machine within 1 second of each
other (most likely on Sunday and Monday). My guess is that there's
some collision/race condition that can happen if two zwrites fire up
at the same time.
Ops: I'm going to add two line of code to the backup scripts that
should make this event even less likely.
--Ted