[19590] in Athena Bugs


Linux behaves poorly in the face of network outages

daemon@ATHENA.MIT.EDU (Mitchell E Berger)
Sat Aug 11 02:03:39 2001

Message-Id: <200108110603.CAA01987@byte-me.mit.edu>
To: bugs@mit.edu
Date: Sat, 11 Aug 2001 02:03:14 -0400
From: Mitchell E Berger <mitchb@MIT.EDU>

Over the past couple of weeks, Athena machines have had a trial by fire in
surviving network outages.  Recently, I've noticed a couple of people asking
OLC questions about why they're frequently being given temp homedirs, and
saying that it only happens on Linux machines.  Last night at dinner, a friend
of mine mentioned the same thing, and when asked if he could reproduce it, he
said it happens a lot on a particular quickstation.  I checked, and sure
enough, that quickstation is the machine the last OLC question about this came
from.  Poking with athinfo, I found that the machine had an update.desync file
but was already running 9.0.13.  It took 9.0.13 at an appropriate time (based
on when it was released), yet the update.desync file said it should update a
few days after that.

I visited the machine today, and Bob and I spent a couple of hours debugging
over zephyr.  The problem seems to be this: in the event of a network outage,
reactivate still runs update_ws, which we expect to try running getcluster and
fail to get its Hesiod info that way.  It should then fall back to the default
values $SYSPREFIX=/afs/athena.mit.edu/system/rhlinux and
$SYSCONTROL=control/control-current.  When it tries to cd to $SYSPREFIX, that
should fail and update_ws should error out.  However, if the machine has not
been in active use lately, reactivate has been running fairly frequently, and
/afs/athena.mit.edu/system/rhlinux is therefore in the AFS cache, so the cd
will succeed even though the network is out.
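To make the failure mode concrete, here's the relevant logic as I understand
it, sketched in Bourne shell (the variable names are from update_ws; the
getcluster invocation and the surrounding control flow are my approximation,
not the literal source):

    # Approximate flow of update_ws during a network outage:
    eval `getcluster -b $HOST $VERSION`   # Hesiod lookup fails; no output
    if [ -z "$SYSPREFIX" ]; then
        # Fall back to the hardcoded defaults.
        SYSPREFIX=/afs/athena.mit.edu/system/rhlinux
        SYSCONTROL=control/control-current
    fi
    # This cd is the intended safety check: with the network down, it
    # should fail and update_ws should error out.  But if frequent
    # reactivates have kept $SYSPREFIX in the AFS cache, it succeeds.
    cd $SYSPREFIX || exit 1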

However, a PUBLIC=true machine will then try to read the last line of the
control file into $newvers and $newlist.  Currently, control-current does not
point at control-9.0, which all the public machines should be using, and which
might also be in their AFS cache.  Instead, it points at control-8.4.  I think
that's probably a bug in its own right now that 9.0 has gone public.
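To be concrete, the read described above is presumably something like this
(a sketch; I'm assuming the last line of the control file is just a version
followed by a packlist, which is what the variable names suggest):

    # Grab the last line of the control file, e.g. "9.0.13 packlist-path".
    # If tail fails (file not in the cache, network out), both variables
    # end up empty.
    set -- `tail -1 $SYSPREFIX/$SYSCONTROL`
    newvers=$1
    newlist=$2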

Either way, control-8.4 shouldn't be in a public machine's AFS cache, so
tailing the file will fail and $newvers and $newlist will be empty strings.
The only time a public machine declines to take an update is when $newvers
matches the version it's currently running, and "" is different from "9.0.13",
so the machine essentially schedules itself to update to nothing.  When the
time for the update comes, presumably the network will be back and the machine
will no longer think there's a new version, so the update.desync file won't
get touched.
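In shell terms, the decision comes down to a test like this ("schedule_update"
is a stand-in name for whatever actually writes update.desync, not the real
function):

    version=9.0.13     # what the machine is running
    newvers=""         # the tail failed during the outage
    if [ "$newvers" != "$version" ]; then
        # "" != "9.0.13", so this branch is taken and the machine
        # schedules an update to nothing, leaving a future time in
        # update.desync.
        schedule_update "$newvers" "$newlist"
    fi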

There's no direct problem with that.  However, the next time a patch is
released, any machine stuck in that state won't desynchronize itself; it will
take the update immediately, because update.desync already exists with an
earlier time.  The way to avoid this is to ensure the version isn't empty
before scheduling an update; I'll submit a patch for this.
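The fix amounts to adding an emptiness check to that test, along these lines
(a sketch of the idea, not the literal patch):

    # Only schedule an update if we actually learned a new version.
    if [ -n "$newvers" -a "$newvers" != "$version" ]; then
        schedule_update "$newvers" "$newlist"
    fi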

I scanned the public Linux cluster with athinfo, and 24 machines responded
that they had an update.desync.  Bob and I think Suns and SGIs aren't
vulnerable to this type of problem because their update system is so
different, but it might be worth checking before 9.0.14 goes out.
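For the record, the scan was along these lines (the host list file and the
athinfo query name here are placeholders, not the exact ones I used):

    # Ask each public Linux machine whether it has a pending desync file.
    for host in `cat public-linux-hosts`; do
        if athinfo $host update 2>/dev/null | grep update.desync >/dev/null
        then
            echo "$host has an update.desync"
        fi
    done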

Of course, giving out temp homedirs is exactly what we want when the network
causes AFS problems.  But the reason quickstation-11-2 kept giving them out
from August 1st (the time of a network outage) until last night is that it was
rebooted at an unfortunate time during the outage, couldn't mount /afs when it
came up, and nobody rebooted it again until I did.  The rvdinfo athinfo query
can be used on Suns and SGIs to check AFS, but there's nothing comparable to
check AFS on Linux.  Bob and I think it would be useful to have one, but we
haven't yet come up with an implementation.
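One possible shape for such a query, purely as a starting point (the query
name "afsinfo" is hypothetical; "fs sysname" is a standard AFS client command
that fails if the client isn't running):

    #!/bin/sh
    # Hypothetical "afsinfo" athinfo query: is the AFS client up and
    # /afs actually usable?
    if fs sysname >/dev/null 2>&1 && [ -d /afs/athena.mit.edu ]; then
        echo "AFS: up"
        exit 0
    else
        echo "AFS: down or /afs not mounted"
        exit 1
    fi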

Another bug this probably reveals: a failure to mount AFS neither makes the
machine refuse to log you in and tell you to notify the Hotline, nor makes it
ever try to recover AFS on successive reactivates.

I'm going to send this to the OLC user and my friend to explain what happened
and thank them for alerting us to these bugs.

Mitch
