[412] in athena10
Re: Automatic updates for Athena 10
daemon@ATHENA.MIT.EDU (Jonathon Weiss)
Mon Aug 11 22:37:08 2008
Message-Id: <200808120236.m7C2aNLt013281@vorpal-blade.mit.edu>
From: Jonathon Weiss <jweiss@MIT.EDU>
To: Greg Hudson <ghudson@MIT.EDU>
cc: Kenneth Arnold <kcarnold@MIT.EDU>, athena10@MIT.EDU
In-reply-to: Your message of "Mon, 11 Aug 2008 16:21:41 EDT."
<1218486101.12433.177.camel@error-messages.mit.edu>
Date: Mon, 11 Aug 2008 22:36:23 -0400
> So, I think I get to do it from scratch. My general integration plan
> is:
>
> 1. Divert /etc/gdm/PostSession/Default (which does nothing by
> default). This script runs as root after a logout session and gdm
> blocks until it is complete. DISPLAY is set, but the X server might be
> dead in some cases.
>
> 2. Create a cron.hourly job which checks if anyone is logged in and
> runs an update, ideally blocking logins while it runs.
>
> The actual update process is just "aptitude update && aptitude
> full-upgrade" so most of the scripting will be the safety checks around
> it, possibly an attempt to communicate to the console that an update is
> happening, etc.. Also, /etc/sources.list.d/debathena.list will be
> updated depending on cluster info.
>
> On desynchronization: I am less concerned about this than I was for 9.4
> updates. First, Athena 10 updates should typically come in small pieces
> since we don't batch them up for months at a time. Second, the data
> will be served by athena10.mit.edu, so the load on the AFS servers
> should be lessened by the AFS cache on that machine. And third, the
> network for clusters is much better than it used to be.
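
(For reference, a rough sketch of the hourly job in item 2. Apart from the two aptitude commands quoted above, everything here is an assumption for illustration: the function names are hypothetical, "anyone logged in" is read as "any user listed by `who`", and nothing here blocks logins while the update runs.)

```python
import subprocess

# The two commands named in the plan, run in order.
UPDATE_COMMANDS = [
    ["aptitude", "update"],
    ["aptitude", "full-upgrade"],
]


def logged_in_users(who_output):
    """Return the set of usernames found in `who`-style output."""
    users = set()
    for line in who_output.splitlines():
        fields = line.split()
        if fields:
            users.add(fields[0])  # first column of `who` is the username
    return users


def run_update_if_idle(who_output, run=subprocess.check_call):
    """Run the update commands only when nobody is logged in.

    Returns True if the update ran, False if it was skipped.
    """
    if logged_in_users(who_output):
        return False
    for command in UPDATE_COMMANDS:
        run(command)  # check_call raises if a command exits non-zero
    return True
```

The job itself would call something like run_update_if_idle(subprocess.check_output(["who"], text=True)). A real job would also need aptitude to run without prompting, which this sketch does not handle.
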
I think I'm missing something. In '1' above, what are you planning
to have that script do once you've diverted it?
As for desynchronization, I have some concerns with dropping it
altogether. First, note that as a counter-argument to the AFS cache
helping the AFS servers, you're just concentrating the data on the
network of athena10.mit.edu (the AFS servers' bottleneck in this case
has always been their network, not the cpu or disk or anything). I
suppose this is mitigated if we do end up with load balancing among
multiple machines to serve the athena10 name, and by having faster
server networks, but it's still a potential issue. Second, are you
asserting that there will never be a batch update from the upstream
vendor, a la RedHat's quarterly updates? This seems...implausible.
I also agree with the previous assertion that having all of the empty
cluster seats update at once could be annoying to someone who walks
into the cluster right then.
I'd feel a lot better if we desync'd over at least the course of an
hour. Is there a particular reason not to? I'm certainly willing to
say it doesn't matter for the pre-release, but when we go to full
production, I'd be nervous about lacking it.
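
(A minimal sketch of what desynchronizing over an hour could look like. It assumes each machine derives a fixed offset from its own hostname; the function name and the hostname-hash approach are illustrative assumptions, not part of any existing Athena script.)

```python
import hashlib


def desync_delay(hostname, window_seconds=3600):
    """Map a hostname to a stable delay in [0, window_seconds)."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # Treat the first 8 bytes of the hash as an integer and reduce it
    # to the window, so each host lands on its own repeatable offset.
    return int.from_bytes(digest[:8], "big") % window_seconds
```

The update job would sleep for desync_delay(socket.gethostname()) seconds before starting. A random delay would spread the load just as well; the hash only keeps each machine's slot the same from run to run.
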
As an aside, ops has started desynchronizing all of
/etc/cron.{monthly,weekly,daily,hourly} themselves. We got bitten by
one or two of the weekly and daily jobs firing off at once on 10-15
VMs and taking out the host they were running on. Not desynchronizing
things that run on many machines always seems to find a way to bite us.
Jonathon