[867] in athena10

Re: Auto-updating clusters

daemon@ATHENA.MIT.EDU (Jonathon Weiss)
Tue Jan 20 17:11:14 2009

Message-Id: <200901202209.n0KM9qmV025408@speaker-for-the-dead.mit.edu>
From: Jonathon Weiss <jweiss@MIT.EDU>
To: Evan Broder <broder@MIT.EDU>
cc: debathena@MIT.EDU
In-reply-to: Your message of "Tue, 13 Jan 2009 18:05:43 EST."
             <496D1E47.1030002@mit.edu> 
Date: Tue, 20 Jan 2009 17:09:52 -0500


With Athena 9 we invoke update_ws every 6 minutes and desync over 4
hours.  I'm happier with the 4 hours (14400 seconds)
desynchronization.  As we recently demonstrated with a
desynchronization failure, currently the server infrastructure is more
likely to be the bottleneck than the cluster networks.

Are you planning to do the desynchronization before or after you
determine whether there are actually any updates to apply?  If it is
after, you can certainly run the cron job every 5 minutes and desync
for 4 hours and have the same schedule we do now.  If you're desyncing
before figuring out if there is an update, then you probably don't
want to run the cron job a lot more often than the desync interval,
but a little more often might be okay.

	Jonathon
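
A minimal sketch of the two orderings described above, for comparison.
All names here (has_updates, apply_updates, sleep, window_seconds) are
hypothetical stand-ins, not the actual update_ws or
debathena-auto-update interfaces, and the sketch does not model
locking against a still-sleeping earlier run.

```python
import random


def desync_delay(window_seconds, rng=random):
    """Pick a uniformly random delay in [0, window_seconds]."""
    return rng.uniform(0, window_seconds)


def check_then_desync(has_updates, apply_updates, sleep, window_seconds=14400):
    """Check first; only a machine with pending updates sleeps.

    The cron job can fire often because a run with nothing to do
    returns immediately without sleeping.
    """
    if not has_updates():
        return False
    sleep(desync_delay(window_seconds))
    apply_updates()
    return True


def desync_then_check(has_updates, apply_updates, sleep, window_seconds=14400):
    """Sleep first, then check.

    Every run pays the random delay, so the cron period should not be
    much shorter than window_seconds.
    """
    sleep(desync_delay(window_seconds))
    if not has_updates():
        return False
    apply_updates()
    return True
```

With the first ordering, a frequent cron job costs only a cheap check on
most runs; with the second, each run occupies the machine's update slot
for up to the whole desync window.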


> Jon and I decided that we thought this was reasonable, but I wanted to
> bring it up here just in case others had input.
> 
> Currently debathena-auto-update runs twice an hour, using desync with a
> range of 0-1000 seconds (about 17 minutes). This means that if there's a
> large update (a new version of OOo or something unfortunate like that),
> it's very conceivable that you could end up with an entire cluster
> downed by the update process (since you can't log in when updates are
> running).
> 
> I think that we should change the cron job to run every 2 hours instead
> of twice an hour, and adjust the argument to desync to space updates
> across that full 2 hour period.
> 
> I figure that (ignoring upgrades from one Ubuntu release to another) our
> worst case scenario update is bounded by how many changes show up in the
> apt repos in a 2 hour period. Given that, the worst case is probably an
> update that takes about 1/2 an hour to install (since Ubuntu doesn't do
> point-releases like Debian does). A 1/2 hour install period desynced
> over 2 hours results in about 3/4 of the heads in a cluster being usable
> at any given time during this worst-case update.
> 
> Given that such a worst-case update is relatively unlikely to happen
> normally, meaning that the average percentage of heads downed by an
> update would be much smaller, I think this is a reasonable period of time.
> 
> Do people object to me making that change?
> 
> - Evan
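
The worst-case arithmetic in the quoted message can be sketched as
follows. This assumes start times are spread uniformly over the desync
window and that every machine takes the same time to install; the
function names are illustrative only. Under those assumptions the peak
fraction of machines down at once is the install time divided by the
window, capped at 1.

```python
def peak_fraction_down(install_seconds, desync_window_seconds):
    """Largest fraction of machines updating at the same moment,
    assuming uniformly spread start times and equal install times."""
    if desync_window_seconds <= 0:
        return 1.0
    return min(1.0, install_seconds / desync_window_seconds)


def usable_fraction(install_seconds, desync_window_seconds):
    """Fraction of machines still usable at the worst moment."""
    return 1.0 - peak_fraction_down(install_seconds, desync_window_seconds)


if __name__ == "__main__":
    install = 30 * 60  # the half-hour worst-case install from the thread
    for label, window in [
        ("current, 0-1000 s", 1000),
        ("proposed, 2 hours", 2 * 3600),
        ("Athena 9, 4 hours", 4 * 3600),
    ]:
        print(f"{label}: {usable_fraction(install, window):.3f} usable")
```

For the half-hour install, this gives 0 usable with the 1000-second
window (every machine is updating at once for part of the time), 3/4
with a 2-hour window, and 7/8 with a 4-hour window.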


