[2634] in Release_7.7_team

Re: Please strongly consider backing out the zephyr servers

daemon@ATHENA.MIT.EDU (Camilla R Fox)
Tue Mar 6 23:33:15 2001

Message-Id: <200103070433.XAA17592@ops-2.mit.edu>
To: op@MIT.EDU
cc: Greg Hudson <ghudson@MIT.EDU>, Garry Zacheiss <zacheiss@MIT.EDU>,
        azary@MIT.EDU, John Hawkinson <jhawk@MIT.EDU>, release-team@MIT.EDU,
        winzephyr-release@MIT.EDU, wdc@MIT.EDU, jis@MIT.EDU
In-Reply-To: <EudMh7tz0001N_g2hN@mit.edu>
Date: Tue, 06 Mar 2001 23:33:12 -0500
From: Camilla R Fox <cfox@MIT.EDU>


First of all, I want people to know that I've been convinced by this
discussion that my decision not to back out the changes on Sunday
evening was the wrong one.  My apologies to everyone inconvenienced.

I agree that ops should be more willing to back out upgrades quickly
when they cause problems, but I'm a little troubled by the way the
issue is being drawn in black and white.  I still think there's a
judgement call to be made, although I freely admit that I messed up
this time.

Backing out any upgrade to a production server carries a risk of
catastrophic failure, even if the risk is slight; depending on the
service, there might also be a user-visible outage involved.

Suppose I had tried to back out the zephyr server upgrade, and had
accidentally lost the subscriptions database in the process?  That
would obviously be a case of operator error, but operator error is a
risk we each face every time we log in to a server.  I don't doubt that
the time of day chosen for the backout would have been criticized,
had it not gone smoothly.

Suppose this debate had happened after we had upgraded our AFS servers
to use a server binary that locked out occasional clients?  Had we
backed out that very day, it would have meant a 15-minute loss of
service to all clients, weighed against inconvenience to three or four
clients that we knew about, and perhaps others who were suffering in
silence.  I'm pretty sure that waiting for the weekly restart to
propagate a fix to the servers was consistent with the greater good.

I'd like the decision whether to back out a buggy upgrade to be a
matter of judgement, rather than a matter of policy.

Camilla
