[6924] in Release_7.7_team


Re: Lucid status update

daemon@ATHENA.MIT.EDU (Jonathan Reed)
Fri Aug 27 12:25:40 2010

Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Apple Message framework v1081)
From: Jonathan Reed <jdreed@MIT.EDU>
In-Reply-To: <7E82F7A5-C5DD-44D8-AD6B-AD7338D580E7@mit.edu>
Date: Fri, 27 Aug 2010 12:25:33 -0400
Message-Id: <17FD1BE7-4818-451F-8DFF-C340852BCBC3@mit.edu>
To: "release-team@MIT.EDU" <release-team@mit.edu>
Content-Transfer-Encoding: 8bit

There are 68 machines that failed the upgrade.  Hotline will be visiting these machines to resume the update process.  45 of them are cluster machines.

Of the 11 failed machines in W20 (which can be taken as representative of the whole cluster environment), here are the reasons for the failures:
- 7 machines failed to get DHCP addresses because the DHCP servers failed to respond in a timely manner.  (A few users around campus reported DHCP sluggishness last night around midnight, but by then it was too late to take any action for the release, and there was insufficient information to justify paging NIST staff.)  Retrying these installations worked fine.
- 1 machine was physically powered off.
- 1 machine had been removed from the cluster for hardware maintenance.
- 1 machine had its network cable unplugged.
- 1 machine failed a BIOS test.

Additionally, Hotline staff located two machines in the field (there are undoubtedly more) that failed the update because users had left themselves logged in overnight.  Those users were forcibly logged out, and the machines will retry tonight.

Overall, I believe these failures are representative of any network installation in our public cluster environment.  However, one potential mitigation for the DHCP failures is for future installations to use a static IP address for network configuration.  This functionality was broken in earlier versions of the Ubuntu installer, but we believe it works now.
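
As a rough illustration of what static network configuration could look like, here is a minimal sketch that renders debian-installer preseed lines for a static address.  The function name and the example addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical, and this is not necessarily how our installer would be configured; the netcfg keys are the ones documented for debian-installer preseeding.

```python
def static_netcfg_preseed(ip, netmask, gateway, nameservers):
    """Render debian-installer preseed lines for a static network setup.

    Hypothetical helper for illustration only. All values are supplied
    by the caller; nothing here is looked up from a real host database.
    """
    return "\n".join([
        # Skip DHCP entirely so a slow DHCP server cannot stall the install.
        "d-i netcfg/disable_dhcp boolean true",
        f"d-i netcfg/get_ipaddress string {ip}",
        f"d-i netcfg/get_netmask string {netmask}",
        f"d-i netcfg/get_gateway string {gateway}",
        f"d-i netcfg/get_nameservers string {' '.join(nameservers)}",
        # Accept the static settings without prompting for confirmation.
        "d-i netcfg/confirm_static boolean true",
    ])


# Example addresses are placeholders from the documentation range.
example = static_netcfg_preseed(
    "192.0.2.10", "255.255.255.0", "192.0.2.1", ["192.0.2.53"]
)
```

The per-machine values would have to come from wherever the machine's address is already recorded; that lookup is left out here.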

-Jon


On Aug 27, 2010, at 8:20 AM, Jonathan Reed wrote:

> At 8:01am, the desync period expired, so any machines that are going to update today have already begun the process.  
> 
> As of 8:15, there are:
> 
> - 221 completed Lucid installs
> - 193 in-progress Lucid installs (number may be high, as it includes any public machine which refuses athinfo connections)
> - 31 non-upgraded Jaunty machines (I'll visit a sampling of them later today to look at the upgrade.log, but I suspect it's machines that had someone logged in.)
> - 16 "other" (Athena 9, no route to host, etc)
> 
> ... out of a total of 461 machines (including clusters, quickstations, dorm machines and podium machines).  The in-progress installs should finish up over the next 1-8 hours (the higher values being reserved for 2-032 and 38-370, which each managed to only install 3 machines over a period of 6 hours).  The remainder of the Jaunty machines should upgrade tonight (Friday).
> 
> Lessons learned:
> - People use the clusters, even at 5am.
> - Installation takes way too long, and we should either start earlier or desync less.
> - Installation on 10/half networks is rapidly becoming unsustainable.
> - Retry sooner than 24 hours if the upgrade fails because someone is logged in (#694)
> - Add an athinfo query for the upgrade log
> - Having a modified athinfo daemon (which answers all queries with "Installation started at $timestamp") running during installation would be helpful, to differentiate "connection refused" because it's installing from "connection refused" because it's broken.
> 
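
The placeholder athinfo daemon suggested in the last item above could be sketched roughly as follows.  This is a minimal sketch, not the real athinfo daemon: it assumes athinfo is a plain TCP service where the client sends a one-line query, and the port number 49155 and all names below are assumptions for illustration.

```python
import socketserver
import time

# Assumption: the TCP port that athinfo clients connect to.
ATHINFO_PORT = 49155


def make_handler(started_at):
    """Build a handler that answers every query with the same message."""
    message = f"Installation started at {started_at}\n".encode("ascii")

    class PlaceholderHandler(socketserver.StreamRequestHandler):
        timeout = 5  # do not hang forever on a client that sends nothing

        def handle(self):
            try:
                self.rfile.readline()  # read the query line and ignore it
            except OSError:
                return  # client timed out or went away
            self.wfile.write(message)

    return PlaceholderHandler


class PlaceholderServer(socketserver.TCPServer):
    allow_reuse_address = True  # allow quick restarts during an install


def serve(port=ATHINFO_PORT):
    """Run the placeholder until the process is killed."""
    started_at = time.strftime("%Y-%m-%d %H:%M:%S")
    with PlaceholderServer(("", port), make_handler(started_at)) as server:
        server.serve_forever()
```

Calling serve() at the start of the installation would make a machine answer every query with the start time, so "connection refused" would then only mean "broken".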


