[3212] in Release_7.7_team


Cleanup plan for 9.0.25

daemon@ATHENA.MIT.EDU (Greg Hudson)
Fri Mar 29 11:04:36 2002

Date: Fri, 29 Mar 2002 11:01:07 -0500
Message-Id: <200203291601.LAA29900@error-messages.mit.edu>
From: Greg Hudson <ghudson@MIT.EDU>
To: release-team@mit.edu

The 9.0.25 release for Linux had two serious problems, one of which
has been talked about at great length and the other of which hasn't.

  1. Our frobbing of the EEPROM settings on GX150s put many machines
     in a worse state than they were before, knocking some off the net
     entirely and leaving others in a state where network performance
     is very slow and lossy.

Bill did some research and determined that we want to (1) set the
EEPROM values to be identical to the values as they come from Dell,
and (2) reset the MII unit after setting EEPROM values.  Garry
verified his findings.  I have put out a 9.0.26 patch release to the
dev cell which does those things.  We can put it out to the Athena
cell this weekend; we will determine the exact timing when Bill and
Jonathon are around.

Machines which are in the "slow network" state will hopefully be able
to take the update and be fixed by Monday.  Machines in the "no
network" state will obviously need to be visited.  There are four
machines in W20 which work fine but were pegged at 10Mbps by Garry;
these machines should go back to the pristine state with the 9.0.26
patch release.

I will leave it to Bill to identify who will visit the "no network"
machines and fix them.  If it's cluster services, they will need to be
issued carefully written instructions, of course.  This should happen
before students return from spring break on Monday, and perhaps sooner.

  2. The list-9.0.25 file contained two RPMs from
     redhat-7.1/RedHat/RPMs which had been obsoleted by update RPMs
     from Red Hat, specifically dump-static and
     XFree86-ISO8859-2-Type1-fonts.  As a result, our list file
     contained conflicts.  We found this out when Garry noticed that
     public workstation verification was spewing errors.

The update did not fail because "Obsoletes:" headers in the new RPMs
caused the old RPMs to be removed on machines taking the update, even
though the list files did not specify those removals.  Machines which
updated to 9.0.25 wound up with the proper set of RPMs, but their RPM
set did not match /var/athena/release-rpms.  Not a major catastrophe.
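
The mismatch described above is a set difference between what is
installed and what the list says should be installed.  A minimal
sketch, assuming both sides have already been reduced to plain package
names (the real format of /var/athena/release-rpms is not shown here,
and the function name and the sample names "pkg-a" and "pkg-b" are
hypothetical):

```python
def diff_rpm_sets(installed, expected):
    """Compare installed package names against the expected list.

    Returns (missing, unexpected): names the list expects but the
    machine lacks, and names the machine has that the list omits.
    Both inputs are assumed to be iterables of bare package names.
    """
    installed = set(installed)
    expected = set(expected)
    missing = sorted(expected - installed)
    unexpected = sorted(installed - expected)
    return missing, unexpected


# Illustrative data only: a machine that took the update has lost the
# two obsoleted packages, but the list still names them.
expected = ["pkg-a", "pkg-b", "dump-static",
            "XFree86-ISO8859-2-Type1-fonts"]
installed = ["pkg-a", "pkg-b"]

missing, unexpected = diff_rpm_sets(installed, expected)
for name in missing:
    print("listed but not installed:", name)
```

A check like this only reports the disagreement; it does not say which
side is right.  In this case the machine was correct and the list was
stale.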

Installs of 9.0.25 did not fail because we install with
--replacefiles.  However, they wound up with a set of conflicting
RPMs.  A future update which removed the dump-static and font RPMs
would presumably leave important files missing from the machine.
There is no mechanism in the update to avoid this problem.
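
To illustrate the hazard: when two installed packages both claim the
same path, removing one of them can delete a file the other still
needs.  The sketch below finds such shared paths from per-package file
manifests.  The manifests, package names, and paths are hypothetical
stand-ins, not the actual contents of the RPMs involved.

```python
from collections import defaultdict


def find_file_conflicts(manifests):
    """Return {path: [packages]} for every path owned by 2+ packages.

    manifests maps a package name to the list of paths it installs.
    """
    owners = defaultdict(list)
    for pkg, paths in manifests.items():
        for path in paths:
            owners[path].append(pkg)
    return {path: sorted(pkgs)
            for path, pkgs in owners.items() if len(pkgs) > 1}


# Hypothetical manifests: "old-pkg" was obsoleted by "new-pkg", but an
# install that overrides file conflicts leaves both on the machine.
manifests = {
    "old-pkg": ["/sbin/example", "/usr/share/doc/old-pkg/README"],
    "new-pkg": ["/sbin/example", "/sbin/example-helper"],
}

for path, pkgs in sorted(find_file_conflicts(manifests).items()):
    print(path, "is claimed by", ", ".join(pkgs))
```

Any path this reports is one that a later removal of the older package
could take away from the newer one.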

Recognizing the potential for disaster here, Garry and I acted this
morning.  Garry produced a list of all Linux machines which have
booted since Monday and I scanned them to determine which machines had
been installed at 9.0.25.  Fortunately, only four machines have been:
two test machines in W92 and two public machines in M1.  The public
machines are not a concern because public workstation verification
will rectify any missing files.

To prevent any more machines from being installed in the broken state
between now and 9.0.26, I edited list-9.0.25 to remove the obsoleted
RPMs and Garry propagated that change to the Athena cell.

When Andrew gets back from vacation, he should:

  * Modify the install not to use --replacefiles.  Using it is just
    asking for trouble.

  * Modify his procedures for building lists so that he can detect
    when RPMs have been obsoleted, so that we don't get into this
    situation again.
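
One possible shape for such a check, as a sketch only: given the names
in a candidate list and, for each package, the names it declares in
its "Obsoletes:" headers, report any listed package that another
listed package obsoletes.  How the Obsoletes data is collected is left
open here, version constraints on Obsoletes are ignored, and the
function and the obsoleting package names ("update-a", "update-b") are
hypothetical.

```python
def find_obsoleted(list_entries, obsoletes):
    """Return {obsoleted_name: [packages that obsolete it]}.

    list_entries: package names in the candidate list file.
    obsoletes: maps a package name to the names it obsoletes.
    Only names that actually appear in the list are reported.
    """
    listed = set(list_entries)
    hits = {}
    for pkg in list_entries:
        for old in obsoletes.get(pkg, ()):
            if old in listed and old != pkg:
                hits.setdefault(old, []).append(pkg)
    return hits


# Hypothetical data: which update RPMs carried the Obsoletes headers
# is not stated above, so placeholder names are used for them.
list_entries = ["update-a", "update-b", "dump-static",
                "XFree86-ISO8859-2-Type1-fonts"]
obsoletes = {
    "update-a": ["dump-static"],
    "update-b": ["XFree86-ISO8859-2-Type1-fonts"],
}

for old, by in sorted(find_obsoleted(list_entries, obsoletes).items()):
    print(old, "is obsoleted by", ", ".join(by))
```

A list that produces any output from a check like this should not be
released until the obsoleted entries are removed.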
