[756] in Release_7.7_team

Status on SGI situation.

daemon@ATHENA.MIT.EDU (Bill Cattey)
Wed Oct 16 20:13:03 1996

Date: Wed, 16 Oct 1996 20:10:16 -0400 (EDT)
From: Bill Cattey <wdc@MIT.EDU>
To: "Naomi B. Schmidt" <nschmidt@MIT.EDU>, Jonathon Weiss <jweiss@MIT.EDU>,
        dcns-cluster@MIT.EDU, ops@MIT.EDU
Cc: ghudson@MIT.EDU, vrt@MIT.EDU, miki@MIT.EDU, jweiss@MIT.EDU,
        release-team@MIT.EDU, jis@MIT.EDU, hoffmann@MIT.EDU
In-Reply-To: <199610162031.QAA07120@tla.MIT.EDU>

Despite an initial fear that the recent patch release had wrought
widespread havoc, we have confirmed that only two machines were affected,
both with non-standard hardware configurations:

Bob Mahoney's machine has again developed problems with audio.
Ron Hoffmann's machine is still down.

As for the earlier reports of problems installing: one machine, STROBE,
is still under investigation, but the causes of the faults in Building
W20 and 4-035 are now understood (and they PREDATE the patch release):

Before the patch release went out, whirr lost a disk partition.

It turned out that this partition had contained the bits to do phase 2
of the SGI install.  Two machines (and probably STROBE as well, which we
will confirm tomorrow) did phase 1, and then died because the majority
of the files were unavailable for install.

We should not have had the install server configured to use the Dev cell
for installs.  When Ops was told that the partitions on whirr could
stand to be down overnight, that advice was based on the incorrect
assumption that we were installing from the Athena cell.  At tomorrow's
release team meeting we will begin discussing how to make sure our
install servers are installing from where they should be.

This situation has either pointed out or re-raised issues about:
	confidence in the testing of updates
	robustness of installs
	robustness of updates
	coverage across updates
	coverage in problem situations
	accurate reporting of problem situations
	information sharing on problem situations
	determination of severity and scope of problem situations
	maintenance of the install servers
Mike, Jonathon, and I have already begun discussing what can be done to
improve in these areas.

Unfortunately, some of this situation was due to having too few warm
bodies to cover all the bases, or to function as backup when the sole
responsible party was away.

I'll send out another note when we have determined some actions to take.
If you don't see such a note from me in a week or so, you are welcome to
visit my office and REMIND ME of this promise.

-wdc
