[756] in Release_7.7_team
Status on SGI situation.
daemon@ATHENA.MIT.EDU (Bill Cattey)
Wed Oct 16 20:13:03 1996
Date: Wed, 16 Oct 1996 20:10:16 -0400 (EDT)
From: Bill Cattey <wdc@MIT.EDU>
To: "Naomi B. Schmidt" <nschmidt@MIT.EDU>, Jonathon Weiss <jweiss@MIT.EDU>,
dcns-cluster@MIT.EDU, ops@MIT.EDU
Cc: ghudson@MIT.EDU, vrt@MIT.EDU, miki@MIT.EDU, jweiss@MIT.EDU,
release-team@MIT.EDU, jis@MIT.EDU, hoffmann@MIT.EDU
In-Reply-To: <199610162031.QAA07120@tla.MIT.EDU>
Despite an initial fear that the recent patch release had wrought
widespread havoc, we have confirmed that only two machines were
affected, both with non-standard hardware configurations:
Bob Mahoney's machine has again developed problems with audio.
Ron Hoffmann's machine is still down.
Of the earlier reports of install problems, one machine, STROBE, is
still under investigation, but the cause of the faults in Building W20
and 4-035 is now understood (and they PRECEDED the patch release):
Before the patch release went out, whirr lost a disk partition.
It turned out that this partition had contained the bits for phase 2
of the SGI install. Two machines (and probably STROBE as well, which we
will confirm tomorrow) did phase 1 and then died because the majority
of the files were unavailable for install.
We should not have had the install server configured to use the Dev cell
for installs. When Ops was told that the partitions on whirr could
stand to be down overnight, that advice rested on the incorrect
assumption that we were installing from the Athena cell. At tomorrow's
release team meeting we will begin discussing how to make sure our
install servers are installing from where they should be.
This situation has either pointed out or re-raised issues about:
confidence in the testing of updates
robustness of installs
robustness of updates
coverage across updates
coverage in problem situations
accurate reporting of problem situations
information sharing on problem situations
determination of severity and scope of problem situations
maintenance of the install servers
Mike, Jonathon, and I have already begun discussing what can be done to
improve in these areas.
Unfortunately, some of this situation was due to having too few warm
bodies to cover all the bases, or to function as backup when the sole
responsible party was away.
I'll send out another note when we have determined some actions to take.
If you don't see such a note from me in a week or so, you are welcome to
visit my office and REMIND ME of this promise.
-wdc