[758] in Release_7.7_team
Re: Working the SGI issues at the Release Team Meeting.
daemon@ATHENA.MIT.EDU (Bill Cattey)
Thu Oct 17 14:38:31 1996
Date: Thu, 17 Oct 1996 14:38:12 -0400 (EDT)
From: Bill Cattey <wdc@MIT.EDU>
To: release-team@MIT.EDU
Cc: dcns-cluster@MIT.EDU, ops@MIT.EDU, jis@MIT.EDU, coc@MIT.EDU, rar@MIT.EDU,
rferrara@MIT.EDU, vkumar@MIT.EDU, hoffmann@MIT.EDU
Background:
Coinciding with an SGI patch release yesterday, problems occurred with
the installation workstations, raising concerns about the procedures for
testing, delivering, and debugging releases. In a note some of you have
seen, I said I'd send something out in a week or so detailing plans to
deal with the issues raised. Here it is the next day! - wdc
Improvements
* The software used for installing and updating SGI workstations
needs better error detection and reporting.
- ghudson will look into this
* Notification of when updates will occur needs to be earlier
and more widespread.
- see Notification Procedure below
* A defined group of people must agree upon the timing of a
release. (We decided that the Release team was indeed the
proper group, recognizing that Cluster Support was a major
stakeholder)
- see Notification Procedure below.
* Athena Operations needs to know about the configuration of
the SGI boot server, pepe-lepew. The boot server is vital to
the installation of workstations.
- vrt is working on this.
* There should be a single bootp front end for installs, to improve
robustness and maintainability.
- Mike Whitson is working on this (delayed due to his
course workload).
* In order to expand the number of people who can deal with
potential problems, the lore of SGIs must be more widely
disseminated.
- We need more warm bodies. Bill gets some training.
Ops gets more training.
* In order to effectively diagnose problems, operations and
development personnel need more details when a problem is
seen.
- We'll work with Cluster Support on this.
* There should be an identified first contact in the event of
trouble.
- see Call Chain below
* There should be a well defined order of contacts for dealing
with coverage, backup, and problem escalation.
- see Call Chain below
Call Chain:
Customer -> Cluster -> Athena Server Ops -> Athena Soft Suppt
This process requires that:
Cluster is informed about the state of the system.
Cluster goes to Athena Server Operations with problems.
When ASO can't solve the problem, there is a
person in Dev who will take the ball and
get the problem solved.
After the problem is solved, Cluster goes to ASO
for the next problem, not the person in Dev.
Mike and Bill are on the hook to make sure the right
thing happens in Dev.
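For anyone who finds it easier to read the chain as a lookup, here is a
rough sketch of the escalation order above. The function names are made up
for illustration and are not part of any Athena tool; only the four contact
names come from the call chain itself.

```python
# Escalation order, copied from the call chain above.
CALL_CHAIN = ["Customer", "Cluster", "Athena Server Ops", "Athena Soft Suppt"]


def next_contact(current):
    """Return who `current` escalates to, or None at the end of the chain.

    Hypothetical helper, for illustration only.
    """
    i = CALL_CHAIN.index(current)
    return CALL_CHAIN[i + 1] if i + 1 < len(CALL_CHAIN) else None


def entry_point_for_cluster():
    """Cluster starts every new problem with Athena Server Ops,
    even if a developer handled the previous one."""
    return next_contact("Cluster")
```

The point of the second function is the rule that Cluster goes back to ASO
for each new problem, not to whoever in Dev fixed the last one.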
Notification Procedure:
Release team is told that an update is ready.
Folks in the team confirm the intended delivery date is sane.
Note: There is an expectation that incremental releases
are tested for at least a week in the
Dev Cell.
(Sanity includes that there are people in ops and dev who
are known to be available to help in the event of trouble.)
Cluster support MUST be told of an impending release, and
when the release has ACTUALLY happened.
(Next week Oliver will have an inventory of a good set of
lists to send notice to and a good rule of thumb as to how
early notice goes out. The group will discuss, amend, and
appropriately adopt that list of lists and that rule of thumb
as a formal notification procedure.)
-wdc