[758] in Release_7.7_team


Re: Working the SGI issues at the Release Team Meeting.

daemon@ATHENA.MIT.EDU (Bill Cattey)
Thu Oct 17 14:38:31 1996

Date: Thu, 17 Oct 1996 14:38:12 -0400 (EDT)
From: Bill Cattey <wdc@MIT.EDU>
To: release-team@MIT.EDU
Cc: dcns-cluster@MIT.EDU, ops@MIT.EDU, jis@MIT.EDU, coc@MIT.EDU, rar@MIT.EDU,
        rferrara@MIT.EDU, vkumar@MIT.EDU, hoffmann@MIT.EDU

Background:

Coinciding with an SGI patch release yesterday, problems occurred with
the installation workstations, raising concerns about the procedures for
testing, delivering, and debugging releases.  In a note some of you have
seen, I said I'd send something out in a week or so detailing plans to
deal with the issues raised.  Here it is the next day!  - wdc

Improvements:

	* The software used for installing and updating SGI workstations
	  needs better error detection and reporting.
	        - ghudson will look into this

	* Notification of when updates will occur needs to be earlier
	  and more widespread.
	        - see Notification Procedure below

	* A defined group of people must agree upon the timing of a
	  release.  (We decided that the Release Team was indeed the
	  proper group, recognizing that Cluster Support was a major
	  stakeholder.)
		- see Notification Procedure below.

	* Athena Operations needs to know about the configuration of
	  the SGI boot server, pepe-lepew.  The boot server is vital to
	  the installation of workstations.
	        - vrt is working on this.

	* There should be a single bootp front end for installs, to improve
	  robustness and maintainability.
		- Mike Whitson is working on this (delayed due to his
		  course workload).

	* In order to expand the number of people who can deal with
	  potential problems, the lore of SGIs must be more widely
	  disseminated.
	        - We need more warm bodies.  Bill gets some training.
	          Ops gets more training.

	* In order to effectively diagnose problems, operations and
	  development personnel need more details when a problem is
	  seen.
		- We'll work with Cluster Support on this.

	* There should be an identified first contact in the event of
	  trouble. 
	        - see Call Chain below

	* There should be a well-defined order of contacts for dealing
	  with coverage, backup, and problem escalation.
	        - see Call Chain below

Call Chain:

	Customer -> Cluster -> Athena Server Ops -> Athena Soft Suppt

	This process requires that:
	        Cluster is informed about the state of the system.
	        Cluster goes to Athena Server Operations with problems.
	        When ASO can't solve the problem, there is a
	                person in Dev who will take the ball and
	                get the problem solved.
	        After the problem is solved, Cluster goes to ASO
	                for the next problem, not the person in Dev.
	        Mike and Bill are on the hook to make sure the right
	                thing happens in Dev.


Notification Procedure:

	Release team is told that an update is ready.
	Folks in the team confirm the intended delivery date is sane.
	Note: There is an expectation that testing of incremental
	        releases goes on for at least a week in the
	        Dev Cell.
	(Sanity includes that there are people in ops and dev who
	are known to be available to help in the event of trouble.)
	Cluster support MUST be told of an impending release, and
	        when the release has ACTUALLY happened.
	(Next week Oliver will have an inventory of a good set of
	lists to send notice to and a good rule of thumb as to how
	early notice goes out. The group will discuss, amend, and
	appropriately adopt that list of lists and that rule of thumb
	as a formal notification procedure.)

-wdc
