[1350] in Release_7.7_team


Report on 4-035 cluster outage.

daemon@ATHENA.MIT.EDU (Bill Cattey)
Fri Jul 10 18:18:47 1998

Date: Fri, 10 Jul 1998 18:18:14 -0400 (EDT)
From: Bill Cattey <wdc@MIT.EDU>
To: nschmidt@MIT.EDU, fuzzballs@MIT.EDU, release-team@MIT.EDU,
        cluster-services@MIT.EDU
Cc: ashein@boston.sgi.com

Lou Isgur, Bob Basch, Andy Shein and I just got back from 4-035. 
Apparently at 9:19 AM all but one of the machines there hung.  Our
theory about what happened goes like this:

	Subnet 18.53 had a lot of collisions at that time.
	A bug in the IRIX 6.3 ethernet driver hung the machines.
	One machine did not hang.
	Four others had been rebooted before we arrived.

The one machine that did not hang somehow had been inadvertently updated
to Athena 8.2. We guess this machine auto-updated itself during a brief
period when the cluster information was wrong.  It has the right cluster
information now, and we found it with 8.2 installed but 8.1 packs
attached.

Andy Shein (whom Lou and I happened across on our way to check out the
situation, and whom we invited to help us figure things out) called our
attention to a patch in the Ethernet driver that seemed relevant.  Sure
enough, the patch is in Athena 8.2, but not in Athena 8.1.  Special
thanks, Andy!

Actions to take:

Bob Basch is now investigating what the relative impact would be of either:
	Adding the patch to Athena 8.1
	Updating the whole cluster to Athena 8.2 before we have completed
	testing of it.

We still gotta figure out what to do about the one machine that has Athena
8.2 installed, but is attaching Athena 8.1 packs. Bob opines that
although this makes it resistant to the ethernet problem, it might
result in hard-to-identify problems relating to the mismatch.

-wdc
