[51542] in Hotline Meeting

Re: The SGI cluster walk

daemon@ATHENA.MIT.EDU (Robert A Basch)
Wed Jul 26 18:18:24 2000

Message-Id: <200007262218.SAA14155@m4-035-10.mit.edu>
To: Mitchell E Berger <mitchb@MIT.EDU>
Cc: zacheiss@MIT.EDU, cfox@MIT.EDU, hotline@MIT.EDU, rbasch@MIT.EDU
In-Reply-To: Your message of "Wed, 26 Jul 2000 09:58:53 EDT."
             <200007261358.JAA08638@w20-575-120.mit.edu> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Wed, 26 Jul 2000 18:18:18 -0400
From: Robert A Basch <rbasch@MIT.EDU>

I looked at the machines you reported...

Hotline:  W20-575-116 (no video) and W20-575-124 (bad disk) seem to have
hardware problems.

> Though Garry, Camilla, and company didn't have the energy to do the cluster
> walk early this morning, I had nothing else to do and no real transportation
> to get home, so I did the walk anyway.  Not every machine was broken, but a few were...
> this is the group that I wasn't able to solve by logging someone out.  If this
> information would be useful somewhere, feel free to send it there:
> 
> M4-035-9: It looked like it had scheduled an update for about an hour before
> I checked on it (which is over 3 hours ago now) and athinfo still says 8.3.

The disk was full on this machine, and so the update script properly
did not proceed with the update.  I found and nuked some huge audio files,
and the update started shortly after I logged out.

> W20-575-116: Seems to be running 8.4, but has no video signal.

It seems so.

> W20-575-115: Broken 8.3 - has no motd, root login hangs, complains about
> losing contact with volume location servers and hesiod connection refusals.
> Says no cluster info available.

There was a user on it when I got there, who said it had been unplugged from
the network when he arrived.  It's still running 8.3.

> W20-575-120: The machine I'm on now.  Someone was using it earlier.  It has
> an update scheduled for about 50 min from now.  I guess I'll leave it - maybe
> it'll update fine.

It updated later in the morning.

> W20-575-124: Hanging on boot with many errors about being unable to connect to
> fam.  Looks reasonably dead.

It was powered off when I got there.  I turned it on and saw a bunch of
disk read errors, so I powered it off again and left it that way.

> W20-575-63: Had similar symptoms to the one above after I rebooted it.  Prior
> to that, the console was just spitting errors about being unable to open
> display: :0.0 continuously.  The guy working on this cluster (it's closed)
> says it gave him a login prompt and he shut it down, but I'm skeptical.

Not sure what happened here.  It went through just about all of the miniroot
setup, but did not set the prom to boot into it, and so rebooted off the
system disk.  Since it also did not remove the miniroot mount point, I'll
assume that something bad happened to the machine before the script completed.
When I set the prom to boot into the swap partition and rebooted, the
update proceeded, and looked healthy.

> W20-575-62: An 8.3 system that I think has been hacked.  It "failed to 
> activate properly"... i.e. doesn't attach packs.  I rebooted and saw several
> other random errors and it added the setuid root cells suspiciously slowly.
> That guy working on the cluster, I believe, e-mailed hotline about this one.

This was being reinstalled.

Thanks,
Bob
