[7612] in Release_7.7_team


Re: cluster-machines without athinfod

daemon@ATHENA.MIT.EDU (Benjamin Kaduk)
Wed Aug 17 18:23:47 2011

Date: Wed, 17 Aug 2011 18:23:39 -0400 (EDT)
From: Benjamin Kaduk <kaduk@MIT.EDU>
To: Jonathon Weiss <jweiss@MIT.EDU>
cc: release-team@MIT.EDU
In-Reply-To: <201108172210.p7HMACTF014223@speaker-for-the-dead.mit.edu>
Message-ID: <alpine.GSO.1.10.1108171820260.7526@multics.mit.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed

On Wed, 17 Aug 2011, Jonathon Weiss wrote:

>
> The following cluster/dorm machines are not running athinfod (nagios
> reports "CRITICAL - Could not create socket: Connection refused"):
>
> w20-575-80.mit.edu
> w20-575-48.mit.edu
> mccormick-1.mit.edu
> m66-080-19.mit.edu
> m66-080-1.mit.edu
> m1-115-14.mit.edu
> m1-115-1.mit.edu
> eos-8.mit.edu
>
> Nagios reports "(Service Check Timed Out)" for these, but the machine pings.
> simmons-1.mit.edu
> m1-115-21.mit.edu
>
> For either or both groups, does anyone want to investigate these at
> all, or should I just request that hotline re-install them?

I don't think reinstallation will be necessary for most of them.
I took a look at w20-575-{80,48} just now. Both were hung while trying to 
shut down or reboot, but a three-finger salute (Ctrl-Alt-Del) brought them 
back to life.
The console messages after the Ctrl-Alt-Del seemed to indicate that the 
login chroot was somehow still around and "busy", which is probably what 
caused the hang.
I can't think of a good way to debug this after the fact, though. I think 
we may have to trigger the bug on a machine with sshd running or a serial 
console attached in order to get much useful information.

I've seen machines hung in shutdown/reboot before, while wandering through 
clusters for other purposes; I suspect this issue is relatively common.
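
For sorting a list of hosts into the two groups Nagios reported 
("Connection refused" versus a timed-out check), a minimal sketch along 
these lines might help. It assumes athinfod listens on TCP port 49155, 
which is not stated in this thread and should be checked against 
/etc/services on a working machine; the names probe and classify_error 
are made up for illustration.

```python
import socket

# Assumption: athinfod listens on TCP 49155. Verify before relying on this.
ATHINFO_PORT = 49155


def classify_error(exc):
    """Map the result of a connection attempt to a short status string."""
    if exc is None:
        return "listening"
    if isinstance(exc, ConnectionRefusedError):
        # Host answered, but nothing is accepting connections on the port.
        return "refused"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        # No answer within the timeout.
        return "timeout"
    # Anything else, e.g. a name-resolution failure.
    return "error"


def probe(host, port=ATHINFO_PORT, timeout=5.0):
    """Attempt a TCP connection to host:port and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return classify_error(None)
    except OSError as exc:
        return classify_error(exc)


if __name__ == "__main__":
    import sys

    # Usage: python probe.py host1 host2 ...
    for name in sys.argv[1:]:
        print(name, probe(name))
```

This only reports what the port looks like from the network; it says 
nothing about why a machine is in that state, so a console visit is 
still needed for the hung ones.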

-Ben
