[7613] in Release_7.7_team


Re: cluster-machines without athinfod

daemon@ATHENA.MIT.EDU (Jonathan Reed)
Wed Aug 17 18:36:40 2011

Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: text/plain; charset=us-ascii
From: Jonathan Reed <jdreed@MIT.EDU>
In-Reply-To: <alpine.GSO.1.10.1108171820260.7526@multics.mit.edu>
Date: Wed, 17 Aug 2011 18:36:33 -0400
Cc: Jonathon Weiss <jweiss@mit.edu>, release-team@mit.edu
Message-Id: <57FE9A28-6F89-42A6-86E8-220AF87D91D4@mit.edu>
To: Benjamin Kaduk <kaduk@mit.edu>
Content-Transfer-Encoding: 8bit

nmap suggested that portmap was still running, but nothing else was. That may indicate where in the shutdown sequence (is there still a sequence in the Upstart world?) it was failing.

-Jon

On Aug 17, 2011, at 6:23 PM, Benjamin Kaduk wrote:

> On Wed, 17 Aug 2011, Jonathon Weiss wrote:
> 
>> 
>> The following cluster/dorm machines are not running athinfod (nagios
>> reports "CRITICAL - Could not create socket: Connection refused"):
>> 
>> w20-575-80.mit.edu
>> w20-575-48.mit.edu
>> mccormick-1.mit.edu
>> m66-080-19.mit.edu
>> m66-080-1.mit.edu
>> m1-115-14.mit.edu
>> m1-115-1.mit.edu
>> eos-8.mit.edu
>> 
>> Nagios reports "(Service Check Timed Out)" for these, but the machine pings.
>> simmons-1.mit.edu
>> m1-115-21.mit.edu
>> 
>> For either or both groups, does anyone want to investigate these at
>> all, or should I just request that hotline re-install them?
> 
> I don't think reinstallation will be necessary for most of them.
> I took a look at w20-575-{80,48} just now, and both were hung trying to shut down/reboot, but a three-finger salute brought them back to life.
> The console messages post c-a-d seemed to indicate that the login chroot was somehow still around and "busy", which is probably what was causing the hang.
> I don't have thoughts for a good way to debug post-facto, though -- I think we may have to trigger the bug on a machine with an sshd or serial console in order to get much useful information.
> 
> I've seen machines hung in shutdown/reboot before, while wandering through clusters for other purposes; I suspect this issue is relatively common.
> 
> -Ben
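
The two Nagios failure modes quoted above ("Connection refused" versus "Service Check Timed Out") can be told apart with a plain TCP connect, which may help when triaging a list of hosts by hand. The following is a minimal sketch, not the actual Nagios check: the function name probe_tcp is hypothetical, and the port number 49155 for athinfod is an assumption that should be verified against the local services/inetd configuration before use.

```python
import socket

# Assumption: athinfod listens on TCP port 49155; verify locally before relying on it.
ATHINFO_PORT = 49155


def probe_tcp(host, port=ATHINFO_PORT, timeout=3.0):
    """Attempt one TCP connection and classify the result.

    Returns one of:
      "open"    - something accepted the connection
      "refused" - host answered, but nothing is listening on the port
      "timeout" - no answer within `timeout` seconds
      "error"   - any other failure (DNS lookup, unreachable network, ...)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"
```

Calling probe_tcp("w20-575-80.mit.edu") for each host in the lists above would be expected to reproduce the split Nagios reported, with "refused" suggesting the daemon is gone while the kernel is still answering, and "timeout" suggesting a machine that pings but is otherwise wedged.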


