[7615] in Release_7.7_team
Re: cluster-machines without athinfod
daemon@ATHENA.MIT.EDU (Jonathan Reed)
Thu Aug 18 17:05:59 2011
From: Jonathan Reed <jdreed@MIT.EDU>
In-Reply-To: <201108182017.p7IKHNuu009878@outgoing.mit.edu>
Date: Thu, 18 Aug 2011 17:05:51 -0400
Cc: Benjamin Kaduk <kaduk@mit.edu>, release-team@mit.edu
Message-Id: <0F7407D1-3BA2-40A4-9C5F-783C8BFA5244@mit.edu>
To: Jonathon Weiss <jweiss@mit.edu>
Hrm. I think I want to blame the kexec reboot in reactivate. Should be easy to check with logs, however.
But I'm out tomorrow, and won't get to this until Monday.
Reboots are fast enough (thanks to Upstart) that we should maybe just drop the kexec.
However, if something on Natty is causing processes to stick around inside the chroot longer than usual, we should investigate.
-Jon
On Aug 18, 2011, at 4:17 PM, Jonathon Weiss wrote:
>
> The list for the first group now looks like:
>
> w20-575-85.mit.edu
> w20-575-8.mit.edu
> w20-575-28.mit.edu
> w20-575-24.mit.edu
> mccormick-7.mit.edu
> mccormick-1.mit.edu
> m38-370-4.mit.edu
> m1-115-14.mit.edu
> hayden-8.mit.edu
> eos-8.mit.edu
>
> That includes at least 5 machines that weren't there yesterday, so
> something is triggering this problem relatively often.
>
> We can ask hotline to reboot them, but this is clearly going to be a
> constant chase until the underlying cause is found/fixed.
>
> Jonathon
>
>> On Wed, 17 Aug 2011, Jonathon Weiss wrote:
>>
>>>
>>> The following cluster/dorm machines are not running athinfod (nagios
>>> reports "CRITICAL - Could not create socket: Connection refused"):
>>>
>>> w20-575-80.mit.edu
>>> w20-575-48.mit.edu
>>> mccormick-1.mit.edu
>>> m66-080-19.mit.edu
>>> m66-080-1.mit.edu
>>> m1-115-14.mit.edu
>>> m1-115-1.mit.edu
>>> eos-8.mit.edu
>>>
>>> Nagios reports "(Service Check Timed Out)" for these, but the machines still ping:
>>> simmons-1.mit.edu
>>> m1-115-21.mit.edu
>>>
>>> For either or both groups, does anyone want to investigate these at
>>> all, or should I just request that hotline re-install them?
>>
>> I don't think reinstallation will be necessary for most of them.
>> I took a look at w20-575-{80,48} just now, and both were hung trying
>> to shut down/reboot, but a three-finger salute brought them back to
>> life.
>> The console messages post c-a-d seemed to indicate that the login
>> chroot was somehow still around and "busy", which is probably what was
>> causing the hang.
>> I don't have thoughts for a good way to debug post-facto, though -- I
>> think we may have to trigger the bug on a machine with an sshd or
>> serial console in order to get much useful information.
>>
>> I've seen machines hung in shutdown/reboot before, while wandering
>> through clusters for other purposes; I suspect this issue is
>> relatively common.
>>
>> -Ben