[7614] in Release_7.7_team


Re: cluster-machines without athinfod

daemon@ATHENA.MIT.EDU (Jonathon Weiss)
Thu Aug 18 16:17:30 2011

Message-Id: <201108182017.p7IKHNuu009878@outgoing.mit.edu>
To: Benjamin Kaduk <kaduk@mit.edu>
cc: Jonathon Weiss <jweiss@mit.edu>, release-team@mit.edu
In-reply-to: <alpine.GSO.1.10.1108171820260.7526@multics.mit.edu>
Date: Thu, 18 Aug 2011 16:17:22 -0400
From: Jonathon Weiss <jweiss@MIT.EDU>


The list for the first group now looks like:

w20-575-85.mit.edu
w20-575-8.mit.edu
w20-575-28.mit.edu
w20-575-24.mit.edu
mccormick-7.mit.edu
mccormick-1.mit.edu
m38-370-4.mit.edu
m1-115-14.mit.edu
hayden-8.mit.edu
eos-8.mit.edu

That includes at least 5 machines that weren't on the list yesterday,
so something is triggering this problem relatively often.
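
As a rough illustration of the day-over-day comparison, the two lists
can be diffed as sets. This is only a sketch: the hostnames are copied
from this thread, while the function name compare_reports and the
labels "new", "cleared", and "persisting" are made up for the example.
"cleared" only means the host is no longer on the list, not that
anyone confirmed it is healthy.

```python
# Hosts reported as not running athinfod, copied from the two messages.
YESTERDAY = {
    "w20-575-80.mit.edu", "w20-575-48.mit.edu", "mccormick-1.mit.edu",
    "m66-080-19.mit.edu", "m66-080-1.mit.edu", "m1-115-14.mit.edu",
    "m1-115-1.mit.edu", "eos-8.mit.edu",
}
TODAY = {
    "w20-575-85.mit.edu", "w20-575-8.mit.edu", "w20-575-28.mit.edu",
    "w20-575-24.mit.edu", "mccormick-7.mit.edu", "mccormick-1.mit.edu",
    "m38-370-4.mit.edu", "m1-115-14.mit.edu", "hayden-8.mit.edu",
    "eos-8.mit.edu",
}


def compare_reports(old, new):
    """Split two report snapshots into new, cleared, and persisting hosts.

    compare_reports is a hypothetical helper, not part of any monitoring tool.
    """
    return {
        "new": sorted(new - old),         # listed now, not listed before
        "cleared": sorted(old - new),     # listed before, not listed now
        "persisting": sorted(old & new),  # listed in both snapshots
    }


if __name__ == "__main__":
    for label, hosts in compare_reports(YESTERDAY, TODAY).items():
        print(f"{label} ({len(hosts)}):")
        for host in hosts:
            print(f"  {host}")
```

Run against these two lists, it reports seven hosts that were not
listed the day before, which is consistent with "at least 5".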

We can ask hotline to reboot them, but this is clearly going to be a
constant chase until the underlying cause is found and fixed.

	Jonathon




> On Wed, 17 Aug 2011, Jonathon Weiss wrote:
> 
> >
> > The following cluster/dorm machines are not running athinfod (nagios
> > reports "CRITICAL - Could not create socket: Connection refused"):
> >
> > w20-575-80.mit.edu
> > w20-575-48.mit.edu
> > mccormick-1.mit.edu
> > m66-080-19.mit.edu
> > m66-080-1.mit.edu
> > m1-115-14.mit.edu
> > m1-115-1.mit.edu
> > eos-8.mit.edu
> >
> > Nagios reports "(Service Check Timed Out)" for these, but the machine pings.
> > simmons-1.mit.edu
> > m1-115-21.mit.edu
> >
> > For either or both groups, does anyone want to investigate these at
> > all, or should I just request that hotline re-install them?
> 
> I don't think reinstallation will be necessary for most of them.
> I took a look at w20-575-{80,48} just now, and both were hung trying
> to shut down/reboot, but a three-finger salute brought them back to
> life.
> The console messages post c-a-d seemed to indicate that the login
> chroot was somehow still around and "busy", which is probably what was
> causing the hang.
> I don't have thoughts for a good way to debug post-facto, though -- I
> think we may have to trigger the bug on a machine with an sshd or
> serial console in order to get much useful information.
> 
> I've seen machines hung in shutdown/reboot before, while wandering
> through clusters for other purposes; I suspect this issue is
> relatively common.
> 
> -Ben
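
The two Nagios states quoted above ("Connection refused" versus
"Service Check Timed Out") can be told apart with a plain TCP connect.
The following is a minimal sketch, not the actual Nagios check: the
function name probe and its return labels are invented for this
example, and the port number 49155 is an assumption (nothing in this
thread states which port athinfod listens on).

```python
import socket

# Assumption: athinfod is reached over TCP on this port. The thread does
# not state the port, so adjust it to match the local configuration.
ATHINFO_PORT = 49155


def probe(host, port=ATHINFO_PORT, timeout=5.0):
    """Classify a TCP connect attempt (hypothetical helper).

    Returns "open" if the connection succeeds, "refused" if the host
    actively rejects it (nothing listening), "timeout" if there is no
    answer within the timeout, and "error" for anything else, such as a
    name-resolution failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"


if __name__ == "__main__":
    for name in ("mccormick-1.mit.edu", "simmons-1.mit.edu"):
        print(name, probe(name))
```

A "refused" result suggests the machine's network stack is up but
nothing is listening on the port, while "timeout" on a host that still
pings would match the second group in the quoted report.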
