XVM Outage

daemon@ATHENA.MIT.EDU (Evan Broder)
Sat Feb 7 19:15:05 2009

Message-ID: <498E23C2.5040208@mit.edu>
Date: Sat, 07 Feb 2009 19:13:54 -0500
From: Evan Broder <broder@MIT.EDU>
Reply-To: xvm@MIT.EDU
MIME-Version: 1.0
To: xvm-contacts@mit.edu
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Today, from approximately 2:00 PM until about 6:45 PM, we experienced a
problem on shadow-moses, one of the four XVM host machines. Despite our
best efforts, we were unable to recover from the outage, and were forced
to reboot shadow-moses, losing state for 37 VMs that were running on
that server.

The outage also caused downtime for the xvm.mit.edu website.

For those of you interested in more technical detail about the outage:
at approximately 2:10 PM, we noticed that xend on shadow-moses had
stopped accepting new connections, causing programs like `xm list` to
hang indefinitely at the terminal. We attempted to restart xend, but the
old process didn't shut down cleanly, making it impossible to start a
new daemon. Further investigation of the old xend revealed that all but
3 of its threads were spinning in a futex(2) call on a single location
in memory. Our current suspicion is that some thread switched from a
Python context into a C context, and then died for some reason without
releasing the GIL, causing all of the other threads to deadlock while
they waited for the GIL to be released.
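
As a rough illustration of the suspected failure mode (this is not xend
code, and the cause above is still only a suspicion), the Python sketch
below uses an ordinary threading.Lock as a stand-in for the GIL. One
thread takes the lock and exits without releasing it, and every other
thread then fails to acquire it. The names run_demo, holder_that_dies,
and waiter are hypothetical, and the timeout exists only so the demo
terminates; with a plain blocking acquire the waiters would hang
forever, which on Linux typically shows up as threads parked in futex(2).

```python
import threading


def holder_that_dies(lock):
    # Take the lock and return without releasing it, mimicking a thread
    # that died while still holding the GIL (assumed scenario).
    lock.acquire()


def waiter(lock, results, index):
    # A real waiter would block indefinitely; the timeout is only here
    # so that the demonstration finishes.
    acquired = lock.acquire(timeout=0.2)
    results[index] = acquired
    if acquired:
        lock.release()


def run_demo(num_waiters=3):
    # Hypothetical stand-in for the GIL: one lock shared by all threads.
    lock = threading.Lock()

    holder = threading.Thread(target=holder_that_dies, args=(lock,))
    holder.start()
    holder.join()  # the holder thread is gone, but the lock is still held

    results = [None] * num_waiters
    waiters = [
        threading.Thread(target=waiter, args=(lock, results, i))
        for i in range(num_waiters)
    ]
    for t in waiters:
        t.start()
    for t in waiters:
        t.join()
    return results


if __name__ == "__main__":
    # Each entry records whether that waiter managed to get the lock.
    print(run_demo())
```

Because nothing is left to release the lock once the holder has exited,
no amount of waiting helps the remaining threads. That matches what we
saw: the old xend could not shut down cleanly and had to be dealt with
from outside the process.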

We are very sorry for the inconvenience this caused for users of VMs on
shadow-moses, and for anyone attempting to use the website during the
outage.

 - Evan
   For the SIPB XVM team