[1081] in athena10
XVM Outage
daemon@ATHENA.MIT.EDU (Evan Broder)
Sat Feb 7 19:15:05 2009
Message-ID: <498E23C2.5040208@mit.edu>
Date: Sat, 07 Feb 2009 19:13:54 -0500
From: Evan Broder <broder@MIT.EDU>
Reply-To: xvm@MIT.EDU
To: xvm-contacts@mit.edu
Today, from approximately 2:00 PM until about 6:45 PM, we experienced a
problem on shadow-moses, one of the four XVM host machines. Despite our
best efforts, we were unable to recover from the outage, and were forced
to reboot shadow-moses, losing state for 37 VMs that were running on
that server.
The outage also caused downtime for the xvm.mit.edu website.
For those of you interested in more technical information surrounding
that outage, at approximately 2:10 PM, we noticed that xend on
shadow-moses stopped accepting new connections, causing programs like
`xm list` to hang indefinitely at the terminal. We attempted to restart
xend, but the old process didn't shut down cleanly, making it impossible
to start a new daemon. Further investigation of the old xend revealed
that all but 3 of the threads were spinning in a futex(2) call on a
single location in memory. Our current suspicion is that some thread
switched from a Python context into a C context, and then died for some
reason without releasing the GIL, causing all of the other threads to
deadlock while they waited for the GIL to be released.
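To illustrate the suspected failure mode, here is a minimal Python
sketch. It is not xend's code and it does not touch the real GIL: an
ordinary threading.Lock stands in for it, and the names
(simulate_abandoned_lock, stand_in_gil) are hypothetical. One thread
takes the lock and exits without releasing it; every later thread then
waits on the same lock. The timeout exists only so the demonstration
finishes; waiters with no timeout would block indefinitely.

```python
import threading


def simulate_abandoned_lock(num_waiters=3, timeout=0.2):
    """Return {waiter name: True if it got the lock} after the holder exits."""
    # Hypothetical stand-in for the GIL: an ordinary lock, not the real one.
    stand_in_gil = threading.Lock()
    results = {}

    def holder():
        # Acquire and return without releasing, like a thread that died
        # while still holding the GIL.
        stand_in_gil.acquire()

    def waiter(name):
        # A real waiter would block forever; the timeout is here only so
        # that this demonstration terminates.
        acquired = stand_in_gil.acquire(timeout=timeout)
        results[name] = acquired
        if acquired:
            stand_in_gil.release()

    dead = threading.Thread(target=holder)
    dead.start()
    dead.join()  # the holder thread is gone, but the lock stays held

    waiters = [threading.Thread(target=waiter, args=("waiter-%d" % i,))
               for i in range(num_waiters)]
    for t in waiters:
        t.start()
    for t in waiters:
        t.join()
    return results


if __name__ == "__main__":
    for name, acquired in sorted(simulate_abandoned_lock().items()):
        print(name, "acquired" if acquired else "still blocked")
```

Because nothing ever releases the lock after the holder exits, no
waiter can make progress, which matches the picture of most threads
sitting in futex(2) on one memory location.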
We are very sorry for the inconvenience this caused for users of VMs on
shadow-moses, and for anyone attempting to use the website during the
outage.
- Evan
For the SIPB XVM team