[98131] in North American Network Operators' Group
RE: more on SF outage
daemon@ATHENA.MIT.EDU (Peter Kranz)
Wed Jul 25 02:45:32 2007
Reply-To: <pkranz@unwiredltd.com>
From: "Peter Kranz" <pkranz@unwiredltd.com>
To: <nanog@merit.edu>
In-Reply-To: <200707250226.l6P2Qdb29526@glisan.hevanet.com>
Date: Tue, 24 Jul 2007 21:06:50 -0700
Errors-To: owner-nanog@merit.edu
Once the final analysis of this event is provided, the cause will likely turn
out to be one of the redundant systems failing to handle the event as
designed, because of a software or other low-level fault. It's a very complex
system designed to exceed anything in the region as far as redundancy goes,
but as a result it's got a lot of moving parts and, like the space shuttle,
it can fail unexpectedly. You can bet engineering is scratching their heads and
calling in the vendors to figure out what went wrong. Last time this
occurred it took weeks to pinpoint the root cause.