[74232] in North American Network Operators' Group

Re: Tornados in Ashburn (Equinix affected)

daemon@ATHENA.MIT.EDU (Deepak Jain)
Sat Sep 18 23:30:01 2004

Date: Sat, 18 Sep 2004 23:29:09 -0400
From: Deepak Jain <deepak@ai.net>
To: Sean Donelan <sean@donelan.com>
Cc: nanog@merit.edu
In-Reply-To: <Pine.GSO.4.58.0409182045030.16939@clifden.donelan.com>
Errors-To: owner-nanog-outgoing@merit.edu


> Despite marketing departments, engineers know there will be failures.
> An N+1 design means two faults will result in an interruption.  An N+2
> design means three faults will result in an interruption.  And so on.

The only caveats I want to add are these:

1) No matter what the company, no matter what the design, N+x doesn't
necessarily mean that more than x failures have to occur before there is
an interruption, or that they have to occur simultaneously.

2) Just because a design is believed to be N+x or yN doesn't mean all
single points of failure are really eliminated. N+x or yN implies that
the failures they planned for have to be >(y-1)N or >x. It doesn't mean
that they have planned for every possible failure mode. For example,
static transfer switches can and do fail. Even when they are in pairs,
the coupling and paralleling mechanisms often don't work and aren't easy
to repair or bypass in an emergency.

3) Many new systems [say datacenters built/upgraded in the last 5 years] 
haven't been around long enough to really test 99.999% and above levels 
of availability... many new systems won't start showing problems for 
5-10 years.
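
To put rough numbers on point 3, here is a minimal sketch of the downtime budget that an availability figure implies. The function name and the 365.25-day year are assumptions made for illustration; they don't come from any provider's SLA.

```python
# Assumption: a 365.25-day year. The function name is illustrative only.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability: float, years: float = 1.0) -> float:
    """Minutes of outage permitted at a given availability over `years`."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR * years


if __name__ == "__main__":
    for label, a in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
        print(f"{label}: {downtime_budget_minutes(a):.2f} min/year, "
              f"{downtime_budget_minutes(a, years=5):.2f} min over 5 years")
```

At 99.999% the budget works out to roughly 5.3 minutes a year, or about 26 minutes over five years, so a single half-hour outage already uses more than five years' worth. That is why a few years of clean operation says very little about whether a site really meets that level.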

Specifically in Equinix's case:

1) Good that they [seem] to have maintained partial power.

2) Good that they restored cooling [power to the blowers?] relatively 
quickly. By the graph someone posted and their message, it looks like 
their chillers were on an unaffected system, but their blowers weren't 
[as in, were affected].

3) Good that they seemed able to pull together enough knowledgeable
folks to resolve the problems that did occur relatively quickly.

4) SLA credits. Depending on your contract, this may even be a breach,
unless they can prove that >x or >(y-1)N failures occurred in their
physical plant. The breach angle is only useful if you want to get out
of Equinix/Ash or reduce your commits to it.
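
The >x and >(y-1)N thresholds above can be written down as a small check. This is only a sketch of the counting argument: the function names are hypothetical, and it counts planned-for failures only, so it says nothing about failure modes the design never considered (the transfer-switch case, for instance).

```python
# Hypothetical helpers; names are illustrative, not from any contract or tool.

def tolerated_n_plus_x(x: int) -> int:
    """An N+x design is planned to ride through up to x failures."""
    return x


def tolerated_y_n(n: int, y: int) -> int:
    """A yN design (e.g. 2N) is planned to ride through up to (y-1)*N failures."""
    return (y - 1) * n


def exceeds_plan(failures: int, tolerated: int) -> bool:
    """True when the failure count is beyond what the design planned for."""
    return failures > tolerated


if __name__ == "__main__":
    # N+1: one failure is within plan, two are not.
    print(exceeds_plan(1, tolerated_n_plus_x(1)))
    print(exceeds_plan(2, tolerated_n_plus_x(1)))
    # 2N with N=3: up to 3 failures are within plan.
    print(exceeds_plan(3, tolerated_y_n(3, 2)))
```

If an interruption happened while the failure count was still within the planned tolerance, that is the case where a contract might treat it as more than a credit.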

Deepak Jain
AiNET

