[154323] in North American Network Operators' Group


Re: FYI Netflix is down

daemon@ATHENA.MIT.EDU (Todd Underwood)
Mon Jul 2 11:31:47 2012

In-Reply-To: <1C7B96053DD7814496A0D1E71661B68302CF5B79@SMF-ENTXM-001.sac.ragingwire.net>
From: Todd Underwood <toddunder@gmail.com>
Date: Mon, 2 Jul 2012 11:30:06 -0400
To: Dan Golding <dgolding@ragingwire.com>
Cc: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org

> Actually, it was a very complex power outage. I'm going to assume that
> what happened this weekend was similar to the event that happened at
> the same facility approximately two weeks ago (it's immaterial - the
> details are probably different, but it illustrates the complexity of a
> data center failure)
>
> Utility Power Failed
> First Backup Generator Failed (shut down due to a faulty fan)
> Second Backup Generator Failed (breaker coordination problem resulting
> in faulty trip of a breaker)
>
> In this case, it was clearly a cascading failure, although only limited
> in scope. The failure in this case also clearly involved people. There
> was one material failure (the fan), but the system should have been
> resilient enough to deal with it. The system should also have been
> resilient enough to deal with the breaker coordination issue (which
> should not have occurred), but was not. Data centers are not
> commodities. There is a way to engineer these facilities to be much
> more resilient. Not everyone's business model supports it.

ok, i give in.  at some level of granularity everything is a cascading
failure (since molecules collide and the world is an infinite chain of
causation in which human free will is merely a myth </Spinoza>)

of course, this use of 'cascading' is vacuous and not useful anymore
since it applies to nearly every failure, but i'll go along with it.

from the perspective of a datacenter power engineer, this was a
cascading failure of a small number of components.

from the perspective of every datacenter customer:  this was a power
failure.

from the perspective of people watching B-rate movies:  this was a
failure to implement and test a reliable system for streaming those
movies in the face of a power outage at one facility.

from the perspective of nanog mailing list readers:  this was an
interesting opportunity to speculate about failures about which we
have no data (as usual!).

can we all agree on those facts?

:-)

t

