[154350] in North American Network Operators' Group


Re: FYI Netflix is down

daemon@ATHENA.MIT.EDU (George Herbert)
Mon Jul 2 17:05:07 2012

In-Reply-To: <E1SlmXb-0005DF-5u@mailman.nanog.org>
Date: Mon, 2 Jul 2012 14:04:08 -0700
From: George Herbert <george.herbert@gmail.com>
To: "Greg D. Moore" <mooregr@greenms.com>
Cc: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org

On Mon, Jul 2, 2012 at 12:43 PM, Greg D. Moore <mooregr@greenms.com> wrote:
> At 03:08 PM 7/2/2012, George Herbert wrote:
>
> If folks have not read it, I would suggest reading Normal Accidents by
> Charles Perrow.
>
> The "it can't happen" is almost guaranteed to happen. ;-)  And when it does,
> it'll often interact in ways we can't predict or sometimes even understand.

Seconded.

There are also aerospace and nuclear and failure analysis books which
are good, but I often encourage people to start with that one.

> As for pulling the plug to test stuff. I recall a demo at Netapps in the
> early 00's.  They were talking about their fault tolerance and how great it
> was.  So I walked up to their demo array and said, "So, it shouldn't be a
> problem if I pulled this drive right here?"  Before I could the salesperson
> or tech guy, can't remember, told me to stop.  He didn't want to risk it.
>
> That right there said loads about their confidence in their own system.

I worked for a Sun clone vendor (Axil) for a while and took some of
our systems and storage to Comdex one year in the 90s.  We had a RAID
unit (Mylex controller) we had just introduced.  Beforehand, I made
REALLY REALLY SURE that the pull-the-disk and pull-the-redundant-power
tricks worked.  And showed them to people with the "Please keep in
mind that this voids the warranty, but here we *rip* go...".  All of
the other server vendors were giving me dirty looks for that one.
Apparently I sold a few systems that way.

You have to watch for connector wear-out and things like that, but ...

On all the clusters I've built, I've insisted on a plug-pull test of
all the major components during burn-in.  We caught things with those
from time to time.  Especially with N+1: if it is really N+0 due to a
bug or flaw, you need to know that...


-- 
-george william herbert
george.herbert@gmail.com

