[154328] in North American Network Operators' Group

Re: FYI Netflix is down

daemon@ATHENA.MIT.EDU (Leo Bicknell)
Mon Jul 2 12:10:38 2012

Date: Mon, 2 Jul 2012 09:09:09 -0700
From: Leo Bicknell <bicknell@ufp.org>
To: nanog@nanog.org
Mail-Followup-To: nanog@nanog.org
In-Reply-To: <CAB2RJygUQ5ESTDoz6FNOpcvxsi6xUoTnYrDgum_qSGBu88B4Yg@mail.gmail.com>
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org


In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood wrote:
> from the perspective of people watching B-rate movies:  this was a
> failure to implement and test a reliable system for streaming those
> movies in the face of a power outage at one facility.

I want to emphasize _and test_.

Working on infrastructure that is redundant and designed to provide
"100% uptime" (which is impossible, but that's another story) means
there should be confidence that a failure will be automatically
worked around, detected, and reported.

I used to work with a guy who had a simple test for these things,
and if I were a VP at Amazon, Netflix, or any other large company I
would do the same.  About once a month he would walk out on the
floor of the data center and break something.  Pull out an Ethernet
cable.  Unplug a server.  Flip a breaker.

Then he would wait, to see how long before a technician came to fix
it.

If these activities were service impacting to customers, the engineering
or implementation was faulty, and remediation was performed.  Assuming
the systems acted as designed and the customers saw no faults, the team
was graded on how quickly they detected and corrected the outage.
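
To make that grading concrete, here is a minimal Python sketch of how
such a drill could be logged and scored.  It is not what that team
actually ran; the fault list, field names, and verdict strings are all
assumptions for illustration, and the timestamps are simply seconds
noted by whoever runs the drill.

```python
import random
from dataclasses import dataclass

# Hypothetical fault menu, taken from the examples in the text.
FAULTS = ["pull an Ethernet cable", "unplug a server", "flip a breaker"]


@dataclass
class DrillResult:
    fault: str
    customer_impact: bool
    seconds_to_detect: float
    seconds_to_repair: float

    def verdict(self) -> str:
        # Customer-visible impact means the design failed, however fast the fix.
        if self.customer_impact:
            return "FAIL: engineering or implementation needs remediation"
        return (f"PASS: detected in {self.seconds_to_detect:.0f}s, "
                f"repaired in {self.seconds_to_repair:.0f}s")


def pick_fault(rng: random.Random) -> str:
    """Choose what to break at random, so nobody can prepare for it."""
    return rng.choice(FAULTS)


def grade_drill(fault, injected_at, detected_at, repaired_at, customer_impact):
    """Turn the timestamps (in seconds) recorded during a drill into a result."""
    return DrillResult(
        fault=fault,
        customer_impact=customer_impact,
        seconds_to_detect=detected_at - injected_at,
        seconds_to_repair=repaired_at - injected_at,
    )


if __name__ == "__main__":
    fault = pick_fault(random.Random())
    result = grade_drill(fault, 0.0, 120.0, 900.0, customer_impact=False)
    print(fault, "->", result.verdict())
```

The point of the random pick is that nobody gets to choose which parts
are included in the test.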

I've seen too many companies whose "test" is planned months in advance,
and who exclude the parts they think aren't up to scratch from the test.
Then an event occurs, and they fail, and take down customers.

TL;DR If you're not confident your operation could withstand someone
walking into your data center and randomly doing something, you are
NOT redundant.

-- 
       Leo Bicknell - bicknell@ufp.org - CCIE 3440
        PGP keys at http://www.ufp.org/~bicknell/


