[154339] in North American Network Operators' Group

Re: FYI Netflix is down

daemon@ATHENA.MIT.EDU (Tony McCrory)
Mon Jul 2 14:54:28 2012

In-Reply-To: <CAD6AjGQjBHBxCgPEFkqitDcGT16GfsCAc=MBz_dacKP1Z6K9oQ@mail.gmail.com>
Date: Mon, 2 Jul 2012 19:53:32 +0100
From: Tony McCrory <tony.mccrory@gmail.com>
To: Cameron Byrne <cb.list6@gmail.com>
Cc: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org

On 2 July 2012 19:20, Cameron Byrne <cb.list6@gmail.com> wrote:

>
> Make your chaos animal go after sites and regions instead of individual
> VMs.
>
> CB
>

From a previous post-mortem:
http://techblog.netflix.com/2011_04_01_archive.html

"
Create More Failures
Currently, Netflix uses a service called "Chaos Monkey"
<http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html>
to simulate service failure. Basically, Chaos Monkey is a service that
kills other services. We run this service because we want engineering teams
to be used to a constant level of failure in the cloud. Services should
automatically recover without any manual intervention. We don't however,
simulate what happens when an entire AZ goes down and therefore we haven't
engineered our systems to automatically deal with those sorts of failures.
Internally we are having discussions about doing that and people are
already starting to call this service "Chaos Gorilla".
"

It would seem the Gorilla hasn't quite matured.

Tony
