[154340] in North American Network Operators' Group


Re: FYI Netflix is down

daemon@ATHENA.MIT.EDU (Paul Graydon)
Mon Jul 2 15:01:33 2012

Date: Mon, 02 Jul 2012 08:59:57 -1000
From: Paul Graydon <paul@paulgraydon.co.uk>
To: nanog@nanog.org
In-Reply-To: <CAG9TQPm4_J-c219GLVxRcrNbz4kM0oaTrU5-PicCZXUiPZhFSw@mail.gmail.com>
X-SA-Exim-Mail-From: paul@paulgraydon.co.uk
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org

On 07/02/2012 08:53 AM, Tony McCrory wrote:
> On 2 July 2012 19:20, Cameron Byrne <cb.list6@gmail.com> wrote:
>
>> Make your chaos animal go after sites and regions instead of individual
>> VMs.
>>
>> CB
>>
> From a previous post-mortem:
> http://techblog.netflix.com/2011_04_01_archive.html
>
> "
> Create More Failures
> Currently, Netflix uses a service called "Chaos
> Monkey<http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html>"
> to simulate service failure. Basically, Chaos Monkey is a service that
> kills other services. We run this service because we want engineering teams
> to be used to a constant level of failure in the cloud. Services should
> automatically recover without any manual intervention. We don't however,
> simulate what happens when an entire AZ goes down and therefore we haven't
> engineered our systems to automatically deal with those sorts of failures.
> Internally we are having discussions about doing that and people are
> already starting to call this service "Chaos Gorilla".
> "
>
> It would seem the Gorilla hasn't quite matured.
>
> Tony
From conversations with Adrian Cockcroft this weekend, it wasn't the
result of Chaos Gorilla or Chaos Monkey failing to prepare them
adequately.  All their automated stuff worked perfectly, and the
infrastructure tried to self-heal.  The problem was that, yet again,
Amazon's back-plane / control-plane was unable to cope with the
requests.  Netflix uses Amazon's ELB to balance its traffic, and with
no back-plane they were unable to reconfigure it to route around the
problem.

Paul

