Re: FYI Netflix is down

daemon@ATHENA.MIT.EDU (Rayson Ho)
Mon Jul 9 11:50:58 2012

In-Reply-To: <CAK4no04vboQ9-XWmB=TR1XhWWARvT8kEJbLPy8Z2DoCsoxHFsQ@mail.gmail.com>
Date: Mon, 9 Jul 2012 11:50:06 -0400
From: Rayson Ho <raysonlogin@gmail.com>
To: "steve pirk [egrep]" <steve@pirk.com>
Cc: Ryan Malayter <malayter@gmail.com>, nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org

On Sun, Jul 8, 2012 at 8:27 PM, steve pirk [egrep] <steve@pirk.com> wrote:
> I am pretty sure Netflix and others were "trying to do it right", as they
> all had graceful fail-over to a secondary AWS zone defined.
> It looks to me like Amazon uses DNS round-robin to load balance the zones:
> in their postmortem they mention returning a "list" of addresses for DNS
> queries, and that explains why the services failed to shunt over to the
> other zones.
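
The round-robin part is easy to see from any client: a resolver hands
back the full list of A records for a name, not just one. A minimal
Python sketch (the hostname is a placeholder, not a real AWS end-point):

    import socket

    # Ask for every IPv4 A record behind one name. With DNS round-robin
    # the resolver hands back the whole set, usually rotated per query.
    HOST = "elb.example.com"  # placeholder, not a real AWS end-point
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            HOST, 80, socket.AF_INET, socket.SOCK_STREAM):
        print(sockaddr[0])  # one address per zone end-point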

The AWS outage also uncovered bugs on the Netflix side:

"Lessons Netflix Learned from the AWS Storm"

http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html

For an infrastructure this large, whether you are running your own
datacenter or using the cloud, it is certain that the code is not
bug-free. And another thing: when everything is heavily automated, a
failure in one component can trigger bugs in areas that no one has
ever thought of...

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/

>> Elastic Load Balancers (ELBs) allow web traffic directed at a single IP
>> address to be spread across many EC2 instances. They are a tool for high
>> availability as traffic to a single end-point can be handled by many
>> redundant servers. ELBs live in individual Availability Zones and front
>> EC2 instances in those same zones or in other Availability Zones.
>
>> ELBs can also be deployed in multiple Availability Zones. In this
>> configuration, each Availability Zone's end-point will have a separate IP
>> address. A single Domain Name will point to all of the end-points' IP
>> addresses. When a client, such as a web browser, queries DNS with a Domain
>> Name, it receives the IP address ("A") records of all of the ELBs in
>> random order. While some clients only process a single IP address, many
>> (such as newer versions of web-browsers) will retry the subsequent IP
>> addresses if they fail to connect to the first. A large number of
>> non-browser clients only operate with a single IP address.
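
That retry-the-next-address behavior is the whole failover story for
multi-zone ELBs, so it is worth spelling out. A rough Python sketch of
what a list-walking client does (host and port are placeholders, not a
real AWS end-point):

    import socket

    def connect_any(host, port, timeout=3.0):
        # Walk the A records in the order the resolver returned them;
        # a client that stops at the first address has no fallback.
        last_err = None
        for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
                host, port, socket.AF_INET, socket.SOCK_STREAM):
            try:
                # sockaddr is an (ip, port) pair for this end-point
                return socket.create_connection(sockaddr, timeout)
            except OSError as err:  # refused, unreachable, timed out
                last_err = err
        raise last_err if last_err else OSError("no A records for " + host)

    # Usage (placeholder name):
    # sock = connect_any("elb.example.com", 80)
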
>> During the disruption this past Friday night, the control plane (which
>> encompasses calls to add a new ELB, scale an ELB, add EC2 instances to an
>> ELB, and remove traffic from ELBs) began performing traffic shifts to
>> account for the loss of load balancers in the affected Availability Zone.
>> As the power and systems returned, a large number of ELBs came up in a
>> state which triggered a bug we hadn't seen before. The bug caused the ELB
>> control plane to attempt to scale these ELBs to larger ELB instance sizes.
>> This resulted in a sudden flood of requests which began to backlog the
>> control plane. At the same time, customers began launching new EC2
>> instances to replace capacity lost in the impacted Availability Zone,
>> requesting the instances be added to existing load balancers in the other
>> zones. These requests further increased the ELB control plane backlog.
>> Because the ELB control plane currently manages requests for the US
>> East-1 Region through a shared queue, it fell increasingly behind in
>> processing these requests; and pretty soon, these requests started taking
>> a very long time to complete.
>>
>  http://aws.amazon.com/message/67457/
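
That shared queue is the detail that bites: with one region-wide FIFO,
a flood of scale-up calls from the recovering zone head-of-line blocks
every unrelated request. A toy back-of-the-envelope model in Python
(all counts and costs are invented for illustration):

    # Toy model of head-of-line blocking in one shared control-plane
    # queue. All numbers are invented for illustration.
    flood  = [5.0] * 200   # recovery flood: 200 scale-ELB calls, ~5s each
    normal = [1.0] * 10    # ordinary customer calls, ~1s each

    # One shared FIFO for the whole region: ordinary work queues behind
    # the entire flood.
    clock = 0.0
    for cost in flood + normal:
        clock += cost
    print("shared queue: last ordinary call done at t=%.0fs" % clock)

    # Per-zone (or per-customer) queues would isolate the flood:
    print("isolated queue: last ordinary call done at t=%.0fs" % sum(normal))
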
>
>
>> *In reality, though, Amazon data centers have outages all the time. In
>> fact, Amazon tells its customers to plan for this to happen, and to be
>> ready to roll over to a new data center whenever there's an outage.*
>>
>> *That's what was supposed to happen at Netflix Friday night. But it
>> didn't work out that way. According to Twitter messages from Netflix
>> Director of Cloud Architecture Adrian Cockcroft and Instagram Engineer
>> Rick Branson, it looks like an Amazon Elastic Load Balancing service,
>> designed to spread Netflix's processing loads across data centers, failed
>> during the outage. Without that ELB service working properly, the Netflix
>> and Pinterest services hosted by Amazon crashed.*
>
>  http://www.wired.com/wiredenterprise/2012/06/real-clouds-crush-amazon/
>
> I am a big believer in using hardware to load balance data centers, rather
> than leaving it up to software in the data center, which might fail.
>
> Speaking of services like RightScale, Google announced Compute Engine at
> Google I/O this year. BuildFax was an early adopter, and they gave it great
> reviews...
> http://www.youtube.com/watch?v=LCjSJ778tGU
>
> It looks like Google has entered the VPS market. 'bout time... ;-]
> http://cloud.google.com/products/compute-engine.html
>
> --steve pirk

