FYI Netflix is down

Mike Devlin mdevlin at aisle10.net
Sat Jun 30 19:42:47 UTC 2012


The last 2 Amazon outages were power issues isolated to just their us-east
Virginia data center. I read somewhere that Amazon has something like 70%
of their ec2 resources in Virginia and it's also their oldest ec2
datacenter, so I am guessing they learned a lot of lessons and are stuck
with an aged infrastructure there.

I think the real problem here is that a large subset of the customers using
ec2 misunderstand the redundancy that is built into the Amazon
architecture. You are essentially supposed to view individual virtual
machines as being entirely disposable and make duplicates of everything
across availability zones and for extra points across regions.
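
As a rough illustration of that idea (not anything Amazon or Netflix
actually runs), here is a minimal Python sketch that spreads replicas of
each service across zones and checks which services would survive losing
one zone. The helper names and the service list are made up for the
example, and the zone labels are only illustrative; it makes no AWS API
calls.

```python
def place_replicas(services, zones, replicas):
    """Assign each service `replicas` copies in distinct zones.

    Hypothetical helper: each service starts at a different zone offset
    so load is spread out. Copies land in distinct zones only when
    replicas <= len(zones).
    """
    return {
        service: [zones[(i + k) % len(zones)] for k in range(replicas)]
        for i, service in enumerate(services)
    }


def survives_zone_loss(placement, failed_zone):
    """Return, per service, whether any copy lives outside the failed zone."""
    return {
        service: any(zone != failed_zone for zone in service_zones)
        for service, service_zones in placement.items()
    }


# Illustrative zone labels and services (assumptions, not real inventory).
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

redundant = place_replicas(["web", "db"], zones, replicas=2)
single = place_replicas(["web", "db"], zones, replicas=1)

for failed in zones:
    print(failed, "redundant:", survives_zone_loss(redundant, failed))
    print(failed, "single:   ", survives_zone_loss(single, failed))
```

With two copies per service, every service stays up no matter which
single zone fails; with one copy, whichever zone holds it takes the
service down with it. Covering a whole-region outage would need the same
treatment one level up.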

Most people instead think that the 2 cents/hour price tag is a massive cost
savings and the cloud is invincible. Look at the SLA for ec2: Amazon
basically doesn't consider it a real outage unless more than one
availability zone is down.

What's more surprising is that Netflix was so affected by a single
availability zone outage. They are constantly talking about their chaos
monkey/simian army tool that purposely breaks random parts of their
infrastructure to prove it's fault tolerant, or to point out weaknesses to
fix
(http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html).


I think the closest thing to a cascading failure they have had was the
4/29/11 outage (http://aws.amazon.com/message/65648/).


Mike


On Jun 30, 2012 3:05 PM, "Todd Underwood" <toddunder at gmail.com> wrote:

> This was not a cascading failure.  It was a simple power outage
>
> Cascading failures involve interdependencies among components.
>
> T
> On Jun 30, 2012 2:21 PM, "Seth Mattinen" <sethm at rollernet.us> wrote:
>
> > On 6/30/12 9:25 AM, Todd Underwood wrote:
> > >
> > > On Jun 30, 2012 11:23 AM, "Seth Mattinen" <sethm at rollernet.us
> > > <mailto:sethm at rollernet.us>> wrote:
> > >>
> > >>
> > >> But haven't they all been cascading failures?
> > >
> > > No.  They have not.  That's not what that term means.
> > >
> > > 'Cascading failure' has a fairly specific meaning that doesn't imply
> > > resilience in the face of decomposition into smaller parts.  Cascading
> > > failures can occur even when a system is decomposed into small parts,
> > > each of which is apparently well run.
> > >
> >
> >
> > I honestly have no idea how to parse that since it doesn't jive with my
> > practical view of a cascading failure.
> >
> > ~Seth
> >
> >
>


