FYI Netflix is down

AP NANOG nanog at
Mon Jul 2 19:32:52 UTC 2012

I believe in my dictionary Chaos Gorilla translates into "Time To Go 
Home", with a rough definition of "Everything just crapped out - The 
world is ending"; but then again I may have that incorrect :-)


Thank you,

Robert Miller

Twitter: @arch3angel

On 7/2/12 2:59 PM, Paul Graydon wrote:
> On 07/02/2012 08:53 AM, Tony McCrory wrote:
>> On 2 July 2012 19:20, Cameron Byrne <cb.list6 at> wrote:
>>> Make your chaos animal go after sites and regions instead of individual
>>> VMs.
>>> CB
>>  From a previous post mortem
>> "
>> Create More Failures
>> Currently, Netflix uses a service called "Chaos Monkey"
>> to simulate service failure. Basically, Chaos Monkey is a service that
>> kills other services. We run this service because we want engineering teams
>> to be used to a constant level of failure in the cloud. Services should
>> automatically recover without any manual intervention. We don't however,
>> simulate what happens when an entire AZ goes down and therefore we haven't
>> engineered our systems to automatically deal with those sorts of failures.
>> Internally we are having discussions about doing that and people are
>> already starting to call this service "Chaos Gorilla".
>> "
>> It would seem the Gorilla hasn't quite matured.
>> Tony
> From conversations with Adrian Cockcroft this weekend it wasn't the 
> result of Chaos Gorilla or Chaos Monkey failing to prepare them 
> adequately.  All their automated stuff worked perfectly, the 
> infrastructure tried to self heal.  The problem was that yet again 
> Amazon's back-plane / control-plane was unable to cope with the 
> requests.  Netflix uses Amazon's ELB to balance the traffic and no 
> back-plane meant they were unable to reconfigure it to route around 
> the problem.
> Paul
