FYI Netflix is down

AP NANOG nanog at armoredpackets.com
Mon Jul 2 19:32:52 UTC 2012


I believe in my dictionary Chaos Gorilla translates into "Time To Go
Home", with a rough definition of "Everything just crapped out - The
world is ending"; but then again I may have that incorrect :-)

-- 

Thank you,

Robert Miller
http://www.armoredpackets.com

Twitter: @arch3angel

On 7/2/12 2:59 PM, Paul Graydon wrote:
> On 07/02/2012 08:53 AM, Tony McCrory wrote:
>> On 2 July 2012 19:20, Cameron Byrne <cb.list6 at gmail.com> wrote:
>>
>>> Make your chaos animal go after sites and regions instead of individual
>>> VMs.
>>>
>>> CB
>>>
>>  From a previous post mortem
>> http://techblog.netflix.com/2011_04_01_archive.html
>>
>> "
>> Create More Failures
>> Currently, Netflix uses a service called "Chaos
>> Monkey<http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html>" 
>>
>> to simulate service failure. Basically, Chaos Monkey is a service that
>> kills other services. We run this service because we want engineering
>> teams to be used to a constant level of failure in the cloud. Services
>> should automatically recover without any manual intervention. We don't
>> however, simulate what happens when an entire AZ goes down and therefore
>> we haven't engineered our systems to automatically deal with those sorts
>> of failures.
>> Internally we are having discussions about doing that and people are
>> already starting to call this service "Chaos Gorilla".
>> "
>>
>> It would seem the Gorilla hasn't quite matured.
>>
>> Tony
> From conversations with Adrian Cockcroft this weekend it wasn't the 
> result of Chaos Gorilla or Chaos Monkey failing to prepare them 
> adequately.  All their automated stuff worked perfectly, the 
> infrastructure tried to self heal.  The problem was that yet again 
> Amazon's back-plane / control-plane was unable to cope with the 
> requests.  Netflix uses Amazon's ELB to balance the traffic and no 
> back-plane meant they were unable to reconfigure it to route around 
> the problem.
>
> Paul
>
>




