FYI Netflix is down

Grant Ridder shortdudey123 at gmail.com
Mon Jul 2 16:42:27 UTC 2012


The problem is that large-scale tests take a lot of time and planning.  To
do them right, you really need a dedicated DR team.

-Grant

On Mon, Jul 2, 2012 at 11:31 AM, AP NANOG <nanog at armoredpackets.com> wrote:

> This is an excellent example of how tests "should" be run; unfortunately,
> far too many places don't do this...
>
>
> --
>
> Thank you,
>
> Robert Miller
> http://www.armoredpackets.com
>
> Twitter: @arch3angel
>
> On 7/2/12 12:09 PM, Leo Bicknell wrote:
>
>> In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd
>> Underwood wrote:
>>
>>> from the perspective of people watching B-rate movies:  this was a
>>> failure to implement and test a reliable system for streaming those
>>> movies in the face of a power outage at one facility.
>>>
>> I want to emphasize _and test_.
>>
>> Working on an infrastructure that is redundant and designed to provide
>> "100% uptime" (which is impossible, but that's another story) means
>> there should be confidence that a failure will be automatically
>> worked around, detected, and reported.
>>
>> I used to work with a guy who had a simple test for these things,
>> and if I were a VP at Amazon, Netflix, or any other large company I
>> would do the same.  About once a month he would walk out on the
>> floor of the data center and break something.  Pull out an Ethernet
>> cable.  Unplug a server.  Flip a breaker.
>>
>> Then he would wait, to see how long before a technician came to fix
>> it.
>>
>> If these activities were service-impacting to customers, the engineering
>> or implementation was faulty, and remediation was performed.  Assuming
>> they acted as designed and the customers saw no faults, the team was
>> graded on how quickly it detected and corrected the outage.
>>
>> I've seen too many companies whose "test" is planned months in advance,
>> and who exclude the parts they think aren't up to scratch from the test.
>> Then an event occurs, and they fail, and take down customers.
>>
>> TL;DR If you're not confident your operation could withstand someone
>> walking into your data center and randomly doing something, you are
>> NOT redundant.
>>
>>
>


