FYI Netflix is down

AP NANOG nanog at armoredpackets.com
Mon Jul 2 16:31:26 UTC 2012


This is an excellent example of how tests "should" be run. Unfortunately,
far too many places don't do this.

-- 

Thank you,

Robert Miller
http://www.armoredpackets.com

Twitter: @arch3angel

On 7/2/12 12:09 PM, Leo Bicknell wrote:
> In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood wrote:
>> from the perspective of people watching B-rate movies:  this was a
>> failure to implement and test a reliable system for streaming those
>> movies in the face of a power outage at one facility.
> I want to emphasize _and test_.
>
> Working on an infrastructure that is redundant and designed to provide
> "100% uptime" (which is impossible, but that's another story) means
> there should be confidence that a failure will be automatically
> worked around, detected, and reported.
>
> I used to work with a guy who had a simple test for these things,
> and if I were a VP at Amazon, Netflix, or any other large company I
> would do the same.  About once a month he would walk out on the
> floor of the data center and break something.  Pull out an Ethernet
> cable.  Unplug a server.  Flip a breaker.
>
> Then he would wait, to see how long before a technician came to fix
> it.
>
> If these activities were service impacting to customers, the engineering
> or implementation was faulty, and remediation was performed.  Assuming
> they acted as designed and the customers saw no faults, the team was
> graded on how quickly they detected and corrected the outage.
>
> I've seen too many companies whose "test" is planned months in advance,
> and who exclude the parts they think aren't up to scratch from the test.
> Then an event occurs, and they fail, and take down customers.
>
> TL;DR If you're not confident your operation could withstand someone
> walking into your data center and randomly doing something, you are
> NOT redundant.
>
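
The drill described in the quoted message (break something at random, then
grade the result) could be recorded roughly as follows. This is only a
minimal sketch in Python: the DrillResult fields, the pick_target and grade
helpers, and the 5-minute / 30-minute targets are all assumptions made for
illustration, not anything stated in the thread.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DrillResult:
    """Outcome of one unannounced failure drill (hypothetical record)."""
    component: str                      # what was pulled, unplugged, or flipped
    customer_impact: bool               # did customers see a fault?
    seconds_to_detect: Optional[float]  # None if nobody noticed
    seconds_to_fix: Optional[float]     # None if it was never corrected


def pick_target(components: Sequence[str],
                rng: Optional[random.Random] = None) -> str:
    """Choose the component to break at random, so nobody can plan around it."""
    if rng is None:
        rng = random.Random()
    return rng.choice(list(components))


def grade(result: DrillResult,
          detect_target_s: float = 300.0,   # assumed target: 5 minutes
          fix_target_s: float = 1800.0) -> str:  # assumed target: 30 minutes
    """Apply the two-step rule from the message.

    Customer impact means the engineering or implementation was faulty.
    Otherwise the team is graded on how quickly it detected and fixed it.
    """
    if result.customer_impact:
        return "design fault: remediate"
    if result.seconds_to_detect is None:
        return "fail: never detected"
    if result.seconds_to_fix is None:
        return "fail: detected but not fixed"
    if (result.seconds_to_detect <= detect_target_s
            and result.seconds_to_fix <= fix_target_s):
        return "pass"
    return "slow: review monitoring and response"
```

The point of choosing the target at random is the same as in the message:
a test that excludes the weak parts in advance tells you nothing about them.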



