FYI Netflix is down
shortdudey123 at gmail.com
Mon Jul 2 16:42:27 UTC 2012
The problem is that large-scale tests take a lot of time and planning. For
them to be done right, you really need a dedicated DR team.
On Mon, Jul 2, 2012 at 11:31 AM, AP NANOG <nanog at armoredpackets.com> wrote:
> This is an excellent example of how tests "should" be run. Unfortunately,
> far too many places don't do this...
> Thank you,
> Robert Miller
> Twitter: @arch3angel
> On 7/2/12 12:09 PM, Leo Bicknell wrote:
>> In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd
>> Underwood wrote:
>>> from the perspective of people watching B-rate movies: this was a
>>> failure to implement and test a reliable system for streaming those
>>> movies in the face of a power outage at one facility.
>> I want to emphasize _and test_.
>> Work on an infrastructure which is redundant and designed to provide
>> "100% uptime" (which is impossible, but that's another story) means
>> that there should be confidence in a failure being automatically
>> worked around, detected, and reported.
>> I used to work with a guy who had a simple test for these things,
>> and if I was a VP at Amazon, Netflix, or any other large company I
>> would do the same. About once a month he would walk out on the
>> floor of the data center and break something. Pull out an Ethernet cable.
>> Unplug a server. Flip a breaker.
>> Then he would wait, to see how long before a technician came to fix it.
>> If these activities were service impacting to customers the engineering
>> or implementation was faulty, and remediation was performed. Assuming
>> they acted as designed and the customers saw no faults the team was
>> graded on how quickly they detected and corrected the outage.
>> I've seen too many companies whose "test" is planned months in advance,
>> and who exclude the parts they think aren't up to scratch from the test.
>> Then an event occurs, and they fail, and take down customers.
>> TL;DR If you're not confident your operation could withstand someone
>> walking into your data center and randomly doing something, you are
>> NOT redundant.