FYI Netflix is down

Dan Golding dgolding at ragingwire.com
Mon Jul 2 19:25:54 UTC 2012



> -----Original Message-----
> From: Leo Bicknell [mailto:bicknell at ufp.org]
> 

> 
> I want to emphasize _and test_.
[snip]
> 
> I used to work with a guy who had a simple test for these things, and
> if I was a VP at Amazon, Netflix, or any other large company I would do
> the same.  About once a month he would walk out on the floor of the
> data center and break something.  Pull out an ethernet.
> Unplug a server.  Flip a breaker.
> 

*DING DING* - we have a winner! In a previous life, I used to spend a
lot of time in other people's data centers. The key question to ask was
how often they pulled the plug - i.e. disconnected utility power without
having backup generators running. Simulating an actual failure. That
goes for pulling out an Ethernet cord or unplugging a server, or
flipping a breaker. It's all the same. The problem is that if you don't
do this for a while, you get SCARED of doing it, and you stop doing it.
The longer you go without, the scarier it gets, to the point where you
will never do it, because you have no idea what will happen, other than
you probably getting fired. This is called "horrible engineering
management", and is very common.

The other problem, of course, is that people design under the assumption
that everything will always work, and that failure modes, when they
occur, are predictable and fall into a narrow set. Multiple failure
modes? Not tested. Failure modes including operator error? Never tested.


When was the last time you had a drill?

- Dan


> Then he would wait, to see how long before a technician came to fix it.
> 
> If these activities were service impacting to customers the engineering
> or implementation was faulty, and remediation was performed.  Assuming
> they acted as designed and the customers saw no faults the team was
> graded on how quickly they detected and corrected the outage.
> 
> I've seen too many companies whose "test" is planned months in advance,
> and who exclude the parts they think aren't up to scratch from the test.
> Then an event occurs, and they fail, and take down customers.
> 
> TL;DR If you're not confident your operation could withstand someone
> walking into your data center and randomly doing something, you are NOT
> redundant.
> 
> --
>        Leo Bicknell - bicknell at ufp.org - CCIE 3440
>         PGP keys at http://www.ufp.org/~bicknell/



