FYI Netflix is down

Greg D. Moore mooregr at greenms.com
Mon Jul 2 19:43:29 UTC 2012


At 03:08 PM 7/2/2012, George Herbert wrote:

If folks have not read it, I would suggest reading Normal Accidents 
by Charles Perrow.

The "it can't happen" is almost guaranteed to happen. ;-)  And when 
it does, it'll often interact in ways we can't predict or sometimes 
even understand.

As for pulling the plug to test stuff: I recall a demo at NetApp in 
the early '00s.  They were talking about their fault tolerance and 
how great it was.  So I walked up to their demo array and said, "So, 
it shouldn't be a problem if I pulled this drive right here?"  Before 
I could, the salesperson or tech guy (can't remember which) told me to 
stop.  He didn't want to risk it.

That right there said loads about their confidence in their own system.



>Late reply, but:
>
>On Sat, Jun 30, 2012 at 12:30 AM, Lynda <shrdlu at deaddrop.org> wrote:
> >...
> > Second, and more important. I *was* a "computer science guy" in a past life,
> > and this is nonsense. You can have astonishingly large software projects
> > that just continue to run smoothly, day in, day out, and they don't hit the
> > news, because they don't break. There are data centers that don't hit the
> > news, in precisely the same way.
>
>I really need to write the book on IT reliability I keep meaning to.
>
>There's reliability - backwards looking statistical, which can be 100%
>for a given service or datacenter - and then there's dependability,
>forwards-predicted outage risks, which people often *assert* equals
>the prior reliability record, but in reality you often have a number
>of latent failures (and latent cascade paths) that you do not
>understand, did not identify previously, and are not aware of.
>
>I've had or had to respond to over a billion dollars of cumulative IT
>disaster loss over my consulting career so far; I have NEVER seen
>anyone who did it perfect, even the best pros.  And I include myself
>in that list.
>
>Looking at other fields like aerospace and nuclear engineering, what
>is done in IT is not anywhere close to the same level of QA and
>engineering analysis and testing.  We cannot assert better results
>with less work.
>
>"Oh, that never happens", except I've had my stuff in three locations
>that had catastrophic generator failures.  "Oh, that never happens"
>when you're doing power maintenance and the best-rated electrical
>company in California, in conjunction with the generator vendor and a
>couple of independent power EEs, mis-balance the maintenance generator
>loads between legs and blow the generators and datacenter.  "Oh, that
>never happens" that the datacenter burns (or starts to burn and then
>gets flooded).  "Oh, that never happens" that the FM-200 goes off or
>preaction breaks and water leaks.  "Oh, that never happens" that well
>maintained and monitored and triple-redundant AC units all trip
>offline due to a common mode failure over the course of a weekend and
>the room gets up to 106 degrees.  Oh thank god the next thing didn't
>go wrong in THAT situation, because the spot temperature meters
>indicated that the ceiling height of that particular room peaked at 1
>degree short of the temp at which the sprinkler heads are supposed to
>discharge, so we nearly lost that room to flooding rather than just a
>10% disk and 15% power supply attrition over the next year...
>
>Don't be so confident in the infrastructure.  It's not engineered or
>built or maintained well enough to actually support that assertion.
>The same can be said of the application software and application
>architecture and integration.
>
>
>--
>-george william herbert
>george.herbert at gmail.com

Greg D. Moore
http://greenmountainsoftware.wordpress.com/
CEO QuiCR: Quick, Crowdsourced Responses. http://www.quicr.net






