FYI Netflix is down

George Herbert george.herbert at gmail.com
Mon Jul 2 19:08:28 UTC 2012


Late reply, but:

On Sat, Jun 30, 2012 at 12:30 AM, Lynda <shrdlu at deaddrop.org> wrote:
>...
> Second, and more important. I *was* a "computer science guy" in a past life,
> and this is nonsense. You can have astonishingly large software projects
> that just continue to run smoothly, day in, day out, and they don't hit the
> news, because they don't break. There are data centers that don't hit the
> news, in precisely the same way.

I really need to write the book on IT reliability I keep meaning to.

There's reliability, the backward-looking statistical record, which
can be 100% for a given service or datacenter, and then there's
dependability, the forward-predicted outage risk, which people often
*assert* equals the prior reliability record.  In reality you often
have a number of latent failures (and latent cascade paths) that you
do not understand, did not identify previously, and are not aware of.
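
One rough way to see the gap between a clean record and forward risk
is the standard zero-failure confidence bound (the "rule of three").
The sketch below is only an illustration: the function name and the
365-day example are assumptions, and it treats each day as an
independent trial, which is exactly what common-mode and cascade
failures violate, so real risk can be worse than this bound suggests.

```python
def zero_failure_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the per-trial failure probability
    when zero failures were observed in `trials` independent trials.

    Solves (1 - p) ** trials = 1 - confidence for p.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


if __name__ == "__main__":
    days = 365  # hypothetical: one year with no outage on record
    bound = zero_failure_upper_bound(days)
    # The "rule of three" approximation for the same bound is 3 / trials.
    print(f"95% upper bound on daily outage probability: {bound:.5f}")
    print(f"rule-of-three approximation:                 {3 / days:.5f}")
```

A year of 100% measured uptime is therefore still consistent, at 95%
confidence, with a daily outage probability of a bit under 1%.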

I've had or had to respond to over a billion dollars of cumulative IT
disaster loss over my consulting career so far; I have NEVER seen
anyone who did it perfectly, even the best pros.  And I include myself
in that list.

Looking at other fields like aerospace and nuclear engineering, what
is done in IT is not anywhere close to the same level of QA and
engineering analysis and testing.  We cannot assert better results
with less work.

"Oh, that never happens", except I've had my stuff in three locations
that had catastrophic generator failures.  "Oh, that never happens"
when you're doing power maintenance and the best-rated electrical
company in California, in conjunction with the generator vendor and a
couple of independent power EEs, mis-balance the maintenance generator
loads between legs and blow the generators and datacenter.  "Oh, that
never happens" that the datacenter burns (or starts to burn and then
gets flooded).  "Oh, that never happens" that the FM-200 goes off or
preaction breaks and water leaks.  "Oh, that never happens" that well
maintained and monitored and triple-redundant AC units all trip
offline due to a common mode failure over the course of a weekend and
the room gets up to 106 degrees.  Oh thank god the next thing didn't
go wrong in THAT situation, because the spot temperature meters
indicated that the temperature at ceiling height in that particular
room peaked 1 degree short of the temp at which the sprinkler heads
are supposed to discharge, so we nearly lost that room to flooding
rather than just a 10% disk and 15% power supply attrition over the
next year...

Don't be so confident in the infrastructure.  It's not engineered or
built or maintained well enough to actually support that assertion.
The same can be said of the application software and application
architecture and integration.


-- 
-george william herbert
george.herbert at gmail.com



