FYI Netflix is down

George Herbert george.herbert at gmail.com
Mon Jul 2 21:04:08 UTC 2012


On Mon, Jul 2, 2012 at 12:43 PM, Greg D. Moore <mooregr at greenms.com> wrote:
> At 03:08 PM 7/2/2012, George Herbert wrote:
>
> If folks have not read it, I would suggest reading Normal Accidents by
> Charles Perrow.
>
> The "it can't happen" is almost guaranteed to happen. ;-)  And when it does,
> it'll often interact in ways we can't predict or sometimes even understand.

Seconded.

There are also aerospace and nuclear and failure analysis books which
are good, but I often encourage people to start with that one.

> As for pulling the plug to test stuff. I recall a demo at NetApp in the
> early 00's.  They were talking about their fault tolerance and how great it
> was.  So I walked up to their demo array and said, "So, it shouldn't be a
> problem if I pulled this drive right here?"  Before I could, the salesperson
> or tech guy, can't remember, told me to stop.  He didn't want to risk it.
>
> That right there said loads about their confidence in their own system.

I worked for a Sun clone vendor (Axil) for a while and took some of
our systems and storage to Comdex one year in the 90s.  We had a RAID
unit (Mylex controller) we had just introduced.  Beforehand, I made
REALLY REALLY SURE that the pull-the-disk and pull-the-redundant-power
tricks worked.  And showed them to people with the "Please keep in
mind that this voids the warranty, but here we *rip* go...".  All of
the other server vendors were giving me dirty looks for that one.
Apparently I sold a few systems that way.

You have to watch for connector wear-out and things like that, but ...

On all the clusters I've built, I've insisted on a plug-pull test of
every major component during burn-in.  We caught things with those
from time to time.  Especially with N+1: if it is really N+0 due to a
bug or flaw, you need to know that...
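
To make the N+1 versus N+0 point concrete, here is a rough sketch of the
arithmetic only, not of any real system.  The function name, the unit
capacities and the load figure are all made up for illustration.  It
counts how many units can fail, worst case, before the survivors can no
longer carry the load:

```python
def spare_units(capacities, required_load):
    """Count how many units can fail, worst case, while the survivors
    still carry required_load.

    Hypothetical helper for illustration: capacities and required_load
    are in the same arbitrary unit (watts, for example).
    Returns -1 if even the full set cannot carry the load.
    """
    remaining = sorted(capacities)  # ascending, so pop() removes the largest
    if sum(remaining) < required_load:
        return -1
    spares = 0
    while remaining:
        remaining.pop()  # worst case: lose the biggest remaining unit
        if sum(remaining) < required_load:
            break
        spares += 1
    return spares


# Three healthy 500 W supplies feeding a 1000 W load: N+1 as designed.
print(spare_units([500, 500, 500], 1000))

# Same design, but one supply only delivers 400 W because of a flaw:
# the box is really N+0, and only a pull test will show it.
print(spare_units([500, 500, 400], 1000))
```

The calculation is the easy part.  The delivered capacity of each unit
is what the plug-pull test actually measures, which is why the test has
to be done on the real hardware.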


-- 
-george william herbert
george.herbert at gmail.com



