FYI Netflix is down
George Herbert
george.herbert at gmail.com
Mon Jul 2 21:04:08 UTC 2012
On Mon, Jul 2, 2012 at 12:43 PM, Greg D. Moore <mooregr at greenms.com> wrote:
> At 03:08 PM 7/2/2012, George Herbert wrote:
>
> If folks have not read it, I would suggest reading Normal Accidents by
> Charles Perrow.
>
> The "it can't happen" is almost guaranteed to happen. ;-) And when it does,
> it'll often interact in ways we can't predict or sometimes even understand.
Seconded.
There are also aerospace and nuclear and failure analysis books which
are good, but I often encourage people to start with that one.
> As for pulling the plug to test stuff. I recall a demo at Netapps in the
> early 00's. They were talking about their fault tolerance and how great it
> was. So I walked up to their demo array and said, "So, it shouldn't be a
> problem if I pulled this drive right here?" Before I could, the salesperson
> or tech guy, can't remember, told me to stop. He didn't want to risk it.
>
> That right there said loads about their confidence in their own system.
I worked for a Sun clone vendor (Axil) for a while and took some of
our systems and storage to Comdex one year in the 90s. We had a RAID
unit (Mylex controller) we had just introduced. Beforehand, I made
REALLY REALLY SURE that the pull-the-disk and pull-the-redundant-power
tricks worked. And showed them to people with the "Please keep in
mind that this voids the warranty, but here we *rip* go...". All of
the other server vendors were giving me dirty looks for that one.
Apparently I sold a few systems that way.
You have to watch for connector wear-out and things like that, but ...
On all the clusters I've built, I've insisted on a plug-pull test of
all the major components during burn-in. Those tests caught problems
from time to time. Especially with N+1: if it is really N+0 due to a
bug or flaw, you need to know that...
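
To make the N+1 vs. N+0 point concrete, here is a rough sketch of the
check a plug-pull test is really performing. Everything in it is made
up for illustration (the function name, the supply names, the wattages);
it is not taken from any real system, and a real test pulls real plugs.

```python
def plug_pull_failures(capacities, required_load):
    """Return the components whose removal leaves too little capacity.

    capacities: dict of component name -> capacity it actually delivers
    required_load: capacity the system needs to keep running
    """
    total = sum(capacities.values())
    return [name for name, cap in capacities.items()
            if total - cap < required_load]


# Hypothetical example: three 500 W supplies, and the load needs 1000 W.
healthy = {"psu_a": 500, "psu_b": 500, "psu_c": 500}
print(plug_pull_failures(healthy, 1000))   # nothing fails: true N+1

# Same box, but psu_c is installed and delivers nothing (bug or flaw).
# On paper this is still N+1; in practice it is N+0.
faulty = {"psu_a": 500, "psu_b": 500, "psu_c": 0}
print(plug_pull_failures(faulty, 1000))    # pulling psu_a or psu_b fails
```

The catch is that the capacity each component "actually delivers" is
exactly what you can't read off a spec sheet, which is why the pull has
to be done for real during burn-in.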
--
-george william herbert
george.herbert at gmail.com