FYI Netflix is down

Greg D. Moore mooregr at greenms.com
Mon Jul 2 21:15:10 UTC 2012


At 05:04 PM 7/2/2012, George Herbert wrote:
>On Mon, Jul 2, 2012 at 12:43 PM, Greg D. Moore <mooregr at greenms.com> wrote:
> > At 03:08 PM 7/2/2012, George Herbert wrote:
> >
> > If folks have not read it, I would suggest reading Normal Accidents by
> > Charles Perrow.
> >
> > The "it can't happen" is almost guaranteed to happen. ;-)  And 
> when it does,
> > it'll often interact in ways we can't predict or sometimes even understand.
>
>Seconded.


I figured you had probably read it. :-)


>There are also aerospace and nuclear and failure analysis books which
>are good, but I often encourage people to start with that one.
>
> > As for pulling the plug to test stuff: I recall a demo at NetApp in the
> > early 00's.  They were talking about their fault tolerance and how great
> > it was.  So I walked up to their demo array and said, "So, it shouldn't
> > be a problem if I pulled this drive right here?"  Before I could, the
> > salesperson or tech guy (can't remember which) told me to stop.  He
> > didn't want to risk it.
> >
> > That right there said loads about their confidence in their own system.
>
>I worked for a Sun clone vendor (Axil) for a while and took some of
>our systems and storage to Comdex one year in the 90s.  We had a RAID
>unit (Mylex controller) we had just introduced.  Beforehand, I made
>REALLY REALLY SURE that the pull-the-disk and pull-the-redundant-power
>tricks worked.  And showed them to people with the "Please keep in
>mind that this voids the warranty, but here we *rip* go...".  All of
>the other server vendors were giving me dirty looks for that one.
>Apparently I sold a few systems that way.


I can imagine. Back when we were testing a cluster from MicronPC, the
techs were in our office and they encouraged us to do just that.  It
was reassuring.


>You have to watch for connector wear-out and things like that, but ...
>
>All the clusters I've built, I've insisted on a burn-in-time plug-pull
>test on all the major components.  We caught things with those from
>time to time.  Especially with N+1, if it is really N+0 due to a bug
>or flaw, you need to know that...


About 7 years back, we were about to move a production platform to a 
cluster+SAN that an outside vendor had installed.  I was brought in 
at the last minute to lead the project.  Before we did the move, I 
said, "Umm, has anyone tried a remote reboot of the servers?"

"Oh they rebooted fine when we were at the datacenter with the 
vendor.  We're good."

I repeated my question and finally fell back on the old, "OK, I know I'm
being a pain, but please, let's just try it once, remotely, before we're
committed."  So we rebooted, and waited, and waited, and waited.

It took a trip out to the datacenter (we couldn't afford good remote
KVM tools back then) to see that the server was trying to mount
something off the network at boot.  At first we couldn't figure out
what it was.  We finally realized it was looking for files on the
vendor's laptop.  So of course it had worked fine when the vendor was
at the datacenter.

Despite all that, the vendor still denied that it was their problem.
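
These days I'd script a quick sanity check for that kind of thing before
a remote reboot.  A minimal sketch of the idea (the ports and the
'nofail' heuristic are my assumptions for illustration, not what we
actually ran back then):

#!/usr/bin/env python3
# Flag fstab network mounts whose source host is unreachable, so a remote
# reboot doesn't hang waiting on a share that only existed on someone's laptop.
import socket

def risky_network_mounts(fstab="/etc/fstab", timeout=3):
    risky = []
    for line in open(fstab):
        fields = line.split("#", 1)[0].split()
        if len(fields) < 4:
            continue
        source, mountpoint, fstype, options = fields[:4]
        if fstype.startswith("nfs"):
            host, port = source.split(":", 1)[0], 2049   # host:/export
        elif fstype == "cifs":
            host, port = source.strip("/").split("/", 1)[0], 445  # //host/share
        else:
            continue
        if "nofail" in options.split(","):
            continue  # boot won't block on this one
        try:
            socket.create_connection((host, port), timeout=timeout).close()
        except OSError:
            risky.append((mountpoint, host))
    return risky

if __name__ == "__main__":
    for mountpoint, host in risky_network_mounts():
        print("WARNING: %s depends on unreachable host %s" % (mountpoint, host))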

Anyway, enough reminiscing.  Things happen.  We can only do so much
to prevent them, and we should never assume.




>--
>-george william herbert
>george.herbert at gmail.com

Greg D. Moore
http://greenmountainsoftware.wordpress.com/
CEO QuiCR: Quick, Crowdsourced Responses. http://www.quicr.net






