Thu Nov 5 02:57:33 UTC 2009

On Wed, 04 Nov 2009 12:26:15 CST, Joe Greco said:

> With power:
> N+1 is usually better than N
> Best to assume full load when doing math
> Things will go wrong, predict common failures

And uncommon ones. :)

So as part of a major compute-cluster install, we upgraded our UPS and diesel
generator one weekend, and breathed a collective sigh of relief that we were
now safe from power outages, having mostly dodged a bullet.  We *did* have some
scary moments when we discovered that (a) of the 400 or so disks on our Sun
E10K, about 10 didn't spin up again and (b) several of the boot disks on said
box weren't mirrored.  Fortunately, none of the 10 failures were on a non-mirrored
disk.  By Tuesday, all the non-mirrored boot disks were in fact mirrored.

That Friday, a bozo contractor relocating a doorway managed to set off the
Halon. Only lost two disks on the E10K.  Guess which two? ;)

And a month later, we discovered that the nice shiny new automatic cutover
switch was wired in backwards, necessitating another power outage to re-wire it.

So much for safe from power outages... :)
