Quick question.

Robert E. Seastrom rs at seastrom.com
Sun Aug 1 22:37:36 UTC 2004



"Michel Py" <michel at arneill-py.sacramento.ca.us> writes:

> The dead processor still has to be replaced, but this is scheduled
> maintenance, not outage. A little extra ammo when you have to hunt five
> or six nines.

MTTR on a single box is irrelevant when you are off playing Ponce de
Leon, hunting the Fountain of Five or Six Nines.  Even when your
architecture doesn't depend on any one particular machine (or even whole
big sets of machines) being available, you don't get to "five or six
nines"... just ask Google, Akamai, or Microsoft - there are other
things beyond your control that spoil the picnic first.

As has been observed time and time again, the tried and true way to
make five or six nines of reliability in a system of more than trivial
complexity is to take a lesson from the telcos (the progenitors of the
"five nines" lie) and build a framework and evaluation methodology
that excludes broad classes of unavailability-causing events or
prorates them in such a way as to make them non-reportable.  Add to
that list incrementally, until the reportable downtime that remains
shows your target number of nines of reliability.  Presto, five nines.
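
The arithmetic behind that recipe can be sketched in a few lines of
Python.  This is only an illustration: the outage categories and minute
counts are made up, and the function names are hypothetical rather than
anything a telco actually uses.  It treats five nines over a 365-day
year as a budget of about 5.26 minutes of downtime and keeps excluding
the largest category until what is left fits.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year

# Hypothetical outage log for one year: category -> minutes of downtime.
# These figures are invented for illustration only.
SAMPLE_OUTAGES = {
    "upstream fiber cut": 240,
    "scheduled maintenance": 180,
    "power failure": 95,
    "software upgrade gone wrong": 60,
    "dead processor": 4,
}


def downtime_budget(nines, period=MINUTES_PER_YEAR):
    """Minutes of downtime allowed per period at the given number of nines."""
    return period * 10 ** (-nines)


def exclude_until_target(outages, nines, period=MINUTES_PER_YEAR):
    """Exclude the largest outage categories until the rest fits the budget.

    Returns (excluded_categories, remaining_reportable_minutes).
    """
    budget = downtime_budget(nines, period)
    remaining = sum(outages.values())
    excluded = []
    # Walk categories from largest to smallest, declaring each one
    # "non-reportable" until the reportable total is within budget.
    for category, minutes in sorted(
        outages.items(), key=lambda kv: kv[1], reverse=True
    ):
        if remaining <= budget:
            break
        excluded.append(category)
        remaining -= minutes
    return excluded, remaining


if __name__ == "__main__":
    total = sum(SAMPLE_OUTAGES.values())
    excluded, remaining = exclude_until_target(SAMPLE_OUTAGES, nines=5)
    print(f"actual downtime:     {total} min "
          f"({100 * (1 - total / MINUTES_PER_YEAR):.4f}% available)")
    print(f"excluded categories: {excluded}")
    print(f"reportable downtime: {remaining} min "
          f"({100 * (1 - remaining / MINUTES_PER_YEAR):.4f}% available)")
```

With these sample numbers, 579 minutes of real downtime (roughly 99.89%
availability) shrinks to 4 reportable minutes once four categories have
been excluded, which fits inside the five-nines budget.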

                                        ---Rob




