Followup British Telecom outage reason

Ian Duncan Ian.Duncan at
Mon Nov 26 15:46:49 UTC 2001

Wandering off the subject of BT's misfortune ...

Sean Donelan wrote:

> On Mon, 26 Nov 2001, Christian Kuhtz wrote:


> > Faults will happen.  And nothing matters as much as how you prepare for
> > when they do.
> Mean Time To Repair is a bigger contributor to Availability calculations
> than the Mean Time To Failure.  It would be great if things never failed.

And Mean Time To Fault Detected (Accurately) is usually the biggest
sub-contributor within Repair but that's kinda your point.
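
A rough sketch of that arithmetic, assuming the usual steady-state formula
availability = MTTF / (MTTF + MTTR) and splitting repair time into "detect"
and "fix". The function name and every number below are made up for
illustration, not taken from any real outage.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures, for illustration only.
mttf = 1000.0   # mean time to failure, in hours
detect = 3.0    # mean time to detect the fault accurately, in hours
fix = 1.0       # mean time to fix it once correctly diagnosed, in hours

# Repair time is modelled as detection time plus fix time.
baseline = availability(mttf, detect + fix)

# Same failure rate and same fix time, but the fault is found in 30 minutes.
faster_detection = availability(mttf, 0.5 + fix)

print(f"baseline:         {baseline:.5f}")
print(f"faster detection: {faster_detection:.5f}")
```

With these assumed figures detection is three quarters of the repair time, so
shortening it moves availability more than shaving the fix itself would.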

> But some people are making their systems so complicated chasing the Holy
> Grail of 100% uptime, they can't figure out what happened when it does
> fail.

Similar people pursue the creation of a perpetuum mobile. A strange and somewhat
congruent example I stumbled into recently is:

Overall simplicity of the system, including failure detection mechanisms, and real
redundancy are the most reliable tools for availability. Of course, popping just a
few layers out, profit and politics are elements of most systems.

> Murphy's revenge: The more reliable you make a system, the longer it will
> take you to figure out what's wrong when it breaks.


