Followup British Telecom outage reason

Sean Donelan sean at donelan.com
Mon Nov 26 11:28:22 UTC 2001




On Mon, 26 Nov 2001, Christian Kuhtz wrote:
> Now, if lack of infrastructure reliability can harm human life you may feel
> differently, but that isn't the case for most of us at the present time.

I've designed software and networks used for public safety and
emergencies.  And yes, people have died on my watch. It is a somewhat
different mindset, but not that different.  A lot of "good engineering
practice" applies to any engineering activity, including software
engineering.

It's not even a matter of cost.  A typical hospital spends less on
their emergency power system than an Internet/telco hotel.  The major
difference is the hospital staff knows (more or less) what to do when
the generators don't work.

The big secret is most "life safety" systems fail regularly.  Most of
the time it doesn't matter because the "big one" doesn't coincide with
the failure.


> Faults will happen.  And nothing matters as much as how you prepare for
> when they do.

Mean Time To Repair is a bigger contributor to Availability calculations
than the Mean Time To Failure.  It would be great if things never failed.
But some people are making their systems so complicated chasing the Holy
Grail of 100% uptime, they can't figure out what happened when one does
fail.
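
To put rough numbers on the repair-time point, here is a minimal sketch
using the usual steady-state formula, Availability = MTTF / (MTTF + MTTR).
The function names and the figures (1000 hours between failures, 4 hours
to repair) are made up for illustration, not measurements from any real
system.

```python
HOURS_PER_YEAR = 8760.0


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(mttf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year for the given MTTF and MTTR."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical figures, chosen only to show the shape of the trade-off.
baseline = downtime_hours_per_year(1000.0, 4.0)       # fails every 1000 h, 4 h to fix
faster_repair = downtime_hours_per_year(1000.0, 1.0)  # same failure rate, 1 h to fix
rarer_failure = downtime_hours_per_year(4000.0, 4.0)  # 4x rarer failures, 4 h to fix

print(f"baseline:      {baseline:.2f} h/year down")
print(f"faster repair: {faster_repair:.2f} h/year down")
print(f"rarer failure: {rarer_failure:.2f} h/year down")
```

Under this formula, cutting the repair time from 4 hours to 1 buys the
same availability as making failures four times rarer.  Anything that
lengthens diagnosis gives that gain straight back.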

Murphy's revenge: The more reliable you make a system, the longer it will
take you to figure out what's wrong when it breaks.




