Katrina Network Damage Report

Sat Sep 10 18:25:05 UTC 2005

Todd Underwood wrote:
> Sean Donelan wrote:
>> Todd Underwood wrote:
>> > the general idea is:  take a large peerset sending you full
>> > routes, keep every update forever, and take a reasonably long (at
>> > least a month or two) time horizon. calculate a consensus view for
>> > each prefix as to whether that prefix is reachable by some set of
>> > those peers.  an outaged prefix is one that used to be reachable that
>> > no longer is.  in other words, one that has been withdrawn from
>> > the full table by some sufficiently large number of peers.
>> 
>> This describes a partitioning, not necessarily an outage.
>
>can you explain what you mean?

I'm not sure if Sean's thinking the same thing I am,
but let me chime in with a nickel's worth of commentary.

There are some inconsistent terms used in computer
dependability research, but I prefer and use two
key definitions: failure (something is offline)
and outage (customer sees the service offline).

Various kinds of redundancy can hide failures from customers
and keep them from being true outages.

Looking at the routing tables, you see failures.
If a prefix goes away completely and utterly,
and is truly unreachable, then anyone trying to
see it is going to see an outage.  But you can
have a lot of intermediate cases where routes are
mostly down but not completely, or where parts
of the net can see it but other parts can't
due to the vagaries of route propagation
and partial failures.
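
To make the intermediate cases concrete, here is a minimal sketch
(an illustration only, not Todd's actual method) that classifies a
prefix by the fraction of peers that still carry a route for it.
The function name, the shape of the per-peer data and the 0.9 / 0.1
thresholds are all assumptions made up for this example.

```python
def classify_prefix(prefix, peer_tables, up_threshold=0.9, down_threshold=0.1):
    """Classify a prefix by how many peers currently carry a route to it.

    peer_tables is a hypothetical, simplified view of each peer's full
    table: a dict mapping a peer name to the set of prefixes it announces.
    """
    if not peer_tables:
        raise ValueError("need at least one peer")

    # Count the peers whose table still contains the prefix.
    seen = sum(1 for table in peer_tables.values() if prefix in table)
    fraction = seen / len(peer_tables)

    if fraction >= up_threshold:
        return "reachable"   # nearly every peer still has a route
    if fraction <= down_threshold:
        return "withdrawn"   # nearly every peer has lost the route
    return "partial"         # some parts of the net see it, others don't


# Documentation prefixes (RFC 5737) used as stand-in data.
peers = {
    "peer_a": {"192.0.2.0/24", "198.51.100.0/24"},
    "peer_b": {"192.0.2.0/24"},
    "peer_c": {"192.0.2.0/24"},
    "peer_d": {"192.0.2.0/24", "198.51.100.0/24"},
}
```

With this data, 192.0.2.0/24 is seen by all four peers, 198.51.100.0/24
by two of them, and 203.0.113.0/24 by none, so the three come out as
"reachable", "partial" and "withdrawn" respectively.  The "partial"
bucket is the one a simple up/down view of the table hides.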

And there are situations where the route is
down but the service is still up.

There are other network monitoring groups
that do end to end connectivity tests from
geographically distributed clients out to
sample systems around the net.  Some for research
and some for hire for network monitoring.

I think what they do is much closer to
identifying true outages than your method.
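
A rough sketch of how the two views could be combined, using the
failure/outage definitions above.  This is a toy under stated
assumptions: the function name, the labels and the 50% probe
threshold are invented for illustration and do not describe any
particular monitoring service.

```python
def assess(route_withdrawn, probes_ok, probes_total, outage_threshold=0.5):
    """Combine a routing-table signal with end-to-end probe results.

    route_withdrawn: True if the routing view says the prefix is gone.
    probes_ok / probes_total: how many end-to-end tests succeeded.
    outage_threshold: assumed fraction of failed probes that counts
    as customers seeing the service offline.
    """
    if probes_total <= 0:
        raise ValueError("need at least one probe")
    if not 0 <= probes_ok <= probes_total:
        raise ValueError("probes_ok must be between 0 and probes_total")

    failed_fraction = (probes_total - probes_ok) / probes_total
    customers_affected = failed_fraction >= outage_threshold

    if route_withdrawn and customers_affected:
        return "failure and outage"
    if route_withdrawn:
        return "failure without outage"      # route down, service still up
    if customers_affected:
        return "outage not seen in routing"  # route present, service down
    return "ok"
```

For example, assess(True, 10, 10) gives "failure without outage",
which is the case a routing-only method would report as an outage.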

-george william herbert
gherbert at retro.com