Resilience: faults, causes, statistics, open issues

David Andersen dga at lcs.mit.edu
Thu Jan 27 16:37:33 UTC 2005


On Jan 27, 2005, at 6:39 AM, András Császár (IJ/ETH) wrote:

>
> Hi people!
>
> I've begun research on (carrier-grade, aka telecom-grade) resiliency 
> in IP transport networks. The first step would be to collect possible 
> failure events, their causes and consequences, statistics about 
> downtimes (mean time to repair) and mean times between failures, and I 
> would like to identify which of the problems are most typical (HW bug, 
> SW bug, cable cut through, plugged out (link going down), severe 
> misconfiguration).
>
> I think this is the perfect forum to get some feedback from real 
> network-operational experience.
>
> Is anyone out there who has some statistics/documents that would help 
> me in any way?

This is self-serving, but see the intro and related work sections of my 
thesis (we'll have a conference paper version of it done soon for NSDI, 
but we're still revising it.  Apologies for not having a shorter 
reference to give you):

   http://nms.lcs.mit.edu/papers/index.php?detail=113

It doesn't focus specifically on carrier failures, but it has a batch 
of references that might get you started on what the academic side 
knows.  I've also got some refs in there to some of the earlier telco 
studies, which I recommend taking a peek at.  Again, relation to year 
2005 ISP failures isn't totally clear, but it's a starting point.

Unfortunately, the reality is that we don't actually know all that much 
as far as what's _really_ happening!  Nick Feamster and I took a look 
at some of the BGP routing failures (but didn't get back to root 
causes):

http://nms.lcs.mit.edu/papers/index.php?detail=23

Nick's also done some work on configuration management and building a 
better routing protocol that's somewhat related to your question.

Ratul Mahajan examined BGP configuration errors - but it's not clear 
exactly what fraction of failures or downtime are really due to those 
errors:

http://www.cs.washington.edu/homes/ratul/bgp/index.html

David Oppenheimer studied failures at a few edge companies (app. 
service providers, hosting providers, etc.).  Has a nice breakdown of 
failure causes and durations, but it's not clear if those numbers 
directly translate to the carrier realm:

http://roc.cs.berkeley.edu/papers/usits03.pdf

Finally, google back for some of Sean Donelan's NANOG posts.  You'll 
get some good individual cases from those, though the last time I 
looked, I didn't find a big overall analysis.


> Also, do you have any suggestions on open research issues to be solved 
> in the area?

   Most of it. :)  I (and probably others on this list) would be 
interested in what you find.

   -Dave


