Resilience: faults, causes, statistics, open issues
David Andersen
dga at lcs.mit.edu
Thu Jan 27 16:37:33 UTC 2005
On Jan 27, 2005, at 6:39 AM, András Császár (IJ/ETH) wrote:
>
> Hi people!
>
> I've begun research on (carrier-grade, aka telecom-grade) resiliency
> in IP transport networks. The first step would be to collect possible
> failure events, their causes and consequences, statistics about
> downtimes (mean time to repair) and mean times between failures, and I
> would like to identify which of the problems are most typical (HW bug,
> SW bug, cable cut through, plugged out (link going down), severe
> misconfiguration).
>
> I think this is the perfect forum to get some feedback from real
> network-operational experience.
>
> Is anyone out there who has some statistics/documents that would help
> me in any way?
This is self-serving, but see the intro and related work sections of my
thesis (we'll have a conference paper version of it done soon for NSDI,
but we're still revising it; apologies for not having a shorter
reference to give you):
http://nms.lcs.mit.edu/papers/index.php?detail=113
It doesn't focus specifically on carrier failures, but it has a batch
of references that might get you started on what the academic side
knows. I've also got some refs in there to some of the earlier telco
studies, which I recommend taking a peek at. Again, their relation to
2005-era ISP failures isn't totally clear, but it's a starting point.
Unfortunately, the reality is that we don't actually know all that much
as far as what's _really_ happening! Nick Feamster and I took a look
at some of the BGP routing failures (but didn't get back to root
causes):
http://nms.lcs.mit.edu/papers/index.php?detail=23
Nick's also done some work on configuration management and building a
better routing protocol that's somewhat related to your question.
Ratul Mahajan examined BGP configuration errors - but it's not clear
exactly what fraction of failures or downtime are really due to those
errors:
http://www.cs.washington.edu/homes/ratul/bgp/index.html
David Oppenheimer studied failures at a few edge companies (app.
service providers, hosting providers, etc.). Has a nice breakdown of
failure causes and durations, but it's not clear if those numbers
directly translate to the carrier realm:
http://roc.cs.berkeley.edu/papers/usits03.pdf
Finally, google back through some of Sean Donelan's NANOG posts. You'll
get some good individual cases from those, though the last time I
looked, I didn't find a big overall analysis.
> Also, do you have any suggestions on open research issues to be solved
> in the area?
Most of it. :) I (and probably others on this list) would be
interested in what you find.
-Dave