Resilience: faults, causes, statistics, open issues

András Császár (IJ/ETH) Andras.Csaszar at ericsson.com
Fri Jan 28 10:30:06 UTC 2005


Hi David, this is going to be very useful. I really appreciate it, thank you very much.

Just some comments about the root causes of BGP-related problems; maybe you will find something useful from the research perspective, although this is probably not new to you.

I found a few author groups with closely related and useful papers:

- Tim Griffin and co.
- Nick Feamster and co.
- Jennifer Rexford and co.
- Lixin Gao and co.

These people often have joint publications but sometimes separate as well. Also, Craig Labovitz and co have some very useful papers in the area of routing convergence time.

The IRTF also has some interesting, futuristic and somewhat visionary drafts about "Future Domain Routing".

As I see things now, in the case of BGP, routing divergence, configuration and policies are very strongly correlated.

A high-level conclusion (what you can probably expect from half a year of reading papers and presentations) is that the first root cause of BGP problems is the absence of a >>widely deployed and practical<< formal language for policies. Since there is no formal language, there is no compiler, and so you get unwanted anomalies resulting from your config.

My conclusion was that BGP has an analogy to software development:

SW: Specification=>High-level formal language (e.g. C++)=>Low-level formal language (assembly, binary, etc.)

Both steps can be called implementation or compilation. The good thing here is that you have automated compilers for the second step, which is harder.

BGP: Business relation=>Policies=>Router configuration

First you implement your business relations when you think out policies, but in the end you will have to implement/compile your policies as router configuration. The problem is that there is no automated compiler for the second step, since there is no formal policy language, and so verification is also very hard.

As a result you may have configuration bugs, or your config may not do what you originally wanted, or you may have inconsistency among your routers, etc.
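The two-step pipeline above (business relation => policies => router configuration) can be illustrated with a minimal sketch in Python. Everything in it is an assumption made for illustration: the customer/peer/provider export conventions, the local-preference values, the private AS numbers, and the pseudo-config output format, which is not the syntax of any real router vendor.

```python
from enum import Enum


class Relation(Enum):
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"


# Step 1 (business relation -> policy).
# Assumed convention: prefer customer routes over peer routes over provider
# routes. The numeric values are arbitrary illustrations.
LOCAL_PREF = {Relation.CUSTOMER: 200, Relation.PEER: 100, Relation.PROVIDER: 50}


def derive_policy(relation):
    """Return an abstract policy for one neighbor (hypothetical structure)."""
    if relation is Relation.CUSTOMER:
        # Customers are sent routes learned from anyone.
        export_from = {Relation.CUSTOMER, Relation.PEER, Relation.PROVIDER}
    else:
        # Peers and providers are sent only routes learned from customers.
        export_from = {Relation.CUSTOMER}
    return {"local_pref": LOCAL_PREF[relation], "export_from": export_from}


def compile_config(neighbors):
    """Step 2 (policy -> configuration): emit vendor-neutral pseudo-config."""
    lines = []
    for asn in sorted(neighbors):
        policy = derive_policy(neighbors[asn])
        sources = ",".join(sorted(r.value for r in policy["export_from"]))
        lines.append(f"neighbor AS{asn} import set-local-pref {policy['local_pref']}")
        lines.append(f"neighbor AS{asn} export routes-learned-from {sources}")
    return lines


# Hypothetical neighbors, using AS numbers from the private-use range.
neighbors = {
    64501: Relation.CUSTOMER,
    64502: Relation.PEER,
    64503: Relation.PROVIDER,
}
config = compile_config(neighbors)
for line in config:
    print(line)
```

Because the configuration is generated from one declared source, a change in a business relation propagates to every derived line, which is the property hand-written per-router configs lack.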

Of course, it is clear why such a formal language and compiler are not used in practice (different router vendors, different features, different capabilities, no standard interface, etc.), although there is, e.g., RPSL and the tools built upon RPSL. Lately, Griffin and co have begun thinking about a completely new policy language.


The second root cause that I think can be somewhat separated is that there is no practically used central database about policies. You do not necessarily know what your neighbour operators are doing (their configs and policies). As a result you may have external inconsistency (that may lead to divergence, "wedgies", etc.).

Of course, here it is also clear why, e.g., IRRs are not used or not updated frequently (the information hiding principle, which is actually the basis of the hierarchical domain structure of the internet).
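To make the external-inconsistency point concrete, here is a minimal sketch in Python of one cross-check that a shared policy database would allow. The registry layout, the relationship labels and the AS numbers are all hypothetical. It only detects neighbours whose declared views of each other do not match; it does not detect divergence or "wedgies", which depend on much subtler policy interactions.

```python
# If A calls B its customer, B should call A its provider, and so on.
INVERSE = {"customer": "provider", "provider": "customer", "peer": "peer"}


def find_mismatches(registry):
    """Return (a, b, a_view, b_view) for every neighbor pair that disagrees.

    registry maps an AS number to {neighbor AS: how this AS sees the neighbor}.
    A view of None means that side declared nothing about the other.
    """
    # Collect each unordered neighbor pair once, whichever side declared it.
    pairs = {tuple(sorted((a, b))) for a, views in registry.items() for b in views}
    mismatches = []
    for a, b in sorted(pairs):
        a_view = registry.get(a, {}).get(b)
        b_view = registry.get(b, {}).get(a)
        if a_view is None or b_view is None or INVERSE[a_view] != b_view:
            mismatches.append((a, b, a_view, b_view))
    return mismatches


# Hypothetical registry contents: AS64501 and AS64503 disagree.
registry = {
    64501: {64502: "customer", 64503: "peer"},
    64502: {64501: "provider"},
    64503: {64501: "customer"},
}
for mismatch in find_mismatches(registry):
    print(mismatch)
```

The check needs both operators' declarations, which is exactly the information that is usually not shared.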


So, in the end, although we can possibly identify the root causes behind BGP problems, I'm not sure they can ever be fully eliminated. OK, I can imagine a formal language and config compiler, and one can find verification tools as well, but I can hardly imagine, e.g., the sharing of policies (although some papers describe methods to infer the necessary knowledge from measurements).

Thanks again for your help,
András

p.s. Sorry for the long mail :) :)


----Original Message----
From: David Andersen [mailto:dga at lcs.mit.edu]
Sent: January 27, 2005 17:38
To: András Császár (IJ/ETH)
Cc: nanog at merit.edu
Subject: Re: Resilience: faults, causes, statistics, open issues

> On Jan 27, 2005, at 6:39 AM, András Császár (IJ/ETH) wrote:
> 
>> 
>> Hi people!
>> 
>> I've begun research on (carrier-grade, aka telecom-grade) resiliency
>> in IP transport networks. The first step would be to collect possible
>> failure events, their causes and consequences, statistics about
>> downtimes (mean time to repair) and mean times between failures, and
>> I would like to identify which of the problems are most typical (HW
>> bug, SW bug, cable cut through, plugged out (link going down),
>> severe misconfiguration). 
>> 
>> I think this is the perfect forum to get some feedback from real
>> network-operational experience.
>> 
>> Is anyone out there who has some statistics/documents that would help
>> me in any way?
> 
> This is self-serving, but see the intro and related work sections of
> my thesis (we'll have a conference paper version of it done soon for
> NSDI, but we're still revising it.  Apologies for not having a shorter
> reference to give you):
> 
>    http://nms.lcs.mit.edu/papers/index.php?detail=113
> 
> It doesn't focus specifically on carrier failures, but it has a batch
> of references that might get you started on what the academic side
> knows.  I've also got some refs in there to some of the earlier telco
> studies, which I recommend taking a peek at.  Again, relation to year
> 2005 ISP failures isn't totally clear, but it's a starting point.
> 
> Unfortunately, the reality is that we don't actually know all that
> much as far as what's _really_ happening!  Nick Feamster and I took a
> look at some of the BGP routing failures (but didn't get back to root
> causes):
> 
> http://nms.lcs.mit.edu/papers/index.php?detail=23
> 
> Nick's also done some work on configuration management and building a
> better routing protocol that's somewhat related to your question.
> 
> Ratul Mahajan examined BGP configuration errors - but it's not clear
> exactly what fraction of failures or downtime are really due to those
> errors:
> 
> http://www.cs.washington.edu/homes/ratul/bgp/index.html
> 
> David Oppenheimer studied failures at a few edge companies (app.
> service providers, hosting providers, etc.).  Has a nice breakdown of
> failure causes and durations, but it's not clear if those numbers
> directly translate to the carrier realm:
> 
> http://roc.cs.berkeley.edu/papers/usits03.pdf
> 
> Finally, google back for some of Sean Donelan's NANOG posts.  You'll
> get some good individual cases from those, though the last time I
> looked, I didn't find a big overall analysis.
> 
> 
>> Also, do you have any suggestions on open research issues to be
>> solved in the area?
> 
>    Most of it. :)  I (and probably others on this list) would be
> interested in what you find.
> 
>    -Dave


