Resilience: faults, causes, statistics, open issues

David Andersen dga at
Fri Jan 28 18:43:51 UTC 2005

On Jan 28, 2005, at 5:30 AM, András Császár (IJ/ETH) wrote:
> Just some comments about the root causes of BGP related problems, 
> maybe you find something useful from the research perspective, 
> although probably this is not going to be new for you.
> I found a few author groups with very related and useful papers:
> - Tim Griffin and co.
> - Nick Feamster and co.
> - Jennifer Rexford and co.
> - Lixin Gao and co.

   Yup.  That particular group you mentioned has a lot of interplay.

> These people often have joint publications but sometimes separate as 
> well. Also, Craig Labovitz and co have some very useful papers in the 
> area of routing convergence time.

Yes.  There's also Morley Mao's convergence work.

> As I see things now, in case of BGP, routing divergence, configuration 
> and policies have a very strong correlation.
> A high-level conclusion (what you can probably expect from half a year 
> of paper- and presentation-reading research) is that the first root 
> cause of BGP problems is the absence of a >>widely deployed and 
> practical<< formal language for policies. Since there is no formal 
> language, there is no compiler, and so you have unwanted anomalies 
> resulting from your config.

   In a sense.  I think that this is one of the root causes, but it's 
perhaps not the only one.  I think we can group it into two areas:

   a)  Fundamental BGP problems (e.g., the convergence and flap damping 
issues).  By "fundamental" I don't mean uncorrectable - I simply mean 
that they're "features" of the protocol as it exists today.  Some may be 
fundamental trade-offs in global routing;  I don't know.

   b)  The abovementioned policy issue

Some of the issues in (a) can be corrected through (b) - for example, 
the Gao/Rexford examination of what policies can be permitted if you 
want to ensure stable routing.  Given that BGP is a strongly 
policy-driven beast, many, many of its problems do arise from this.
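
To make that concrete, here is a minimal Python sketch of a per-AS check 
along the lines of the Gao/Rexford guidelines: prefer customer-learned 
routes over peer- and provider-learned ones, and export peer- or 
provider-learned routes only to customers.  The names (Rel, 
export_allowed, check_policy) and the way a policy is represented are 
assumptions made for illustration, not any real tool or config format, 
and the sketch does not check the further condition that the 
customer-provider hierarchy is free of cycles.

```python
from enum import Enum


class Rel(Enum):
    """Business relationship with a neighboring AS (hypothetical model)."""
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"


def export_allowed(learned_from, send_to):
    # Customer-learned routes may be sent to any neighbor; routes learned
    # from a peer or provider may be sent only to customers.
    return learned_from is Rel.CUSTOMER or send_to is Rel.CUSTOMER


def check_policy(local_pref, exports):
    """Return a list of guideline violations for one AS.

    local_pref: dict mapping Rel -> integer preference (higher wins).
    exports: set of (learned_from, send_to) pairs the config permits.
    """
    problems = []
    # Preference guideline: customer routes strictly preferred.
    for other in (Rel.PEER, Rel.PROVIDER):
        if local_pref[Rel.CUSTOMER] <= local_pref[other]:
            problems.append(
                f"customer routes not preferred over {other.value} routes")
    # Export guideline: check every permitted (source, destination) pair.
    ordered = sorted(exports, key=lambda p: (p[0].value, p[1].value))
    for learned_from, send_to in ordered:
        if not export_allowed(learned_from, send_to):
            problems.append(
                f"{learned_from.value} route exported to {send_to.value}")
    return problems
```

A policy that passes returns an empty list; one that, say, re-exports a 
peer-learned route to a provider gets flagged.  This only looks at one 
AS in isolation, which is part of why sharing or inferring policies 
across providers matters.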

> So, in the end, although we can possibly identify the root causes 
> behind BGP problems, I'm not sure they can ever be fully eliminated. 
> OK, I can imagine a formal language and config compiler, and one can 
> find verification tools as well, but I can hardly imagine e.g. the 
> sharing of policies (although some papers write about methods to infer 
> the necessary knowledge from measurements).

Agreed.  I think we'll make steps, though, and I think that groups of 
collaborating providers can probably implement some of the solutions 
between themselves in ways that make sense.

> p.s. Sorry for the long mail :) :)

No worries - quite interesting.  (to me, at least!)

