Did your BGP crash today?

Sat Aug 28 15:58:54 UTC 2010

On Sat, Aug 28, 2010 at 02:19:28PM +0200, Florian Weimer wrote:
> * Claudio Jeker:
> 
> > I think you blame the wrong people. The vendor should make sure that
> > their implementation does not violate the very basics of the BGP
> > protocol.
> 
> The curious thing here is that the peer that resets the session, as
> required by the spec, causes the actual damage (the session reset),
> and not the peer producing the wrong update.
> 
> This whole thread is quite schizophrenic because the consensus appears
> to be that (a) a *researcher is not to blame* for sending out a BGP
> message which eventually leads to session resets, and (b) an
> *implementor is to blame* for sending out a BGP messages which
> eventually leads to session resets.  You really can't have it both
> ways.

The researcher is not to blame because all the BGP messages he sent out
were properly formed.

The implementor is to blame because the code he wrote sent out BGP
messages which were not properly formed.

> I'm fed up with this situation, and we will fix it this time.  My take
> is that if you reset the session, you're part of the problem, and
> consequently deserve part of the blame.  So if you receive a
> properly-framed BGP update message you cannot parse, you should just
> log it, but not take down the session.

If you get your wish, and that gets implemented, in some number of years
there will be a NANOG posting (perhaps from you, perhaps not) arguing
that any malformed BGP message should result in the session being torn
down.  This will be after a router develops a failure that causes it to
send many incorrect messages, but only some of them malformed.  So the
malformed ones will be discarded, and the remainder will be propagated
throughout the Internet.  If the ones that are incorrect but not
malformed are, say, filled with more specifics for large portions of
the Internet, someone will be asking: "How could all the other routers
accept these advertisements from a router known to be broken ... it was
sending malformed advertisements, but instead of tearing down the
sessions, you decided to trust all the validly formed messages from
this known-to-be-broken router".
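
To make that scenario concrete, here is a toy model of the two policies.
It is only a sketch under stated assumptions: the Update record, the
"correct" flag (an oracle no real router has), the policy names and the
example prefixes are all made up for illustration, and a real session
would re-establish after a reset, which this ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    prefix: str
    well_formed: bool  # the message parses cleanly
    correct: bool      # hypothetical oracle: the content is actually true


def accepted_routes(updates, policy):
    """Return the prefixes still installed after processing the updates.

    policy "reset":   the first malformed update tears down the session,
                      flushing everything learned from this peer.
    policy "discard": malformed updates are dropped, the session stays
                      up, and every well-formed update is installed.
    """
    installed = []
    for u in updates:
        if not u.well_formed:
            if policy == "reset":
                return []      # session down, routes from this peer flushed
            continue           # "discard": skip this message, keep going
        installed.append(u.prefix)
    return installed


# A broken router: some messages malformed, some well formed but wrong.
broken_peer = [
    Update("192.0.2.0/24", True, True),
    Update("198.51.100.0/25", True, False),    # bogus more specific
    Update("203.0.113.0/24", False, False),    # malformed
    Update("198.51.100.128/25", True, False),  # bogus more specific
]

kept_on_reset = accepted_routes(broken_peer, "reset")
kept_on_discard = accepted_routes(broken_peer, "discard")
```

Under "discard" the two bogus more specifics stay installed and would be
propagated; under "reset" nothing from this peer survives, good or bad.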

My point is:  we can't always look at the most recent failure to decide
what the correct policy is.  We have good data on the cases where
NOTIFY on any malformed packet has caused significant outages in the
Internet.  We don't have nearly as good data on the cases where
NOTIFY-on-any-malformed-packet saved the Internet from a significant
outage.

I don't claim to know which is the bigger problem.  But any serious
argument to change the behavior needs to consider the risk from
propagating information received from a router known to be broken, on
the theory that the brokenness only causes malformed messages (which
can be discarded) and does not also cause incorrect but correctly
formed messages to be sent.

     -- Brett