seeing the trees in the forest of confusion

Sat Apr 26 18:23:19 UTC 1997

I agree that there appears to be some underlying problem with the BGP code
on the backbone that is delaying route withdrawals beyond a reasonable
time.  We ran into a similar problem Wednesday night where one of our
customers started advertising more specifics for our network blocks to
another transit provider (who does not filter customer routes). After
shutting down the customer's BGP peering, the bogus routes were still in
the table an hour later, at which time we started advertising our own more
specifics to restore service to our other customers -- this led to our
unfortunate position in Thursday's CIDR report.
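
For readers less familiar with why a lingering more specific is so
damaging: routers forward on the longest matching prefix, so a stale
more specific beats a correct aggregate for every address it covers.
Here is a minimal sketch of that rule in Python. The prefixes and
next-hop labels are hypothetical documentation values, not the routes
involved in this incident, and best_route is an illustrative helper,
not any vendor's implementation.

```python
import ipaddress

def best_route(table, destination):
    """Return the (prefix, next_hop) entry with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    # Keep only the entries whose prefix covers the destination address.
    matches = [(net, hop) for net, hop in table if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the most specific covering prefix wins.
    return max(matches, key=lambda entry: entry[0].prefixlen)

# Hypothetical routing table (documentation address space, not real routes).
table = [
    (ipaddress.ip_network("198.51.100.0/24"), "legitimate aggregate"),
    (ipaddress.ip_network("198.51.100.0/25"), "bogus more specific"),
]

print(best_route(table, "198.51.100.10")[1])   # covered by both /24 and /25
print(best_route(table, "198.51.100.200")[1])  # covered only by the /24
```

Until the /25 is actually withdrawn everywhere, traffic for the lower
half of the block follows it regardless of the aggregate. Advertising
our own more specifics puts an equally specific legitimate route back
in contention, which is why it restored service.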

On a possibly related note, when we stopped advertising the more specifics
4 hours later, one of our transit providers (call them X) continued to
hold some of the more specific routes in a _portion_ of their BGP tables
with a next hop pointing to another of our transit providers (call them Y)
despite the fact that Y no longer had the more specific routes
anywhere in their tables.  This continued to cause a routing loop in X's
network (due to the inconsistent routes within their IBGP mesh) for 5
hours as X attempted to isolate the problem.  After that point, X's
solution was for us to announce more specifics for the affected networks
until they could schedule some core router reloads. 

These cases seem to point to a problem with BGP route withdrawals that will
continue to increase the time it takes to recover from network problems.
Perhaps the router vendors would like to comment.

- Doug

 /  Douglas A. Junkins    |   Network Engineering        \
/   Network Engineer      |   NorthWestNet                \
\   junkins at nwnet.net     |   Bellevue, Washington, USA   /
 \  +1-206-649-7419       |                              /

On Sat, 26 Apr 1997, Alex.Bligh wrote:

> >   I suppose it is more fun to criticize policy and NSPs, but it
> >   may well be a hole in the BGP protocol, or more likely
> >   implementations in vendors' code [or user's implementation
> >   of twiddleable holddown timers].
> 
> My (possibly misinformed) understanding was that certain NSPs running
> Cisco backbones had holddown timers configured to delay withdrawals. Even
> after 7007 was disconnected, there were 7007 routes still being advertised
> well over an hour later. I do not believe these NSPs are going to have
> timers configured for >1hr.
> 
> We've seen a problem before where a transit provider (Cisco based) was
> causing us problems, and we decided to turn them off. They were still
> advertising our routes an hour later. (Provider unconnected with any
> in this case). Pulling the session back up and clearing it did not
> help things.
> 
> I'd therefore suggest that your analysis is correct. >80% of the
> downtime is due either to a protocol bug or a s/w bug somewhere, not
> NOC failure.
> 
> Alex Bligh
> Xara Networks
> 
>