Time to revise RFC 1771

Tue Jun 26 20:27:49 UTC 2001

This ignores three basic facts:

1) Networks tend to be homogeneous in platform.
2) Platforms tend to accept their own implementation quirks.
3) Networks peer at borders.

Therefore, under the "drop the session" rule, my bad announcement
gets to all my borders fine, and all my external peers who are not
running forgiving/compatible implementations drop their connections
to me, and all my traffic to/from them hits the floor.

One CRC error does not make PPP drop.  Why make one route cause
a catastrophic loss of connectivity?  Report the bad route,
drop it, and move on; let layer 8 resolve it.

-Dave

On 6/26/2001 at 13:08:46 -0700, Clayton Fiske said:

> The basic issue is one of scale vs integrity. However, I think this
> particular case is one in which the RFC-dictated behavior is the
> correct choice. The problem is that one [set of] router[s] did not
> follow such behavior and thus escalated the scale of the problem
> significantly.
> 
> Given that the malformed route in question was most likely originated
> from a single router, the only damage that should have been done was
> a loss of routability for networks behind that one router. While of
> course that could arguably be a significant number of networks, I
> think it's a safe assumption that X losing its peers is pretty much
> always a smaller impact than all of X's peers losing -their- peers.
> If network XYZ's routers have N peers each, the RFC-dictated
> behavior gives us N peering sessions lost (assuming the offending
> route was advertised to all peers), instead of N^2 (or greater)
> sessions as was the case.
> 
> I think the logic of dropping the session is sound. If a router
> originates one malformed route, who's to say the rest of its routes
> are correct? Perhaps other routes are corrupted, but not in ways
> detectable by the router's sanity checks. Since the offending route
> is indeed malformed, it's not unreasonable to stop trusting the
> router from which it originated. Since it's likely[1] only a single
> router is originating the route, dropping sessions to that one
> router controls the blast radius[2].
> 
> This is not to say that the issue of scale is unimportant. It most
> certainly is. However, again, if the first router(s) to receive
> the route had behaved properly, the scale of the problem would
> have been small. The only place you'd see a flap of 100,000
> routes is if the offending router was your upstream's. Everyone
> else would only see (at most) a flap of the routes originated by
> and/or behind that router (in BGP topology terms).
> 
> Perhaps a knob to control the behavior would be an acceptable
> compromise for some. I think it's a bad idea for two reasons.
> First, it allows bugs such as this to go unfixed, because when
> it happens people just adjust the knob to keep their BGP sessions
> stable. Second, it circumvents the integrity control. If a router
> has many corrupted routes, but only a few trigger the sanity
> checks for malformation, the session stays alive and the remaining
> corrupted routes are then propagated network-wide. While this may
> seem like a paranoid philosophy, a little paranoia can be good
> when considering the integrity of the larger whole.
> 
> -c
> 
> [1] = Yes, "likely" is a relative term. I know there are plenty of
>       cases where the same route is originated by multiple routers;
>       however, the odds of more than one of them corrupting a route
>       at the same time are probably slim compared to the odds of
>       a single one doing so.
> 
> [2] = In this specific case, as I understand it, the direct peers did
>       in fact drop the offending BGP session; however, they propagated
>       the offending announcement to their peers before doing so. In
>       this case, of course, the blast radius is not controlled.
> 
> 

-- 
Dave Israel
Senior Manager, IP Backbone
Intermedia Business Internet