RS/960 upgrade ... status report

Fri May 1 04:23:18 UTC 1992

	From: William Manning <bmanning at is.rice.edu>
	Subject: Re: RS/960 upgrade ... status report
	Date: Thu, 30 Apr 92 23:07:02 CDT

	Mark said:
	observed during RS/960 testing. The problem involved some aberrant
	behavior of rcp_routed when a misconfigured regional peer router would
	advertise a route via BGP to an ENSS whose next hop was the ENSS
	itself. The old rcp_routed could go into a loop sending multiple
	redirect packets out onto the subnet. The new rcp_routed will close
	the BGP session if it receives an announcement of such a route.  The
	new rcp_routed software also has support for externally administered
	inter-AS metrics, an auto-restart capability, and bug fixes for BGP
	overruns with peer routers.

	        This deployment caused a few problems. One is that this
	new feature of rcp_routed pointed out a misconfigured peer router
	at Rice University in Houston. This caused the BGP connection to
	open and close rapidly which caused further problems on the peer
	router. Eventually the peer was reconfigured to remove the
	bad route, which fixed the problem. Another problem was on the
	Argonne ENSS. This node crashed in such a way that it was

			----------------------------------

	What he did not say:

		The new rcp_routed will (virtually) immediately after the
	close, reopen a BGP session and pump all known routes at the BGP peer.
	If the peer is already working on processing the old ones, this adds an
	unneeded burden on the regional peer.  I have not reviewed the spec in
	detail, but there should be something in place to prevent a constant
	cycle of close, open, slam 5k routes, close, open, slam 5k routes,
	close...  well, you get the picture. This points out a configuration
	problem with BGP in the ANS T3 routers, in addition to the less than
	optimal configuration that we had in our ciscos.  

	Just my two cents from the other side of the fence.
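
For illustration only, the "something in place" Bill asks for could be as
simple as a growing delay between reopen attempts. The sketch below is not
taken from rcp_routed or from the BGP spec; the function name and every
parameter value are hypothetical.

```python
def reconnect_delays(base=1.0, factor=2.0, cap=60.0, attempts=6):
    """Return the wait, in seconds, before each successive reopen attempt.

    Hypothetical parameters: the first wait is `base`, each later wait is
    multiplied by `factor`, and no wait exceeds `cap`.
    """
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, cap))  # never wait longer than the cap
        delay *= factor                 # back off further after each close
    return delays


if __name__ == "__main__":
    for attempt, wait in enumerate(reconnect_delays(), start=1):
        print(f"reopen attempt {attempt}: wait {wait:.0f}s")
```

With a schedule like this, a peer that keeps sending the bad route sees the
session reopened less and less often instead of in a tight loop.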

Bill,

   Given that we haven't heard of anyone else's peer router dying a horrible
death since the new rcp_routed went in, I assume that the "less than optimal
configuration" isn't a *common* mistake.  But, for the benefit of those of us
who are soon to run BGP (and who seem to have a knack for encountering uncommon
mistakes :-), can you fill us in on what the actual config problem was?  Were
you redistributing your EGP-learned-routes back to BGP or something?

Mark,

    Wouldn't it make more sense for the ENSS to just ignore the offending route
rather than close the BGP session, especially given the lack of a delay before
the session gets reestablished?
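
    To make the question concrete, here is a rough sketch of "ignore the
offending route".  It is not rcp_routed code; the function name, the
(prefix, next hop) tuples, and the addresses are all made up for the example.

```python
def filter_announcements(announcements, local_addr):
    """Split announced routes into accepted and ignored.

    `announcements` is a list of (prefix, next_hop) string pairs
    (a hypothetical representation). A route whose next hop is the
    receiving router's own address is ignored; the session stays up.
    """
    accepted, ignored = [], []
    for prefix, next_hop in announcements:
        if next_hop == local_addr:
            ignored.append(prefix)  # drop only the bad route
        else:
            accepted.append((prefix, next_hop))
    return accepted, ignored


if __name__ == "__main__":
    routes = [
        ("198.51.100.0/24", "192.0.2.2"),
        ("203.0.113.0/24", "192.0.2.1"),  # next hop is the receiver itself
    ]
    ok, bad = filter_announcements(routes, local_addr="192.0.2.1")
    print("accepted:", ok)
    print("ignored:", bad)
```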

Dan