Soliciting your opinions on Internet routing: A survey on BGP convergence

Jakob Heitz (jheitz) jheitz at cisco.com
Wed Jan 11 19:13:14 UTC 2017


When you simply bring down an eBGP session, withdrawals propagate throughout the network.
Soon after, the alternate routes propagate. In the interim, some routers lose connectivity.
Graceful shutdown solves this problem, but it only works for a planned shutdown.
The interim can last many minutes because of the advertisement interval (MRAI timer).
One way to cut the interim from minutes to seconds is to set the MRAI timer to 0 on all routers. The risk is that any BGP instability in the network will then cause serious flapping.
Another alternative is to use BGP add-path (RFC 7911) to distribute backup routes.
This avoids the MRAI problem, but requires more memory on routers.
It also works for an accidental shutdown.
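
To give a feel for why the MRAI timer stretches the interim to minutes, here is a rough back-of-the-envelope sketch. It is not a protocol simulation. It assumes each round of path exploration after a withdrawal waits one full MRAI interval, and the function name and the round counts are made up for illustration. The 30 second figure is the value RFC 4271 suggests for eBGP.

```python
def worst_case_interim_seconds(mrai_seconds: float, exploration_rounds: int) -> float:
    """Rough upper bound on the interim after a withdrawal.

    Assumption (illustrative only): every round in which routers try
    another, usually longer, alternate path is held back by one full
    MRAI interval before the next update can be sent.
    """
    if mrai_seconds < 0 or exploration_rounds < 0:
        raise ValueError("mrai_seconds and exploration_rounds must be non-negative")
    return mrai_seconds * exploration_rounds


if __name__ == "__main__":
    # 30 s is the eBGP value suggested in RFC 4271; 0 s is the
    # "MRAI set to 0 on all routers" case. Round counts are hypothetical.
    for mrai in (30.0, 0.0):
        for rounds in (5, 10, 20):
            interim = worst_case_interim_seconds(mrai, rounds)
            print(f"MRAI={mrai:>4.0f}s rounds={rounds:>2} -> up to {interim / 60:.1f} min")
```

With MRAI at 0 the timer no longer contributes anything in this model, so the interim is bounded only by propagation and processing time, which the sketch does not model. It also says nothing about the flapping risk mentioned above.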

Thanks,
Jakob.


> -----Original Message-----
> From: Jakob Heitz (jheitz)
> Sent: Tuesday, January 10, 2017 11:52 AM
> To: nanog at nanog.org; 'baldur.norddahl at gmail.com' <baldur.norddahl at gmail.com>
> Subject: RE: Soliciting your opinions on Internet routing: A survey on BGP convergence
> 
> Hi Baldur,
> 
> Have you tried graceful shutdown?
> You need redundant links, but not to the same transit.
> https://tools.ietf.org/html/draft-ietf-grow-bgp-gshut-06
> This draft is expired, but it is actually implemented by several vendors.
> 
> I implemented this.
> http://www.slideshare.net/bduvivie/bgp-graceful-shutdown-ios-xr
> I added an option to configure AS-path prepends in case the gshut community was not supported by peers.
> 
> Thanks,
> Jakob.
> 
> 
> > Date: Tue, 10 Jan 2017 03:51:04 +0100
> > From: Baldur Norddahl <baldur.norddahl at gmail.com>
> >
> > Hello
> >
> > I find that the type of outage that affects our network the most is
> > neither of the two options you describe. As is probably typical for
> > smaller networks, we do not have redundant uplinks to all of our
> > transits. If a transit link goes, for example because we had to reboot a
> > router, traffic is supposed to reroute to the remaining transit links.
> > Internally our network handles this fairly fast for egress traffic.
> >
> > However the problem is the ingress traffic - it can be 5 to 15 minutes
> > before everything has settled down. This is the time before everyone
> > else on the internet has processed that they will have to switch to your
> > alternate transit.
> >
> > The only solution I know of is to have redundant links to all transits.
> > Going forward I will make sure we have this because it is a huge
> > disadvantage not being able to take a router out of service without
> > causing downtime for all users. Not to mention a router crash or
> > link failure, which should take seconds at most to reroute but
> > instead causes at least 5 minutes of unstable internet.
> >
> > Regards,
> >
> > Baldur


