BGP and The zero window edge

Job Snijders job at fastly.com
Wed Apr 21 21:11:26 UTC 2021


Dear Jakob, group,

On Wed, Apr 21, 2021 at 08:59:06PM +0000, Jakob Heitz (jheitz) via NANOG wrote:
> Ben's blog details an experiment in which he advertises routes and then
> withdraws them, but some of them remain stuck for days.
> 
> I'd like to get to the bottom of this problem.

I think there are *two* problems:

1) some BGP implementations (or multi-node BGP configurations) sometimes
   end up getting stuck in one way or another.

2) other BGP nodes are not able to disconnect/reconnect to systems
   suffering from instantiations of problem #1.

While it is important to follow up on each and every instantiation of
problem #1, I personally think it is also worthwhile to explore whether
the BGP FSM itself can be redefined in a way that encourages BGP
protocol implementations to be more robust and rely less on the remote
peer behaving correctly.

Once problem #2 is addressed, finding and isolating instances of problem
#1 will become much easier.

> Has anyone else seen this before or can provide data to analyze?
> On or off list.

From the BGP Default-Free Zone perspective it is hard to differentiate
between an entire (multi-vendor) Autonomous System being stuck and just
one router being stuck.

To test individual router implementations, this tool is useful:
https://github.com/benjojo/bgp-zerowindow-test
Please keep in mind, though, that the "TCP Recv Wind == 0" trick is just
one easy way to get a BGP peer to manifest the problematic behavior.
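
For readers who want to see the underlying TCP effect without any BGP
involved, here is a rough sketch. It is not how the tool linked above
works internally (I make no claim about that); it only shows that a
receiver which accepts a connection and then never reads will eventually
stop the sender from making progress. The helper names, the 4096-byte
buffer size, and the 64 MiB safety limit are all hypothetical choices
for illustration.

```python
import socket


def stalled_listener(host="127.0.0.1", rcvbuf=4096):
    """Listening socket whose accepted connections get a small receive buffer.

    The caller is expected to accept() and then deliberately never recv().
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set before listen() so accepted sockets inherit the small buffer.
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    srv.bind((host, 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    return srv


def fill_until_blocked(sock, chunk=b"\x00" * 4096, limit=64 * 1024 * 1024):
    """Send until the kernel refuses more data; return bytes accepted.

    Once the peer's receive buffer and the local send buffer are both
    full, a non-blocking send() raises BlockingIOError.
    """
    sock.setblocking(False)
    total = 0
    while total < limit:
        try:
            total += sock.send(chunk)
        except BlockingIOError:
            break
    return total
```

Connecting to the listener, accepting the connection without ever
reading from it, and then calling fill_until_blocked() on the client
side should return well before the limit is reached. From that point
on the sender is stuck until the receiver starts reading again.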

From a BGP protocol perspective BGP nodes shouldn't inspect the TCP
receive window, but rather focus on whether all locally available
signals indicate that the remote peer is still progressing data.
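
To make "locally available signals" a bit more concrete, below is a
minimal sketch of one possible signal: whether data we queued for the
peer is actually being accepted by the local socket. This is only an
illustration under my own assumptions, not a description of any
existing implementation or specification. The class name, the method
names, and the 90 second default are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SendProgressWatchdog:
    """Tracks whether bytes queued for a peer are leaving the local queue.

    hold_time is a hypothetical local knob in seconds; it is not taken
    from any BGP specification.
    """

    hold_time: float = 90.0
    pending: int = 0  # bytes queued but not yet accepted by the socket
    last_progress: float = field(default_factory=time.monotonic)

    def queued(self, nbytes: int, now: Optional[float] = None) -> None:
        """Record that nbytes were queued for transmission."""
        now = time.monotonic() if now is None else now
        if self.pending == 0:
            # The queue was idle, so start measuring from this moment.
            self.last_progress = now
        self.pending += nbytes

    def sent(self, nbytes: int, now: Optional[float] = None) -> None:
        """Record that the socket accepted nbytes from the queue."""
        now = time.monotonic() if now is None else now
        if nbytes > 0:
            self.pending = max(0, self.pending - nbytes)
            self.last_progress = now

    def stalled(self, now: Optional[float] = None) -> bool:
        """True if data is pending and nothing has moved for hold_time."""
        now = time.monotonic() if now is None else now
        return self.pending > 0 and (now - self.last_progress) >= self.hold_time
```

A session loop could call stalled() periodically and tear the session
down when it returns True, regardless of what the remote side claims
about its TCP window. Note that the check never looks at the TCP
receive window; it only looks at whether our own queue is draining.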

Kind regards,

Job