BGP and The zero window edge

Wed Apr 21 23:44:01 UTC 2021

I'm not sure if this is helpful to this discussion or not, but I recently became aware of a bug in a virtual router using DPDK+VPP which sounds like it could possibly produce a similar issue to what is being described, without the TCP window being a factor.

The system used a single process both to read and to process the messages arriving on the netlink socket. While a large BGP update was being processed, the netlink buffer could fill up with new messages faster than it was drained. This caused some route updates to be dropped: they were never processed or applied to the VPP FIB, and so the affected routes became stuck. The particular vendor I spoke to about this issue resolved it by giving priority to reading and storing the messages, and then processing the stored messages asynchronously in batches.
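
To illustrate the kind of fix described above, here is a minimal sketch in Python. It is not the vendor's code: the names `reader`, `processor`, `run` and `apply_batch` are made up for illustration, and a plain iterable stands in for the netlink socket. The only point is that the reading side does nothing except move messages into an unbounded in-memory queue, while the slow work happens elsewhere in batches.

```python
import queue
import threading


def reader(source, inbox):
    """Drain the source as fast as possible; do no processing here."""
    for msg in source:
        inbox.put(msg)
    inbox.put(None)  # sentinel: the source is exhausted


def processor(inbox, apply_batch, batch_size=64):
    """Take stored messages off the queue and apply them in batches."""
    done = False
    while not done:
        batch = []
        msg = inbox.get()  # block until at least one item is available
        while True:
            if msg is None:
                done = True
                break
            batch.append(msg)
            if len(batch) >= batch_size:
                break
            try:
                msg = inbox.get_nowait()
            except queue.Empty:
                break  # nothing more queued right now; apply what we have
        if batch:
            apply_batch(batch)


def run(source, apply_batch, batch_size=64):
    """Run a reader thread and process batches in the calling thread."""
    inbox = queue.Queue()  # unbounded: the reader never blocks on a full queue
    t = threading.Thread(target=reader, args=(source, inbox))
    t.start()
    processor(inbox, apply_batch, batch_size)
    t.join()
```

For example, `run(iter(range(1000)), print)` prints the thousand items in order, in lists of at most 64. The trade-off in this sketch is that memory use grows with the backlog instead of messages being dropped when a fixed-size buffer overflows.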

I can share additional details off-list if anyone thinks this could be related to the problem.

-----Original Message-----
From: NANOG <nanog-bounces+philip.loenneker=tasmanet.com.au at nanog.org> On Behalf Of Job Snijders via NANOG
Sent: Thursday, 22 April 2021 9:25 AM
To: Jakob Heitz (jheitz) <jheitz at cisco.com>
Cc: nanog at nanog.org
Subject: Re: BGP and The zero window edge

On Wed, Apr 21, 2021 at 09:22:57PM +0000, Jakob Heitz (jheitz) wrote:
> I'd like to get some data on what actually happened in the real cases 
> and analyze it.
>
> [snip]
> 
> TCP zero window is possible, but many other things could cause it too.

Indeed. There could be a number of reasons that caused it.

Switching away from TCP win=0 towards "Zombie Routes":

*RIGHT NOW* (at the moment of writing), there are a number of zombie routes visible in the IPv6 Default-Free Zone:

One example is http://lg.ring.nlnog.net/prefix_detail/lg01/ipv6?q=2a0b:6b86:d15::/48

    2a0b:6b86:d15::/48 via:
        BGP.as_path: 204092 57199 35280 6939 42615 42615 212232
        BGP.as_path: 208627 207910 57199 35280 6939 42615 42615 212232
        BGP.as_path: 208627 207910 57199 35280 6939 42615 42615 212232
    (first announced April 15th, last withdrawn April 15th, 2021)

Another one is http://lg.ring.nlnog.net/prefix_detail/lg01/ipv6?q=2a0b:6b86:d24::/48

    2a0b:6b86:d24::/48 via:
        BGP.as_path: 201701 9002 6939 42615 212232
        BGP.as_path: 34927 9002 6939 42615 212232
        BGP.as_path: 207960 34927 9002 6939 42615 212232
        BGP.as_path: 44103 50673 9002 6939 42615 212232
        BGP.as_path: 208627 207910 34927 9002 6939 42615 212232
        BGP.as_path: 3280 34927 9002 6939 42615 212232
        BGP.as_path: 206628 34927 9002 6939 42615 212232
        BGP.as_path: 208627 207910 34927 9002 6939 42615 212232
    (first announced March 24th, last withdrawn March 24th, 2021)

Just now, I literally rebooted the BGP speaker behind lg.ring.nlnog.net to ensure that those routes are not stuck in the BGP looking glass itself.

2a0b:6b86:d24::/48 was first announced on March 24th, 2021, and withdrawn by the originator at the end of that same day. Now, almost a month later, this prefix is still visible in the default-free zone, despite WITHDRAW messages having been sent and the AS 212232 operator confirming they are not announcing that IP prefix anywhere.

I checked the AS 6939 looking glass, but the d24::/48 route is not visible in the http://lg.he.net/ web interface. This leads me to believe that the route got stuck somewhere along the way in one or more of 201701, 204092, 206628, 207910, 207960, 208627, 3280, 34927, 35280, 44103, 50673, 57199, and 9002.
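
The candidate list above can be reproduced mechanically from the AS paths pasted earlier: take every hop that appears before 6939 in any path for either prefix. The following is a rough sketch, assuming the paths are copied exactly as shown by the looking glass; the names `PATHS`, `hops_before` and `candidates` are made up for illustration.

```python
# AS paths as pasted from lg.ring.nlnog.net for the two example prefixes.
PATHS = {
    "2a0b:6b86:d15::/48": [
        "204092 57199 35280 6939 42615 42615 212232",
        "208627 207910 57199 35280 6939 42615 42615 212232",
        "208627 207910 57199 35280 6939 42615 42615 212232",
    ],
    "2a0b:6b86:d24::/48": [
        "201701 9002 6939 42615 212232",
        "34927 9002 6939 42615 212232",
        "207960 34927 9002 6939 42615 212232",
        "44103 50673 9002 6939 42615 212232",
        "208627 207910 34927 9002 6939 42615 212232",
        "3280 34927 9002 6939 42615 212232",
        "206628 34927 9002 6939 42615 212232",
        "208627 207910 34927 9002 6939 42615 212232",
    ],
}


def hops_before(path, asn):
    """Return the ASNs that appear in the path before the first `asn`."""
    hops = [int(h) for h in path.split()]
    return hops[: hops.index(asn)]


def candidates(paths, asn=6939):
    """Collect every AS seen before `asn` across all paths."""
    seen = set()
    for path_list in paths.values():
        for path in path_list:
            seen.update(hops_before(path, asn))
    return seen


if __name__ == "__main__":
    # Sorting as strings gives the same ordering as the list in the text.
    print(", ".join(sorted(str(a) for a in candidates(PATHS))))
```

This only narrows the search to ASes that still carry the route; it says nothing about which of them failed to propagate the withdrawal.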

This implies there indeed might be multiple reasons a BGP route gets stuck ('stuck' as in: a WITHDRAW was not generated, or was ignored). Perhaps on any one of these edges there is a very high Out Queue for one reason or another:

    34927 9002
    206628 34927
    44103 50673
    207960 34927
    3280 34927
    9002 6939
    201701 9002
    208627 207910
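
Edges like these can be enumerated from the d24::/48 paths by pairing each hop with the next one, up to and including 6939. This is a sketch under the assumption that the paths are exactly as pasted above; `D24_PATHS` and `edges_up_to` are illustrative names.

```python
# AS paths for 2a0b:6b86:d24::/48 as pasted from the looking glass.
D24_PATHS = [
    "201701 9002 6939 42615 212232",
    "34927 9002 6939 42615 212232",
    "207960 34927 9002 6939 42615 212232",
    "44103 50673 9002 6939 42615 212232",
    "208627 207910 34927 9002 6939 42615 212232",
    "3280 34927 9002 6939 42615 212232",
    "206628 34927 9002 6939 42615 212232",
    "208627 207910 34927 9002 6939 42615 212232",
]


def edges_up_to(paths, asn=6939):
    """Return the set of adjacent (left, right) AS pairs ending at `asn`."""
    edges = set()
    for path in paths:
        hops = [int(h) for h in path.split()]
        hops = hops[: hops.index(asn) + 1]  # keep hops up to and including asn
        edges.update(zip(hops, hops[1:]))
    return edges


if __name__ == "__main__":
    for left, right in sorted(edges_up_to(D24_PATHS)):
        print(left, right)
```

Run over the paths as pasted, this yields ten pairs: the eight listed above plus 207910 34927 and 50673 9002.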

I'm not sure all of these sightings of stuck routes can be pinpointed to one specific BGP vendor (or one bug).

Kind regards,

Job