Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

Jörg Kost jk at ip-clear.de
Thu Jul 8 13:58:57 UTC 2021


We have a similar gray issue: switches in a virtual-chassis setup with a 
layer-3 configuration seem to randomly lose transit ICMP messages such as 
echo or echo-reply. We once estimated the loss rate at around 0.00012% 
(leaving aside variance and measurement error).

We noticed this a few years back, when we replaced Nagios with burstier, 
more trigger-happy monitoring software. Since then it has reported false 
positives from time to time, which can become annoying.
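
As a rough illustration of why a loss rate this small still shows up in 
monitoring, here is a minimal sketch. It assumes independent per-packet 
loss (which a real gray failure may well violate) and a hypothetical 
volume of one million probes per day; the function names and the probe 
volume are made up for the example, not taken from our setup.

```python
import math

# 0.00012% expressed as a probability (1.2e-6), the estimate quoted above.
LOSS_RATE = 0.00012 / 100

# Hypothetical probe volume; an assumption for illustration only.
PROBES_PER_DAY = 1_000_000


def expected_false_alarms(loss_rate: float, probes: int, retries: int = 1) -> float:
    """Expected number of checks in which all `retries` probes are lost.

    Assumes every probe is lost independently with probability `loss_rate`.
    """
    return probes * loss_rate ** retries


def prob_at_least_one_loss(loss_rate: float, probes: int) -> float:
    """Probability that at least one of `probes` independent probes is lost.

    Computes 1 - (1 - p)^n via log1p/expm1 to stay accurate for tiny p.
    """
    return -math.expm1(probes * math.log1p(-loss_rate))


if __name__ == "__main__":
    print(expected_false_alarms(LOSS_RATE, PROBES_PER_DAY))
    print(prob_at_least_one_loss(LOSS_RATE, PROBES_PER_DAY))
    # Requiring two consecutive losses before alerting:
    print(expected_false_alarms(LOSS_RATE, PROBES_PER_DAY, retries=2))
```

Under these assumptions a single-probe check would be expected to 
misfire about once a day, while requiring two consecutive losses makes 
a false alarm very rare. That only holds if losses really are 
independent.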

Despite spending a lot of time debugging this, we never had a 
breakthrough in finding the root cause; we are now just looking to 
replace things in the next year.

On 8 Jul 2021, at 15:28, Mark Tinka wrote:

> On 7/8/21 15:22, Vanbever Laurent wrote:
>
>> Did you folks manage to understand what was causing the gray issue in 
>> the first place?
>
> Nope, still chasing it. We suspect a FIB issue on a transit device, 
> but currently building a test to confirm.
>
> Mark.

