Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

Vanbever Laurent lvanbever at
Thu Jul 8 14:59:23 UTC 2021

> One method is collecting lookup exceptions. We scrape these:
>    command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\""
>    command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show pechip trapstats' fpc$fpc; done\""
>    command = "show controllers np counters all"
>    command = "show pfe statistics exceptions"
> No need for ML or AI, as trivial algorithms like 'what counter is
> incrementing which isn't incrementing elsewhere' or 'what counter is
> not incrementing which is incrementing elsewhere' show a lot of real
> problems, and capturing those exceptions and reviewing them confirms
> the problems. We do not use these to proactively find problems, as it
> would lead to poorer overall availability. But we regularly use them
> to shorten time to resolution.
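
To check that I understand the comparison, here is a minimal sketch of the two "trivial algorithms" quoted above. It assumes the scraped output has already been parsed into per-linecard counter snapshots; the snapshot layout, the names `deltas` and `find_outliers`, and the counter names in the usage example are all hypothetical, not something taken from your tooling.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: unit name (e.g. "FPC1") -> counter name -> value.
Snapshot = Dict[str, Dict[str, int]]


def deltas(before: Snapshot, after: Snapshot) -> Snapshot:
    """Per-unit, per-counter difference between two scrapes."""
    out: Snapshot = {}
    for unit, counters in after.items():
        prev = before.get(unit, {})
        # A counter absent from the earlier scrape is treated as unchanged.
        out[unit] = {
            name: value - prev.get(name, value)
            for name, value in counters.items()
        }
    return out


def find_outliers(before: Snapshot, after: Snapshot) -> List[Tuple[str, str, str]]:
    """Flag counters that move on exactly one unit, or stall on exactly one."""
    d = deltas(before, after)
    names = sorted({name for counters in d.values() for name in counters})
    findings: List[Tuple[str, str, str]] = []
    for name in names:
        # Negative deltas (counter reset or wrap) are counted as "not moving".
        moving = sorted(u for u, c in d.items() if c.get(name, 0) > 0)
        idle = sorted(u for u in d if u not in moving)
        if len(moving) == 1 and idle:
            findings.append((name, moving[0], "incrementing only here"))
        elif len(idle) == 1 and len(moving) >= 2:
            findings.append((name, idle[0], "not incrementing only here"))
    return findings


if __name__ == "__main__":
    before = {
        "FPC0": {"bad_l2": 10, "ttl_exp": 100},
        "FPC1": {"bad_l2": 5, "ttl_exp": 200},
        "FPC2": {"bad_l2": 7, "ttl_exp": 300},
    }
    after = {
        "FPC0": {"bad_l2": 10, "ttl_exp": 150},
        "FPC1": {"bad_l2": 905, "ttl_exp": 260},
        "FPC2": {"bad_l2": 7, "ttl_exp": 300},
    }
    for finding in find_outliers(before, after):
        print(finding)
```

A real deployment would presumably need a threshold rather than "greater than zero", and would compare only units carrying similar traffic; both are left out here to keep the sketch short.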

Thanks for sharing! I guess this process working means the counters are "standard" / close enough across vendors to allow for comparisons?

> Very recently we had a Tomahawk (EZchip) reset the whole linecard,
> and looking at the counters for one that was incrementing but likely
> should not have been revealed the problem. A customer was sending us
> IP packets where the Ethernet header and the IP header up to the
> total length field were missing on the wire. This accidentally fuzzed
> the NPU ucode, periodically triggering an NPU bug that causes a total
> LC reload when it happens often enough.

>>> Networks also routinely mangle packets in memory, which is not
>>> visible to the FCS check.
>> Added to the list... Thanks!
> The only way I know to try to find these memory corruptions is to
> look at the backbone-facing interface of the egress PE device and see
> if there are IP checksum errors.
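
For readers less familiar with why the IP checksum can catch what the FCS misses: the FCS is recomputed at every hop, while the IPv4 header checksum is only adjusted when the header is legitimately rewritten, so a header mangled in memory can fail it downstream. Below is a minimal sketch of the standard ones' complement verification, assuming a raw IPv4 header is available as bytes; the function name `ipv4_header_checksum_ok` is hypothetical. Note that it covers the header only, not the payload.

```python
import struct


def ipv4_header_checksum_ok(header: bytes) -> bool:
    """Return True if the IPv4 header checksum verifies.

    Sums the header as 16-bit big-endian words (checksum field included),
    folds the carries back in, and expects 0xFFFF.
    """
    if len(header) < 20:
        return False
    ihl = (header[0] & 0x0F) * 4  # header length in bytes
    if ihl < 20 or len(header) < ihl:
        return False
    total = 0
    for (word,) in struct.iter_unpack("!H", header[:ihl]):
        total += word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF


if __name__ == "__main__":
    good = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
    # Same header with one bit flipped in the total length field.
    bad = bytes.fromhex("45000072000040004011b861c0a80001c0a800c7")
    print(ipv4_header_checksum_ok(good), ipv4_header_checksum_ok(bad))
```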
