Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

Mark Tinka mark at tinka.africa
Thu Jul 8 12:59:59 UTC 2021



On 7/8/21 14:29, Saku Ytti wrote:

> Network experiences gray failures all the time, and I almost never
> care, unless a customer does. If there is a network which does not
> experience these, then it's likely due to lack of visibility rather
> than issues not existing.
>
> Fixing these can take months of working with vendors and attempts to
> remedy will usually cause planned or unplanned outages. So it rarely
> makes sense to try to fix as they usually impact a trivial amount of
> traffic.
>
> Networks also routinely mangle packets in-memory which are not visible
> to FCS check.

I was going to say the exact same thing.

+1.

It's all par for the course, which is why we get up every day :-).

I'm currently dealing with an issue where we forward a customer's
traffic to/from one /24, but not the rest of their IPv4 space, including
the larger allocation from which the /24 is carved. It was a gray issue
while the customer was only partially activated, and it made us care when
they tried to fully swing over.

We've had an issue that lasted over a year but only manifested
recently, where someone mistakenly wrote a static route pointing to an
indirect next-hop. The router ended up resolving it and forwarding
traffic, but in the process it was spiking CPU in a manner that was not
immediately evident from the NMS. Fixing the next-hop resolved the
issue, as would improving service provisioning and troubleshooting
manuals :-).
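
A provisioning-time check along these lines might catch that kind of
static route before it goes live. This is only a rough sketch: the
function name, the data layout, and the addresses (mostly documentation
ranges) are made up for illustration, and it assumes the list of
directly connected subnets can be pulled from the router or an
inventory system by some other means.

```python
import ipaddress


def find_indirect_next_hops(static_routes, connected_subnets):
    """Return static routes whose next-hop is not on a connected subnet.

    static_routes: iterable of (prefix, next_hop) strings (hypothetical layout).
    connected_subnets: iterable of directly connected subnets as strings.
    """
    connected = [ipaddress.ip_network(s) for s in connected_subnets]
    flagged = []
    for prefix, next_hop in static_routes:
        hop = ipaddress.ip_address(next_hop)
        # A next-hop outside every connected subnet forces the router
        # to resolve it recursively via some other route.
        if not any(hop in net for net in connected):
            flagged.append((prefix, next_hop))
    return flagged


# Hypothetical example data.
connected_subnets = ["192.0.2.0/30", "198.51.100.0/24"]
static_routes = [
    ("203.0.113.0/24", "192.0.2.2"),      # next-hop on a connected /30
    ("203.0.113.128/25", "10.255.0.1"),   # next-hop not directly connected
]

flagged = find_indirect_next_hops(static_routes, connected_subnets)
print(flagged)
```

Recursive static routes are sometimes intentional, so a hit here is a
prompt for a human to look, not proof of a mistake.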

Like Saku says, there's always something, and attention to it will be 
granted depending on how much visible pain it causes.

Mark.


