Do you care about "gray" failures? Can we (network academics) help? A 10-min survey
Mark Tinka
mark at tinka.africa
Thu Jul 8 12:59:59 UTC 2021
On 7/8/21 14:29, Saku Ytti wrote:
> Network experiences gray failures all the time, and I almost never
> care, unless a customer does. If there is a network which does not
> experience these, then it's likely due to lack of visibility rather
> than issues not existing.
>
> Fixing these can take months of working with vendors and attempts to
> remedy will usually cause planned or unplanned outages. So it rarely
> makes sense to try to fix as they usually impact a trivial amount of
> traffic.
>
> Networks also routinely mangle packets in-memory which are not visible
> to FCS check.
I was going to say the exact same thing.
+1.
It's all par for the course, which is why we get up every day :-).
I'm currently dealing with an issue that will forward a customer's
traffic to/from one /24, but not the rest of their IPv4 space, including
the larger allocation from which the /24 is born. It was a gray issue
while the customer partially activated, and then caused us to care when
they tried to fully swing over.
We've also had an issue that existed for over a year but only manifested
recently: someone mistakenly wrote a static route pointing to an
indirect next-hop. The router resolved it and forwarded traffic, but in
the process it was spiking CPU in a manner that was not immediately
evident from the NMS. Fixing the next-hop resolved the issue, as would
improving service provisioning and troubleshooting manuals :-).
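For what it's worth, this kind of mistake can be caught with a simple
audit before it becomes a CPU problem. Below is a rough sketch, not what
we actually run: it assumes you can export the connected subnets and the
static routes from the router config as plain strings, and it flags any
static route whose next-hop is not on a directly connected subnet. The
function name and the sample prefixes (all documentation or private
ranges) are made up for illustration.

```python
import ipaddress


def indirect_static_routes(connected, static_routes):
    """Return (prefix, next_hop) pairs whose next-hop is not directly connected.

    connected:     iterable of connected subnets, e.g. "192.0.2.0/30"
    static_routes: iterable of (prefix, next_hop) string pairs
    Both inputs are hypothetical exports from a router configuration.
    """
    nets = [ipaddress.ip_network(n) for n in connected]
    flagged = []
    for prefix, next_hop in static_routes:
        hop = ipaddress.ip_address(next_hop)
        # A next-hop is "direct" if it falls inside any connected subnet
        # of the same address family.
        direct = any(hop.version == net.version and hop in net for net in nets)
        if not direct:
            flagged.append((prefix, next_hop))
    return flagged


# Illustrative data only.
connected = ["192.0.2.0/30", "198.51.100.0/24"]
static_routes = [
    ("203.0.113.0/24", "192.0.2.2"),      # next-hop sits on a connected /30
    ("203.0.113.128/25", "10.255.0.1"),   # next-hop needs recursive resolution
]

flagged = indirect_static_routes(connected, static_routes)
for prefix, next_hop in flagged:
    print(f"static route {prefix} uses indirect next-hop {next_hop}")
```

An indirect next-hop is not always wrong (some designs rely on recursive
resolution on purpose), so treat the output as a list of routes to
review rather than a list of errors.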
Like Saku says, there's always something, and attention to it will be
granted depending on how much visible pain it causes.
Mark.