Do you care about "gray" failures? Can we (network academics) help? A 10-min survey
lvanbever at ethz.ch
Thu Jul 8 13:22:06 UTC 2021
> On 8 Jul 2021, at 14:59, Mark Tinka <mark at tinka.africa> wrote:
> On 7/8/21 14:29, Saku Ytti wrote:
>> Network experiences gray failures all the time, and I almost never
>> care, unless a customer does. If there is a network which does not
>> experience these, then it's likely due to lack of visibility rather
>> than issues not existing.
>> Fixing these can take months of working with vendors and attempts to
>> remedy will usually cause planned or unplanned outages. So it rarely
>> makes sense to try to fix as they usually impact a trivial amount of
>> Networks also routinely mangle packets in-memory which are not visible
>> to FCS check.
> I was going to say the exact same thing.
> It's all par for the course, which is why we get up every day :-).
> I'm currently dealing with an issue that will forward a customer's traffic to/from one /24, but not the rest of their IPv4 space, including the larger allocation from which the /24 is born. It was a gray issue while the customer partially activated, and then caused us to care when they tried to fully swing over.
Did you folks manage to understand what was causing the gray issue in the first place?
> We've had an issue that has lasted over a year but only manifested recently, where someone wrote a static route pointing to an indirect next-hop, mistakenly. The router ended up resolving it and forwarding traffic, but in the process, was spiking CPU in a manner that was not immediately evident from the NMS. Fixing the next-hop resolved the issue, as would improving service provisioning and troubleshooting manuals :-).
Interesting. I can see how hard this one is to debug, as even a relatively small amount of traffic hitting the static route would be enough to make the CPU spike.
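As a rough illustration of how that kind of misconfiguration might be caught before it bites, here is a minimal sketch that flags static routes whose next-hop does not sit on any directly connected subnet. Everything in it is an assumption for illustration only: the function name, the input format, and the example prefixes (taken from the documentation address ranges) are hypothetical, and a real check would have to pull routes and interface addresses from the actual device configuration.

```python
import ipaddress


def find_indirect_next_hops(static_routes, connected_subnets):
    """Return the static routes whose next-hop is not on a connected subnet.

    static_routes: iterable of (prefix, next_hop) string pairs (hypothetical format).
    connected_subnets: iterable of subnet strings configured on local interfaces.
    """
    connected = [ipaddress.ip_network(s) for s in connected_subnets]
    flagged = []
    for prefix, next_hop in static_routes:
        hop = ipaddress.ip_address(next_hop)
        # A next-hop outside every connected subnet has to be resolved
        # recursively through another route, i.e. it is "indirect".
        if not any(hop in net for net in connected):
            flagged.append((prefix, next_hop))
    return flagged


# Hypothetical example data, not taken from any real network.
connected_subnets = ["192.0.2.0/30", "198.51.100.0/24"]
static_routes = [
    ("203.0.113.0/24", "192.0.2.2"),      # next-hop on a connected /30
    ("203.0.113.128/25", "10.255.0.1"),   # next-hop not directly connected
]

print(find_indirect_next_hops(static_routes, connected_subnets))
```

An indirect next-hop is sometimes intentional, so a check like this would only produce candidates for review rather than definite errors.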
> Like Saku says, there's always something, and attention to it will be granted depending on how much visible pain it causes.
Yep. Makes absolute sense.