Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

Saku Ytti saku at
Fri Jul 9 05:45:45 UTC 2021

On Fri, 9 Jul 2021 at 00:01, William Herrin <bill at> wrote:

> I would suggest that your customer does care, but as there is no

Most don't. Somewhat recently we were dropping a non-trivial number of
packets from a well-known book store due to DMAC failure. This was
unexpected, considering it was an L3 to L3 connection. It was an LACP
bundle with a large number of interfaces, and the issue affected just
one interface in the bundle. We informed the customer about the
problem while it was still occurring, but they could not observe it.
They looked at their stats, and whatever was being dropped was drowned
in the noise; it was not an actionable signal to them. The customer
wasn't willing to remove the broken interface from the bundle, as they
could not observe the problem.
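
To put rough numbers on why a single bad member can vanish in the
aggregate, here is a minimal sketch. Everything in it is an assumption
for illustration: the member count, the loss rate, the counter values
and the member names are made up, not figures from this incident, and
it assumes flows hash evenly across the bundle.

```python
# Illustrative sketch only: all numbers and names here are hypothetical,
# not data from the incident described above.

def aggregate_loss(member_loss, members):
    """Bundle-wide loss ratio when exactly one member drops `member_loss`
    of its packets and traffic is (assumed) spread evenly over `members`."""
    if members < 1:
        raise ValueError("members must be >= 1")
    return member_loss / members


def drop_ratios(counters):
    """Map member name -> dropped / offered.

    `counters` maps member name -> (offered_packets, dropped_packets).
    A member that carried no traffic gets a ratio of 0.0.
    """
    return {
        name: (dropped / offered if offered else 0.0)
        for name, (offered, dropped) in counters.items()
    }


def worst_member(counters):
    """Return the name of the member with the highest drop ratio."""
    ratios = drop_ratios(counters)
    return max(ratios, key=ratios.get)


if __name__ == "__main__":
    # Hypothetical case: one of 32 members drops 2% of what it carries.
    print(aggregate_loss(0.02, 32))

    # Hypothetical per-member counters: (offered, dropped).
    sample = {
        "member-1": (1_000_000, 3),
        "member-2": (1_000_000, 20_000),
        "member-3": (1_000_000, 5),
    }
    print(worst_member(sample))
```

With those made-up inputs the bundle-wide loss is roughly 0.06%, which
can easily sit below ordinary retransmit noise on the customer side,
while a per-member view points straight at the one bad link.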

We did migrate that port to a working port, and after 3 months we
agreed with the vendor to stop troubleshooting it. The vendor could see
that they had misprogrammed their hardware, but they were not able to
figure out why, and therefore it is not fixed. A very large number of
cycles were spent at the vendor and the operator, and a small amount of
work (checking TCP resends etc.) at the customer, trying to solve it.

The reason we contacted the customer is that we were dropping quite a
large number of packets. I can easily find 100 real, smaller problems
in our network right away.

The customer was /not/ wrong; the customer did exactly the right thing.
There are a lot of problems, and you can go deep into the rabbit hole
trying to fix problems which are real but don't affect enough packets
to have a meaningful impact on the product quality.

