Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

Thu Jul 8 14:22:30 UTC 2021

>
>  If there is a network which does not
> experience these, then it's likely due to lack of visibility rather
> than issues not existing.
>

This. Full stop.

I believe there are very few, if any, production networks in existence
that have a 0% rate of drops or 'weird shit'.

Monitoring for said drops and weird shit is important, and knowing your
traffic profiles is also important so that when there is an intersection of
'stuff' and 'stuff that noticeably impacts traffic', you can get to the
bottom of it quickly and figure out what to do.

On Thu, Jul 8, 2021 at 8:31 AM Saku Ytti <saku at ytti.fi> wrote:

> On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent <lvanbever at ethz.ch> wrote:
>
> > Detecting whole-link and node failures is relatively easy nowadays
> > (e.g., using BFD). But what about detecting gray failures that only
> > affect a *subset* of the traffic, e.g. a router randomly dropping 0.1%
> > of the packets? Does your network often experience these gray failures?
> > Are they problematic? Do you care? And can we (network researchers) do
> > anything about it?
>
> Network experiences gray failures all the time, and I almost never
> care, unless a customer does. If there is a network which does not
> experience these, then it's likely due to lack of visibility rather
> than issues not existing.
>
> Fixing these can take months of working with vendors and attempts to
> remedy will usually cause planned or unplanned outages. So it rarely
> makes sense to try to fix as they usually impact a trivial amount of
> traffic.
>
> Networks also routinely mangle packets in-memory which are not visible
> to FCS check.
>
> --
>   ++ytti
>