Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

Thu Jul 8 17:19:43 UTC 2021

On Thu, Jul 8, 2021 at 8:32 AM Saku Ytti <saku at ytti.fi> wrote:
>
> On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent <lvanbever at ethz.ch> wrote:
>
> > Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray failures that only affect a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? Does your network often experience these gray failures? Are they problematic? Do you care? And can we (network researchers) do anything about it?
>
> Network experiences gray failures all the time, and I almost never
> care, unless a customer does. If there is a network which does not
> experience these, then it's likely due to lack of visibility rather
> than issues not existing.
>

I think that some of it depends on the type of failure -- for example,
some devices hash packets across an internal switch fabric, and so the
failure manifests as persistent issues to a specific 5-tuple (or
between a pair of 5-tuples). If this affects one in a thousand flows
it is likely more annoying than one in a thousand random packets being
dropped.
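
To make the difference concrete, here is a rough Python sketch (a toy
model, not how any particular device works). It assumes a hypothetical
box that hashes each 5-tuple onto one of 1000 internal paths, one of
which silently drops everything, and compares that with uniform random
0.1% packet loss. The bucket count, CRC32 as the hash, packets per flow,
and all function names are made up for illustration.

```python
import random
import zlib

NUM_BUCKETS = 1000       # assumed number of internal fabric paths (hypothetical)
PACKETS_PER_FLOW = 100   # assumed packets sent by every flow
LOSS_RATE = 0.001        # 0.1% random loss, for comparison


def bucket_for(flow):
    """Map a 5-tuple to an internal path; CRC32 stands in for the real hash."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % NUM_BUCKETS


def make_flows(n, rng):
    """Generate n synthetic (src, dst, sport, dport, proto) 5-tuples."""
    return [
        (
            f"10.0.{rng.randrange(256)}.{rng.randrange(256)}",
            f"192.0.2.{rng.randrange(256)}",
            rng.randrange(1024, 65536),
            443,
            "tcp",
        )
        for _ in range(n)
    ]


def hashed_failure(flows, bad_bucket):
    """Per-flow packet loss when one internal path drops everything."""
    return [
        PACKETS_PER_FLOW if bucket_for(f) == bad_bucket else 0 for f in flows
    ]


def random_failure(flows, rng):
    """Per-flow packet loss when each packet is dropped independently."""
    return [
        sum(rng.random() < LOSS_RATE for _ in range(PACKETS_PER_FLOW))
        for _ in flows
    ]


def summarize(losses):
    """Overall loss ratio, flows that lost anything, flows that lost everything."""
    total = len(losses) * PACKETS_PER_FLOW
    return {
        "packet_loss": sum(losses) / total,
        "flows_touched": sum(1 for lost in losses if lost > 0),
        "flows_dead": sum(1 for lost in losses if lost == PACKETS_PER_FLOW),
    }


rng = random.Random(2021)
flows = make_flows(10_000, rng)
# Assume the faulty path is whichever one the first flow hashes onto.
bad = bucket_for(flows[0])
print("hashed:", summarize(hashed_failure(flows, bad)))
print("random:", summarize(random_failure(flows, rng)))
```

In this model the aggregate loss ratio is of the same order in both
cases, but in the hashed case every affected flow loses all of its
packets and keeps losing them on retry, while in the random case the
loss is spread thinly and a retransmit almost always gets through.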

But, yes, all networks drop some set of packets some percentage of the
time (cue the "SEU caused by cosmic rays" response :-))

W

> Fixing these can take months of working with vendors and attempts to
> remedy will usually cause planned or unplanned outages. So it rarely
> makes sense to try to fix as they usually impact a trivial amount of
> traffic.
>
> Networks also routinely mangle packets in-memory which are not visible
> to FCS check.
>
> --
>   ++ytti

-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra