Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

Warren Kumari warren at
Fri Jul 9 12:51:52 UTC 2021

On Thu, Jul 8, 2021 at 5:04 PM William Herrin <bill at> wrote:
> On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti <saku at> wrote:
> > Network experiences gray failures all the time, and I almost never
> > care, unless a customer does.
> Greetings,
> I would suggest that your customer does care, but as there is no
> simple test to demonstrate gray failures, your customer rarely makes
> it past first tier support to bring the issue to your attention and
> gives up trying. Indeed, name the networks with the worst reputations
> around here and many of them have those reputations because of a
> routine, uncorrected state of gray failure.
> To answer Laurent's question:
> Yes, gray failures are a regular problem. Yes, most of us care. And
> for the most part we don't have particularly good ways to detect and
> isolate the problems, let alone fix them.

Depending on the actual failure mode, and the architecture of the
device itself, one technique is to run test traffic through the
box/path/whatever while twiddling the source and destination ports,
and sometimes the source IP as well.
This sometimes helps find the issue if there is a bad interface in a
LAG, or in a device which sprays packets/cells across an internal
fabric, etc. If you are really lucky you can convince the vendor to
share how they spray/hash (or, at least, you can demonstrate a
deterministic failure and hopefully they can run the hash themselves
and tell you which of the N fabric cards is broken).
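
A rough sketch of the port-twiddling idea above, run against a simulated
device rather than a real network. Everything named here is an assumption
made for illustration: toy_hash is a made-up stand-in (real LAG/fabric
hashes are vendor-specific and usually undisclosed), simulated_probe
replaces whatever actually sends test traffic, and the addresses are
documentation prefixes.

```python
import zlib
from collections import defaultdict

def toy_hash(src_ip, dst_ip, sport, dport, n_members):
    """Hypothetical stand-in for a vendor's LAG/fabric hash (not a real one)."""
    key = f"{src_ip}|{dst_ip}|{sport}|{dport}".encode()
    return zlib.crc32(key) % n_members

def sweep(send_probe, src_ip, dst_ip, sports, dports, repeats=3):
    """Probe every (sport, dport) pair `repeats` times; return {flow: lost_count}."""
    losses = {}
    for sport in sports:
        for dport in dports:
            flow = (src_ip, dst_ip, sport, dport)
            losses[flow] = sum(not send_probe(*flow) for _ in range(repeats))
    return losses

def deterministic_failures(losses, repeats=3):
    """Flows that failed on every attempt: the evidence to hand to a vendor."""
    return sorted(f for f, lost in losses.items() if lost == repeats)

def loss_by_member(losses, n_members, repeats=3, hash_fn=toy_hash):
    """Only usable if the hash is known: loss rate per LAG member / fabric card."""
    sent, lost = defaultdict(int), defaultdict(int)
    for flow, n_lost in losses.items():
        member = hash_fn(*flow, n_members)
        sent[member] += repeats
        lost[member] += n_lost
    return {m: lost[m] / sent[m] for m in sent}

# Simulated device (assumption): member 2 of 4 silently drops everything.
N_MEMBERS, BAD_MEMBER = 4, 2

def simulated_probe(src_ip, dst_ip, sport, dport):
    """Stand-in for real test traffic; True means the probe got through."""
    return toy_hash(src_ip, dst_ip, sport, dport, N_MEMBERS) != BAD_MEMBER

losses = sweep(simulated_probe, "192.0.2.1", "198.51.100.1",
               range(40000, 40064), [33434, 33435])
bad_flows = deterministic_failures(losses)
rates = loss_by_member(losses, N_MEMBERS)
print(len(bad_flows), "of", len(losses), "flows fail on every attempt")
print({m: round(r, 2) for m, r in sorted(rates.items())})
```

Without the vendor's hash you only get as far as deterministic_failures:
a list of 5-tuples that always break while their neighbours work. The
per-member breakdown needs the real hash, which is the "really lucky"
part.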

Hopefully you noticed the number of weasel words in there...


>  When it's not a clean
> failure we really are driven by: customer says blank is broken, often
> followed by grueling manual effort just to duplicate the problem
> within our view.
> Can network researchers do anything about it? Maybe. Because of the
> end to end principle, only the endpoints understand the state of the
> connection and they don't know the difference between capacity and
> error. They mostly process that information locally, sharing only
> limited information with the other endpoint. Which means there's not much
> passing over the wire for the middle to examine and learn that there's
> a problem... and when there is it often takes correlating multiple
> packets to understand that a problem exists which, in the stateless
> middle with asymmetric routing, is not usable. The middle can only
> look at its immediate link stats which, when there's a bug, are
> misleading.
> What would you change to dig us out of this hole?
> Regards,
> Bill Herrin
> --
> William Herrin
> bill at

The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra
