Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

Fri Jul 9 21:16:41 UTC 2021

On Thu, Jul 8, 2021 at 4:03 PM William Herrin <bill at herrin.us> wrote:
>
> On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti <saku at ytti.fi> wrote:
> > Network experiences gray failures all the time, and I almost never
> > care, unless a customer does.
>
> I would suggest that your customer does care, but as there is no
> simple test to demonstrate gray failures, your customer rarely makes
> it past first tier support to bring the issue to your attention and
> gives up trying. Indeed, name the networks with the worst reputations
> around here and many of them have those reputations because of a
> routine, uncorrected state of gray failure.

Networks originating/receiving the traffic tend to have more
incentive to resolve these issues, which might not be so rare.

If you have connection/application level health metrics (e.g. TLS
handshake failures, TCP retransmits), identifying that a problem
exists is not too difficult. Having health metrics associated with
network paths can greatly simplify repro. Then it's mostly
troubleshooting datapath issues on your favorite platform.

It takes quite some effort to figure out which metrics are relevant,
collect them, and present them in a usable way. Something like
"connections from PoP A to destination ASN/prefix (via interface X)
had their TLS handshake failure rate increase from 0.02% to 1%" is a
good starting point for troubleshooting (it may or may not be a
network issue, but the origin/receiver probably wants to fix it
regardless).
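
As a rough sketch of that kind of per-path aggregation (Python; the
record layout, the function names and the thresholds are all made up
for illustration and do not describe any particular production
system), one could group handshake results by (PoP, destination ASN,
interface) and flag the paths whose failure rate jumped against a
baseline window:

```python
from collections import defaultdict


def failure_rates(records):
    """Return {(pop, dest_asn, interface): failure_rate}.

    `records` is an iterable of (pop, dest_asn, interface, handshake_ok)
    tuples. This layout is a hypothetical one chosen for the example.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for pop, dest_asn, interface, handshake_ok in records:
        key = (pop, dest_asn, interface)
        totals[key] += 1
        if not handshake_ok:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


def flag_regressions(baseline, current, min_ratio=10.0, min_rate=0.005):
    """Return [(key, baseline_rate, current_rate)], worst path first.

    A path is flagged when its current failure rate is at least
    `min_rate` and at least `min_ratio` times its baseline rate.
    Both thresholds are arbitrary illustrative values.
    """
    flagged = []
    for key, rate in current.items():
        base = baseline.get(key, 0.0)  # unseen paths count as 0% baseline
        if rate >= min_rate and rate >= base * min_ratio:
            flagged.append((key, base, rate))
    return sorted(flagged, key=lambda item: item[2], reverse=True)
```

Feeding it one window where a path fails 2 of 10,000 handshakes
(0.02%) and a later window where the same path fails 10 of 1,000 (1%)
flags that path and leaves stable paths alone. A real deployment would
presumably also want a minimum sample count per path, so that
low-volume paths do not trigger on a handful of failures.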

Things can get more complicated when traffic crosses network
boundaries with things you don't have visibility into (IX fabric,
remote peering, another network's optical systems, complicated setups
like stateful firewalls / MC-LAG).