Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

Lukas Tribus lukas at ltri.eu
Thu Jul 8 16:21:50 UTC 2021


Hello,

there is a large eyeball ASN in Southern Europe, single-homed to a Tier 1
running under the same corporate umbrella, which for about a decade
suffered from periodic blackholing of specific src/dst tuples. The issue
occurred every 6 to 18 months, completely breaking specific production
traffic *for multiple days* (think dead, mission-critical IPsec VPNs, for
example). It was never acknowledged on the record; some say it was caused
by stalled 100G cards. I believe the HW has been phased out by now, but
this was one of the more infuriating experiences ...

More generally speaking, single-link overloads causing packet loss, or even
full blackholing affecting single links (and therefore, in a load-balanced
environment, specific tuples), are very frustrating to troubleshoot, and
they happen quite a lot in the DFZ. They don't show up on monitoring
systems, and it is difficult to get past first-level support in bigger
networks, because load-balancing decisions and hashing are difficult
concepts for the uninitiated, who will generally refuse to escalate issues
they are unable to reproduce from their own system (WORKSFORME). At some
point I had a router with an entire /24 configured on a loopback, just to
ping destinations from the same device with different source IPs, to
establish whether there was a load-balancing-induced issue with packet
loss, latency, or full blackholing towards a particular destination.
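
For anyone who wants to try the same multi-source ping trick from a Linux
host rather than a router, here is a rough sketch. It is not the setup I
used back then; the function names are made up for illustration, the
prefixes are documentation ranges standing in for real ones, and it assumes
the iputils ping flags (-I, -c, -q) and that the whole prefix is already
configured locally (e.g. on a loopback).

```python
import ipaddress
import subprocess


def source_addresses(prefix, limit=None):
    """Return the usable host addresses of a prefix as strings."""
    hosts = [str(h) for h in ipaddress.ip_network(prefix).hosts()]
    return hosts if limit is None else hosts[:limit]


def ping_command(src, dst, count=5):
    # Assumes Linux iputils ping: -I sets the source address, -c the number
    # of probes, -q prints only the summary. Other platforms differ.
    return ["ping", "-q", "-c", str(count), "-I", src, dst]


def probe(src, dst, count=5):
    """Return True if at least one reply came back for this source address."""
    result = subprocess.run(
        ping_command(src, dst, count),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # iputils ping exits with 0 when at least one reply was received.
    return result.returncode == 0


def classify(results):
    """results maps source address -> True (replies seen) / False (none).

    Only a mix of working and failing sources hints at a per-tuple
    (load-balancing-induced) problem rather than a general outage.
    """
    failed = [src for src, ok in results.items() if not ok]
    if not failed:
        return "ok", failed
    if len(failed) == len(results):
        return "unreachable from all sources", failed
    return "path-dependent", failed
```

Usage would be something like
results = {s: probe(s, "198.51.100.7") for s in source_addresses("192.0.2.0/24")}
followed by classify(results). A "path-dependent" verdict together with the
list of failing source addresses is the kind of evidence that is otherwise
hard to hand to a NOC. It only detects full blackholing per source; packet
loss and latency would need the ping summary parsed as well.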

Tooling (for troubleshooting), monitoring and education are unfortunately
lacking in this regard.


- lukas