interesting troubleshooting

Saku Ytti saku at ytti.fi
Sun Mar 22 09:52:10 UTC 2020


On Sun, 22 Mar 2020 at 09:41, Mark Tinka <mark.tinka at seacom.mu> wrote:

> We weren't as successful (MX480 ingress/egress devices transiting a CRS
> core).

So you're not even talking about multivendor, as both ends are JNPR?
Or are you confusing entropy label with FAT?

Transit doesn't know anything about FAT; FAT is PW-specific and is
only signalled between the end-points. Entropy label applies to all
services and is signalled to the adjacent device. Transit just sees a
deeper label stack (one extra label for FAT, an ELI/EL pair for
entropy label), with the hope (not a promise) that it uses the extra
label(s) for hashing.
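
For illustration, roughly what transit sees in each case (label values
made up, only the structure matters):

    # Hypothetical label stacks as seen by a transit LSR.
    fat_pw_stack = [
        ("LSP label",   299776),  # transport, swapped hop by hop
        ("PW label",        16),  # signalled PE-to-PE
        ("flow label",  524001),  # FAT: per-flow, bottom of stack
    ]
    entropy_stack = [
        ("LSP label",   299776),
        ("ELI",              7),  # entropy label indicator, RFC 6790
        ("EL",          423551),  # per-flow entropy value
        ("service label",   16),  # bottom of stack
    ]
    # Transit hashes whatever stack it sees; whether the extra label(s)
    # actually feed the hash is up to the transit hardware.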

> In the end, we updated our policy to avoid running LAG's in the
> backbone, and going ECMP instead. Even with l2vpn payloads, that spreads
> a lot more evenly.

You really should be doing CW+FAT. And looking at your other email,
dear god, don't do per-packet outside some unique application where
you control the TCP stack :). Modern Windows, Linux and macOS TCP
stacks treat out-of-order delivery as packet loss. This is not
inherent to TCP; if you can change the TCP congestion control, you can
make reordering entirely irrelevant to it. But in most cases we of
course do not control the TCP algorithm, so per-packet will not work
one bit.
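
To make it concrete, a toy sketch of classic dupACK-based loss
detection (assumed here; real stacks are more sophisticated, e.g.
RACK), showing why spraying per-packet over two paths reads as loss:

    def dupacks(arrival_order, expected=0):
        """Duplicate ACKs a cumulative-ACK receiver emits for this order."""
        buffered, dups = set(), 0
        for seq in arrival_order:
            if seq == expected:
                expected += 1
                while expected in buffered:
                    buffered.discard(expected)
                    expected += 1
            else:
                buffered.add(seq)  # out of order: ACK repeats, a dupACK
                dups += 1
        return dups

    print(dupacks([0, 1, 2, 3, 4, 5, 6]))   # in order: 0 dupACKs
    print(dupacks([0, 4, 1, 5, 2, 6, 3]))   # 2-path skew: 3 dupACKs
    # >= 3 dupACKs means fast retransmit and a cwnd cut, i.e. the
    # sender slows down even though nothing was dropped.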

Like the OP, you should enable adaptive. This thread is conflating a
few different balancing issues, so I'll take the opportunity to
classify them.

1. Bad hashing implementation
    1.1 Insufficient amount of hash-results
        Think say 6500/7600: what if you only have 8 hash-results but
7 interfaces? You will inherently have 2x more traffic on one
interface (a worked example below).
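
A worked example of that arithmetic (round-robin bucket striping
assumed):

    # 8 hash-results striped over 7 links: one link owns two buckets.
    from collections import Counter
    print(Counter(bucket % 7 for bucket in range(8)))
    # Counter({0: 2, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1})
    # Link 0 carries 2/8 = 25% of the flows, the rest 12.5% each.
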
    1.2 Bad algorithm
        Different hashes have different use-cases, and we often reach
for a golden hammer (like we tend to use bad hashes for password
hashing, such as SHA, when SHA's goal is to be fast in HW, which is
the opposite of a password hash's goal, as you want that to be slow).
Equally, since day one of ethernet silicon we've had CRC in the
silicon, and it has since been grandfathered in as the load-balancing
hash. But CRC's goals are completely different from a hash
algorithm's: CRC does not try to, and does not need to, have good
diffusion quality, while a load-balancing hash needs perfect diffusion
and nothing else matters. CRC has terrible diffusion, so instead of
implementing a specific good-diffusion hash in silicon, vendors do
stuff like rot(crcN(x), crcM(x)), which greatly improves diffusion but
is still very poor compared to hash algorithms designed for perfect
diffusion. Poor diffusion means different flow counts on the egress
interfaces.
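
To see the difference concretely, a quick sketch comparing zlib's
CRC-32 against a cryptographic hash as a good-diffusion stand-in
(neither is any vendor's actual dataplane algorithm):

    import zlib, hashlib

    def avalanche(fn, msg, bit=0):
        """Output bits changed when one input bit is flipped."""
        flipped = bytearray(msg)
        flipped[bit // 8] ^= 1 << (bit % 8)
        return bin(fn(msg) ^ fn(bytes(flipped))).count("1")

    crc = zlib.crc32
    sha = lambda m: int.from_bytes(hashlib.sha256(m).digest()[:4], "big")

    for msg in (b"\x00" * 8, b"\xff" * 8, b"flow-key"):
        print(avalanche(crc, msg), avalanche(sha, msg))
    # CRC-32 is linear, so the same input-bit flip flips the same
    # output bits for every message: zero avalanche. SHA-256 flips
    # about half of the 32 sampled bits, i.e. good diffusion.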

As I can't do math, I did a monte-carlo simulation to see what type of
bias we should expect even with _perfect_ diffusion:

- Here we have 3 egress interfaces, and we run monte carlo until we
stop getting a worse bias (of course, if we wait for the heat death of
the universe, we will eventually see every flow on a single interface,
even with perfect diffusion). But in a normal situation, if you see
worse bias than this, you should blame the poor diffusion quality of
the vendor's algorithm; if you see this bias or lower, it's probably
not diffusion you should blame:

Flows | MaxBias | Example Flow Count per Int
1k    | 6.9%    | 395, 341, 264
10k   | 2.2%    | 3490, 3396, 3114
100k  | 0.6%    | 33655, 32702, 33643
1M    | 0.2%    | 334969, 332424, 332607
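
Not the original simulation, but a small sketch of the same idea, with
Python's PRNG standing in for a perfect-diffusion hash:

    import random
    from collections import Counter

    def max_bias(flows, links=3, trials=200):
        """Worst observed excess of the busiest link over an even share."""
        worst = 0.0
        for _ in range(trials):
            counts = Counter(random.randrange(links) for _ in range(flows))
            worst = max(worst, max(counts.values()) / flows - 1 / links)
        return worst

    for flows in (1_000, 10_000, 100_000):
        print(f"{flows}: {max_bias(flows):.1%}")
    # Expect numbers in the ballpark of the table above; the bias
    # shrinks roughly as 1/sqrt(flows) no matter how good the hash is.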


2. Elephant flows
    Even if we assume perfect diffusion, so that each egress interface
gets exactly the same number of flows, the flows may still be wildly
different in bps, and there is nothing we can do by tuning the hash
algorithm to fix this. The prudent fix is to have a mapping table
between hash-result and egress interface, so that we can inject bias:
not a fair distribution of hash-results to egress interfaces, but
fewer hash-results pointing at the congested interface (a sketch
follows). This is easy and ~free to implement in HW. JNPR does it, and
NOK is happy to implement it should a customer want it. This of course
also fixes bad algorithmic diffusion, so it's a really great tool to
have in your toolbox, and I think everyone should be running this
feature.
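
A toy model of that indirection (the table size and the rebalance
policy here are made up, not any vendor's implementation):

    # hash(flow) -> bucket -> egress link. Rebalancing edits the
    # bucket->link table, never the hash, so only flows in the moved
    # bucket shift.
    BUCKETS = 256

    def even_table(links):
        return [b % links for b in range(BUCKETS)]

    def rebalance(table, link_bps):
        """Move one bucket from the hottest link to the coldest."""
        hot = max(range(len(link_bps)), key=link_bps.__getitem__)
        cold = min(range(len(link_bps)), key=link_bps.__getitem__)
        table[table.index(hot)] = cold  # one bucket moves, rest stay put
        return table

    table = even_table(3)
    rebalance(table, [9e9, 2e9, 2e9])  # link 0 has the elephant
    # Repeat with fresh byte counters until the shares converge.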


3. Incorrect key recovery
   Balancing is a promise that we know which keys identify a flow. In
the common case this is a simple problem, but there is a lot of
complexity, particularly in MPLS transit. The naive/simple problem
everyone knows about is a pseudowire flow in transit being parsed as
an IPv4/IPv6 flow when the DMAC starts with 4 or 6. Some vendors
(JNPR, Huawei) do additional checks, like perhaps the IP checksum or
the IP packet length, but this actually makes the situation worse: the
problem triggers far less often, but when it does trigger it is much
more exotic, as now you have an underlying frame where, by luck, the
supposed IP packet length is also correct. So you can end up in weird
situations where the end-customer's network works perfectly, then they
implement IPSEC from all hosts to a concentrator, still riding over
your backbone, and suddenly one customer host stops working after
enabling IPSEC while everything else keeps working. The chance that
this trouble ticket ever lands on your table is low, and the
possibility that you'd blame the backbone based on the problem
description is negligible. The customer will just end up renumbering
the host, or replacing its DMAC, or something, and no one will ever
know why it was broken.
So it's crucial not to do payload heuristics in MPLS transit, as it
cannot be done correctly by design. FAT and entropy labels solve this
problem correctly, moving the hash-result generation to the edge,
where you still can do it correctly (a toy illustration below).
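
As a toy illustration of the failure mode (a deliberately naive
guesser, not any vendor's parser):

    def looks_like_ip(payload: bytes) -> bool:
        """Naive transit heuristic: first nibble 4 -> IPv4, 6 -> IPv6."""
        return payload[0] >> 4 in (4, 6)

    # Ethernet PW without control word: the buffer starts at the DMAC.
    frame = bytes.fromhex("4a0011223344") + b"rest-of-frame"
    print(looks_like_ip(frame))       # True: MAC bytes hashed as "IPv4"

    # A control word makes the first nibble 0, so there is nothing to
    # misparse; that is why CW+FAT is the right answer.
    cw = bytes.fromhex("00000000")
    print(looks_like_ip(cw + frame))  # False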




-- 
  ++ytti


