400G forwarding - how does it work?

Lincoln Dale ltd at interlink.com.au
Mon Jul 25 22:57:44 UTC 2022

On Mon, Jul 25, 2022 at 11:58 AM James Bensley <jwbensley+nanog at gmail.com>
wrote:

> On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwobker at gmail.com> wrote:
> > This is the parallelism part.  I can take multiple instances of these
> > memory/logic pipelines, and run them in parallel to increase the throughput.
> > ...
> > I work on/with a chip that can forwarding about 10B packets per second…
> > so if we go back to the order-of-magnitude number that I’m doing about
> > “tens” of memory lookups for every one of those packets, we’re talking
> > about something like a hundred BILLION total memory lookups… and since
> > memory does NOT give me answers in 1 picoseconds… we get back to pipelining
> > and parallelism.
>
> What level of parallelism is required to forward 10Bpps? Or 2Bpps like
> my J2 example :)

I suspect many folks know the exact answer for J2, but it's likely under
NDA to talk about said specific answer for a given thing.

Without being platform or device-specific, the core clock rate of many
network devices is often in a "goldilocks" zone of (today) 1 to 1.5GHz with
a goal of 1 packet forwarded 'per-clock'. As LJ described the pipeline, that
doesn't mean a latency of 1 clock ingress-to-egress but rather that every
clock there is a forwarding decision from one 'pipeline', and the MPPS/BPPS
packet rate is achieved by having enough pipelines in parallel to achieve
that. The packets-per-clock number here is often "1" or "0.5" so you can work
the number backwards (e.g. it emits a packet every clock, or every 2nd clock).
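To make "work the number backwards" concrete, here is a rough sketch of the arithmetic. The function name and the clock/packets-per-clock values are my own illustrative assumptions, not figures for J2 or any other real chip.

```python
import math

def pipelines_needed(target_pps, clock_hz, packets_per_clock):
    """Hypothetical helper: how many parallel pipelines are needed to hit
    target_pps if each pipeline emits packets_per_clock packets per clock."""
    per_pipeline_pps = clock_hz * packets_per_clock
    return math.ceil(target_pps / per_pipeline_pps)

# Assumed numbers for illustration only.
# 10 Bpps at 1.25 GHz, one packet per clock per pipeline:
print(pipelines_needed(10_000_000_000, 1_250_000_000, 1))    # 8
# 2 Bpps at 1 GHz, one packet every 2nd clock per pipeline:
print(pipelines_needed(2_000_000_000, 1_000_000_000, 0.5))   # 4
```

The point is only that the pipeline count falls out of target rate divided by (clock x packets-per-clock); real devices have other constraints this ignores.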

It's possible to build an ASIC/NPU to run a faster clock rate, but that gets
back to what I'm hand-wavingly describing as "goldilocks". Look up power vs
frequency and you'll see it's non-linear.
Just as CPUs can scale by adding more cores (vs increasing frequency),
~same holds true on network silicon, and you can go wider, with multiple
pipelines. But it's not 10K parallel slices; there are some parallel parts,
but there are multiple 'stages' on each doing different things.
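A back-of-envelope way to see the non-linearity, using the textbook approximation that dynamic power scales as C * V^2 * f. The assumption that supply voltage has to rise roughly in proportion to frequency is a crude simplification of mine, and the function name is made up; real silicon is messier (leakage, process limits, etc).

```python
def relative_dynamic_power(freq_ratio, voltage_ratio=None):
    """Dynamic power relative to a baseline, using P ~ C * V^2 * f.

    ASSUMPTION: if voltage_ratio is not given, voltage is taken to scale
    linearly with frequency, which gives roughly cubic growth in power.
    """
    if voltage_ratio is None:
        voltage_ratio = freq_ratio
    return (voltage_ratio ** 2) * freq_ratio

# Two ways to double throughput, relative to one pipeline at the base clock:
faster = relative_dynamic_power(2.0)       # one pipeline at 2x clock
wider = 2 * relative_dynamic_power(1.0)    # two pipelines at 1x clock
print(faster, wider)                       # 8.0 2.0
```

Under those assumptions, going wider costs about 2x the power for 2x the packets, while going faster costs about 8x, which is the hand-wavy reason for the "goldilocks" clock.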

Using your CPU comparison, there are some analogies here that do work:
 - you have multiple cpu cores that can do things in parallel -- analogous
to pipelines
 - they often share some common I/O (e.g. CPUs have PCIe, maybe sharing
some DRAM or LLC)  -- maybe some lookup engines, or centralized
 - most modern CPUs are out-of-order execution, where under-the-covers, a
cache-miss or DRAM fetch has a disproportionate hit on performance, so it's
hidden away from you as much as possible by speculative execution
    -- no direct analogy to this one - it's unlikely most forwarding
pipelines do speculative execution like a general purpose CPU does - but
they definitely do 'other work' while waiting for a lookup to happen
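One way to put a number on how much 'other work' has to be in flight is Little's law (items in flight = arrival rate x latency). The sketch below uses LJ's order-of-magnitude figures (10B packets/sec, "tens" of lookups per packet) plus a memory latency of 10 ns that is purely my assumption; the function name is hypothetical.

```python
def lookups_in_flight(pps, lookups_per_packet, lookup_latency_ns):
    """Little's law: average number of lookups outstanding at any instant
    to sustain the given packet rate with the given per-lookup latency."""
    lookups_per_second = pps * lookups_per_packet
    return lookups_per_second * lookup_latency_ns / 1e9

# 10 Bpps, 10 lookups/packet, ASSUMED 10 ns per lookup:
print(lookups_in_flight(10_000_000_000, 10, 10))   # 1000.0
```

So with these assumed numbers roughly a thousand lookups are outstanding at any moment across the device, which is only possible by spreading them over pipeline stages and parallel pipelines rather than waiting on each one.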

A common-garden x86 is unlikely to achieve such a rate for a few different
reasons:
 - if packets-in or packets-out go via DRAM then you need sufficient DRAM
(page opens/sec, DRAM bandwidth) to sustain at least one write and one read
per packet. Look closer at DRAM and see its speed. Pay attention to page
opens/sec, and what that consumes.
 - one 'trick' is to not DMA packets to DRAM but instead have it go into
SRAM of some form - e.g. Intel DDIO, ARM Cache Stashing, which at least
potentially saves you that DRAM write+read per packet
  - ... but then do e.g. an LPM lookup, and best case that is back to a
memory access/packet. Maybe it's in L1/L2/L3 cache, but likely at large
table sizes it isn't.
 - ... do more things to the packet (uRPF lookups, counters) and it's yet
more lookups.
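To see why each extra memory access per packet hurts on a general purpose CPU, here is a deliberately pessimistic model of one core doing dependent memory accesses back-to-back. The 100 ns access latency is an assumed round number, not a measurement, and the function names are hypothetical; real software hides a lot of this with batching, prefetching and multiple outstanding misses.

```python
def per_packet_budget_ns(target_pps):
    """Time available per packet (in ns) on one core at a target rate."""
    return 1e9 / target_pps

def max_pps_serial(accesses_per_packet, access_latency_ns):
    """Upper bound on packets/sec for one core if every memory access is
    fully serialized (no overlap, no prefetch): a worst-case sketch."""
    return 1e9 / (accesses_per_packet * access_latency_ns)

# ASSUMED 100 ns per uncached access:
print(per_packet_budget_ns(10_000_000))   # 100.0 ns per packet at 10 Mpps
print(max_pps_serial(1, 100))             # 10000000.0 (one miss per packet)
print(max_pps_serial(4, 100))             # 2500000.0 (e.g. LPM + uRPF + counters)
```

In this model a single uncached access already eats the whole per-packet budget at 10 Mpps per core, and every additional lookup divides the achievable rate further.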

Software can achieve high rates, but note that a typical ASIC/NPU does on
the order of >100 separate lookups per packet, and 100 counter updates per
packet.
Just as forwarding in an ASIC or NPU is a series of tradeoffs, forwarding in
software on generic CPUs is also a series of tradeoffs.

