400G forwarding - how does it work?

Tue Jul 26 20:11:52 UTC 2022

As Lincoln said - all of us directly working with BCM/other silicon vendors have signed numerous NDAs.
However if you ask a well crafted question - there’s always a way to talk about it ;-)

In general, if we look at the whole spectrum, on one side there’re massively parallelized “many core” RTC ASICs, such as Trio, Lightspeed, and similar (as the last gasp of Redback/Ericsson venture - we have built 1400 HW threads ASIC (Spider).
On another side of the spectrum - fixed pipeline ASICs, from BCM Tomahawk at its extreme (max speed/radix - min features) moving with BCM Trident, Innovium, Barefoot(quite different animal wrt programmability), etc - usually shallow on chip buffer only (100-200M).

In between we have got so called programmable pipeline silicon, BCM DNX and Juniper Express are in this category, usually a combo of OCB + off chip memory (most often HBM), (2-6G), usually have line-rate/high scale security/overlay encap/decap capabilities. Usually have highly optimized RTC blocks within a pipeline (RTC within macro). The way and speed to access DBs, memories is evolving with each generation, number/speed of non networking cores(usually ARM)  keeps growing - OAM, INT, local optimizations are primary users of it.

Cheers,
Jeff

> On Jul 25, 2022, at 15:59, Lincoln Dale <ltd at interlink.com.au> wrote:
> 
> 
>> On Mon, Jul 25, 2022 at 11:58 AM James Bensley <jwbensley+nanog at gmail.com> wrote:
> 
>> On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwobker at gmail.com> wrote:
>> > This is the parallelism part.  I can take multiple instances of these memory/logic pipelines, and run them in parallel to increase the throughput.
>> ...
>> > I work on/with a chip that can forwarding about 10B packets per second… so if we go back to the order-of-magnitude number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something like a hundred BILLION total memory lookups… and since memory does NOT give me answers in 1 picoseconds… we get back to pipelining and parallelism.
>> 
>> What level of parallelism is required to forward 10Bpps? Or 2Bpps like
>> my J2 example :)
> 
> I suspect many folks know the exact answer for J2, but it's likely under NDA to talk about said specific answer for a given thing.
> 
> Without being platform or device-specific, the core clock rate of many network devices is often in a "goldilocks" zone of (today) 1 to 1.5GHz with a goal of 1 packet forwarded 'per-clock'. As LJ described the pipeline that doesn't mean a latency of 1 clock ingress-to-egress but rather that every clock there is a forwarding decision from one 'pipeline', and the MPPS/BPPS packet rate is achieved by having enough pipelines in parallel to achieve that.
> The number here is often "1" or "0.5" so you can work the number backwards. (e.g. it emits a packet every clock, or every 2nd clock).
> 
> It's possible to build an ASIC/NPU to run a faster clock rate, but gets back to what I'm hand-waving describing as "goldilocks". Look up power vs frequency and you'll see its non-linear.
> Just as CPUs can scale by adding more cores (vs increasing frequency), ~same holds true on network silicon, and you can go wider, multiple pipelines. But its not 10K parallel slices, there's some parallel parts, but there are multiple 'stages' on each doing different things.
> 
> Using your CPU comparison, there are some analogies here that do work:
>  - you have multiple cpu cores that can do things in parallel -- analogous to pipelines
>  - they often share some common I/O (e.g. CPUs have PCIe, maybe sharing some DRAM or LLC)  -- maybe some lookup engines, or centralized buffer/memory
>  - most modern CPUs are out-of-order execution, where under-the-covers, a cache-miss or DRAM fetch has a disproportionate hit on performance, so its hidden away from you as much as possible by speculative execution out-of-order
>     -- no direct analogy to this one - it's unlikely most forwarding pipelines do speculative execution like a general purpose CPU does - but they definitely do 'other work' while waiting for a lookup to happen
> 
> A common-garden x86 is unlikely to achieve such a rate for a few different reasons:
>  - packets-in or packets-out go via DRAM then you need sufficient DRAM (page opens/sec, DRAM bandwidth) to sustain at least one write and one read per packet. Look closer at DRAM and see its speed, Pay attention to page opens/sec, and what that consumes.
>  - one 'trick' is to not DMA packets to DRAM but instead have it go into SRAM of some form - e.g. Intel DDIO, ARM Cache Stashing, which at least potentially saves you that DRAM write+read per packet
>   - ... but then do e.g. a LPM lookup, and best case that is back to a memory access/packet. Maybe it's in L1/L2/L3 cache, but likely at large table sizes it isn't.
>  - ... do more things to the packet (urpf lookups, counters) and it's yet more lookups.
> 
> Software can achieve high rates, but note that a typical ASIC/NPU does on the order of >100 separate lookups per packet, and 100 counter updates per packet.
> Just as forwarding in a ASIC or NPU is a series of tradeoffs, forwarding in software on generic CPUs is also a series of tradeoffs.
> 
> 
> cheers,
> 
> lincoln.
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.nanog.org/pipermail/nanog/attachments/20220726/86be1e09/attachment.html>