400G forwarding - how does it work?
saku at ytti.fi
Sun Aug 7 08:43:55 UTC 2022
On Sat, 6 Aug 2022 at 17:08, <ljwobker at gmail.com> wrote:
> For a while, GSR and CRS type systems had linecards where each card had a bunch of chips that together built the forwarding pipeline. You had chips for the L1/L2 interfaces, chips for the packet lookups, chips for the QoS/queueing math, and chips for the fabric interfaces. Over time, we integrated more and more of these things together until you (more or less) had a linecard where everything was done on one or two chips, instead of a half dozen or more. Once we got here, the next step was to build linecards where you actually had multiple independent things doing forwarding -- on the ASR9k we called these "slices". This again multiplies the performance you can get, but now both the software and the operators have to deal with the complexity of having multiple things running code where you used to only have one. Now let's jump into the 2010's where the silicon integration allows you to put down multiple cores or pipelines on a single chip, each of these is now (more or less) it's own forwarding entity. So now you've got yet ANOTHER layer of abstraction. If I can attempt to draw out the tree, it looks like this now:
> 1) you have a chassis or a system, which has a bunch of linecards.
> 2) each of those linecards has a bunch of NPUs/ASICs
> 3) each of those NPUs has a bunch of cores/pipelines
Thank you for this. I think we may have some ambiguity here. I'll
ignore multichassis designs for now, as those went out of fashion,
and describe only the 'NPU' case, not the Express/Broadcom style
pipeline.
1) you have a chassis with multiple linecards
2) each linecard has 1 or more forwarding packages
3) each package has 1 or more NPUs (Juniper calls these slices;
unsure if the EZchip vocabulary is the same here)
4) each NPU has 1 or more identical cores (well, I can't really name
any with just 1 core; an NPU, like a GPU, pretty inherently has many,
many cores. And unlike some in this thread, I don't think they are
ever the ARM instruction set -- that makes no sense, you create an
instruction set targeting the application at hand, which the ARM
instruction set is not. But maybe some day we'll have some
forwarding-IA, allowing customers to provide ucode that runs on
multiple targets, though this would reduce the pace of innovation)
Some of those NPU core architectures are flat, like Trio, where a
single core handles the entire packet. Other core architectures, like
FP, are matrices: there are multiple lines, and a packet picks one
line and then traverses each core in that line. (FP has many more
cores per line, compared to the Leaba/Pacific stuff.)
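The flat vs. matrix distinction can be shown with a toy dispatch
sketch. The stage names and the two functions here are entirely made
up for illustration; the only claim is the shape: one core doing all
the work per packet, versus one core per stage along a chosen line.

```python
# Toy contrast of the two core-dispatch models described above.
# "run_to_completion" mimics a Trio-like flat architecture; "pipeline_line"
# mimics an FP-like matrix. Stage names are invented for illustration.

STAGES = ["parse", "lookup", "police", "rewrite", "queue"]

def process_stage(packet, stage):
    # Record which stage ran; stands in for the real forwarding work.
    packet.setdefault("trace", []).append(stage)
    return packet

def run_to_completion(packet, core_id):
    # Flat model: one core executes every stage for this packet.
    for stage in STAGES:
        process_stage(packet, stage)
    packet["cores_used"] = [core_id]
    return packet

def pipeline_line(packet, line_id):
    # Matrix model: the packet picks a line, then a different core in
    # that line handles each stage.
    packet["cores_used"] = []
    for stage_idx, stage in enumerate(STAGES):
        process_stage(packet, stage)
        packet["cores_used"].append((line_id, stage_idx))
    return packet

p1 = run_to_completion({}, core_id=7)
p2 = pipeline_line({}, line_id=3)
print(p1["trace"] == p2["trace"])  # same work either way: True
print(len(p1["cores_used"]), len(p2["cores_used"]))  # 1 5
```

Either way the packet sees the same stages; what differs is how many
cores it touches, which is exactly what makes the matrix style harder
to reason about when you scale the line length.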