400G forwarding - how does it work?

Saku Ytti saku at ytti.fi
Sat Aug 6 05:37:50 UTC 2022

On Fri, 5 Aug 2022 at 20:31, <ljwobker at gmail.com> wrote:

Hey LJ,

> Disclaimer:  I work for Cisco on a bunch of silicon.  I'm not intimately familiar with any of these devices, but I'm familiar with the high level tradeoffs.  There are also exceptions to almost EVERYTHING I'm about to say, especially once you get into the second- and third-order implementation details.  Your mileage will vary...   ;-)

I expected it might come to this: my question may be too specific to
be answered without violating some NDA.

> If you have a model where one core/block does ALL of the processing, you generally benefit from lower latency, simpler programming, etc.  A major downside is that to do this, all of these cores have to have access to all of the different memories used to forward said packet.  Conversely, if you break up the processing into stages, you can only connect the FIB lookup memory to the cores that are going to be doing the FIB lookup, and only connect the encap memories to the cores/blocks that are doing the encapsulation work.  Those interconnects take up silicon space, which equates to higher cost and power.

That is an interesting answer: the cost of giving cores access to
memory versus the cost of a pipeline of cores that is more complex to
program is a balanced tradeoff. It may apply to the generic question,
but I don't think it applies to my specific one. We can roughly think
of FP as having a similar number of lines as Trio has PPEs. Therefore
a similar number of cores need access to memory, and possibly a higher
number, since more than one core in a line will need memory.
So the question is more: why many less performant cores, where
performance is achieved by pipelining, rather than fewer, more
performant cores, where each core works on a packet to completion,
when the former has a similar number of core lines as the latter has
cores?
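
As a rough illustration of that comparison, here is a toy Python
sketch. Every number in it is hypothetical (the line count, the stage
count, the number of memories); none comes from Trio, FP, or any other
real chip. It only counts cores and core-to-memory links, and it
ignores the width, latency, and silicon cost of each link, which is
where the real tradeoff may well live.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Toy model of a forwarding chip; all fields are illustrative."""
    name: str
    lines: int              # parallel packet paths (single cores or pipelines)
    stages: int             # cores per line; 1 means run-to-completion
    memories_per_core: int  # distinct memories each core is wired to

    def cores(self) -> int:
        # Total cores on the chip.
        return self.lines * self.stages

    def memory_links(self) -> int:
        # Total core-to-memory connections on the chip.
        return self.cores() * self.memories_per_core


# Hypothetical numbers: 96 lines either way, 4 kinds of lookup memory.
run_to_completion = Design("run-to-completion", lines=96, stages=1,
                           memories_per_core=4)
pipelined = Design("pipelined", lines=96, stages=4, memories_per_core=1)

for d in (run_to_completion, pipelined):
    print(f"{d.name}: {d.cores()} cores, {d.memory_links()} memory links")
```

Under these assumed numbers both designs end up with the same count of
core-to-memory links, while the pipelined one has four times as many
cores. The pipeline only wins on link count in this model if it has
noticeably fewer lines than the other design has cores.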

> Packaging two cores on a single device is beneficial in that you only have one physical chip to work with instead of two.  This often simplifies the board designers' job, and is often lower power than two separate chips.  This starts to break down as you get to exceptionally large chips as you bump into the various physical/reticle limitations of how large a chip you can actually build.  With newer packaging technology (2.5D chips, HBM and similar memories, chiplets down the road, etc) this becomes even more complicated, but the answer to "why would you put two XYZs on a package?" is that it's just cheaper and lower power from a system standpoint (and often also from a pure silicon standpoint...)

Thank you for this. It does confirm that the benefits are perhaps not
as revolutionary as the presentation in this thread suggested. The
presentation divided Trio's evolution into 3 phases, and putting
multiple Trios on a package was presented as one of those big
evolutions. Perhaps some other division of generations would have been
more communicative.

> Lots and lots of Smart People Time has gone into different memory designs that attempt to optimize this problem, and it's a major part of the intellectual property of various chip designs.

I choose to read this as 'where a lot of innovation happens, a lot of
mistakes happen'. Hopefully we'll figure out a good answer here soon,
as the answers vendors are ending up with are becoming increasingly
visible compromises in the field. I suspect a large part of this is
that cloudy shops represent, if not disproportionate revenue,
disproportionate focus, and their networks tend to be a lot more static
in config and traffic than access/SP networks. And when you have that
quality, you can make increasingly broad assumptions, assumptions
which don't play as well in SP networks.

