400G forwarding - how does it work?
ljwobker at gmail.com
Sat Aug 6 14:08:35 UTC 2022
I don't think I can add much here to the FP and Trio specific questions, for obvious reasons... but ultimately it comes down to a set of tradeoffs where some of the big concerns are things like "how do I get the forwarding state I need back and forth to the things doing the processing work" -- that's an insane level of oversimplification, as a huge amount of engineering time goes into those choices.
I think the "revolutionary-ness" (to vocabulate a useful word?) of putting multiple cores or whatever onto a single package is somewhat in the eye of the beholder. The vast majority of customers would never know nor care whether a chip on the inside was implemented as two parallel "cores" or whether it was just one bigger "core" that does twice the amount of work in the same time. But to the silicon designer, and to a somewhat lesser extent the people writing the forwarding and associated chip-management code, it's definitely a big big deal. Also, having the ability to put two cores down on a given chip opens the door to eventually doing MORE than two cores, and if you really stretch your brain you get to where you might be able to put down "N" pipelines.
This is the story of integration: back in the day we built systems where everything was forwarded on a single CPU. From a performance standpoint all we cared about was the clock rate and how much work was required to forward a packet. Divide the second number by the first, and you get your answer. In the late 90's we built systems (the 7500 for me) that were distributed, so now we had a bunch of CPUs on linecards running that code. Horizontal scaling -- sort of. In the early 2000's the GSR came along and now we're doing forwarding in hardware, which is an order of magnitude or two faster, but a whole bunch of features are now too complex to do in hardware, so they go over the side and people have to adapt. To the best of my knowledge, TCP intercept has never come back...
For a while, GSR and CRS type systems had linecards where each card had a bunch of chips that together built the forwarding pipeline. You had chips for the L1/L2 interfaces, chips for the packet lookups, chips for the QoS/queueing math, and chips for the fabric interfaces. Over time, we integrated more and more of these things together until you (more or less) had a linecard where everything was done on one or two chips, instead of a half dozen or more. Once we got here, the next step was to build linecards where you actually had multiple independent things doing forwarding -- on the ASR9k we called these "slices". This again multiplies the performance you can get, but now both the software and the operators have to deal with the complexity of having multiple things running code where you used to only have one. Now let's jump into the 2010's, where the silicon integration allows you to put down multiple cores or pipelines on a single chip, and each of these is now (more or less) its own forwarding entity. So now you've got yet ANOTHER layer of abstraction. If I can attempt to draw out the tree, it looks like this now:
1) you have a chassis or a system, which has a bunch of linecards.
2) each of those linecards has a bunch of NPUs/ASICs
3) each of those NPUs has a bunch of cores/pipelines
And all of this stuff has to be managed and tracked by the software. If I've got a system with 16 linecards, and each of those has 4 NPUs, and each of THOSE has 4 cores - I've got over *two hundred and fifty* separate things forwarding packets at the same time. Now a lot of the info they're using is common (the FIB is probably the same for all these entities...) but some of it is NOT. There's no value in wasting memory for the encapsulation data to host XXX if I know that none of the ports on my given NPU/core are going to talk to that host, right? So - figuring out how to manage the *state locality* becomes super important. And yes, this code breaks like all code, but no one has figured out any better way to scale up the performance. If you have a brilliant idea here that will get me the performance of 250+ things running in parallel but the simplicity of it looking and acting like a single thing to the rest of the world, please find an angel investor and we'll get phenomenally rich together.
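As a rough sketch of the tree and the state-locality point above, here is a small Python model. The function names, the host names, and the idea of mapping each host to a single owning core are assumptions made purely for illustration; the 16/4/4 counts are just the example numbers from the paragraph above.

```python
from itertools import product

# Example dimensions from the text; real systems vary.
LINECARDS, NPUS_PER_CARD, CORES_PER_NPU = 16, 4, 4


def forwarding_entities(linecards=LINECARDS, npus=NPUS_PER_CARD,
                        cores=CORES_PER_NPU):
    """Enumerate every (linecard, npu, core) tuple that forwards packets."""
    return list(product(range(linecards), range(npus), range(cores)))


def encap_state_for(core, adjacencies):
    """Return only the encap entries this core actually needs.

    `adjacencies` is a hypothetical map from host name to the
    (linecard, npu, core) whose ports talk to that host.
    """
    return {host: owner for host, owner in adjacencies.items()
            if owner == core}


entities = forwarding_entities()
print(len(entities))  # 16 * 4 * 4 = 256

# Hypothetical hosts, each reachable through exactly one core's ports.
adjacencies = {
    "host-a": (0, 0, 0),
    "host-b": (0, 0, 1),
    "host-c": (15, 3, 3),
}
print(encap_state_for((0, 0, 0), adjacencies))
```

In this toy model the FIB would be replicated to all 256 entities, while each core only carries the encap entries for hosts behind its own ports.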
From: Saku Ytti <saku at ytti.fi>
Sent: Saturday, August 6, 2022 1:38 AM
To: ljwobker at gmail.com
Cc: Jeff Tantsura <jefftant.ietf at gmail.com>; NANOG <nanog at nanog.org>; Jeff Doyle <jdoyle at juniper.net>
Subject: Re: 400G forwarding - how does it work?
On Fri, 5 Aug 2022 at 20:31, <ljwobker at gmail.com> wrote:
> Disclaimer: I work for Cisco on a bunch of silicon. I'm not intimately familiar with any of these devices, but I'm familiar with the high level tradeoffs. There are also exceptions to almost EVERYTHING I'm about to say, especially once you get into the second- and third-order implementation details. Your mileage will vary... ;-)
I expected it might come to this; my question may be too specific to be answered without violating some NDA.
> If you have a model where one core/block does ALL of the processing, you generally benefit from lower latency, simpler programming, etc. A major downside is that to do this, all of these cores have to have access to all of the different memories used to forward said packet. Conversely, if you break up the processing into stages, you can only connect the FIB lookup memory to the cores that are going to be doing the FIB lookup, and only connect the encap memories to the cores/blocks that are doing the encapsulation work. Those interconnects take up silicon space, which equates to higher cost and power.
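To put toy numbers on the interconnect tradeoff quoted above, here is a minimal Python sketch. It assumes one core-to-memory link per core per memory and an even split of cores between two stages; both the link model and the core counts are hypothetical, not taken from any real device.

```python
def run_to_completion_links(num_cores, memories):
    """Every core handles the whole packet, so every core reaches every memory."""
    return num_cores * len(memories)


def pipelined_links(cores_per_stage):
    """Each stage's cores connect only to the one memory that stage uses.

    `cores_per_stage` maps a memory name to the number of cores in that stage.
    """
    return sum(cores_per_stage.values())


memories = ["fib", "encap"]
rtc = run_to_completion_links(16, memories)
pipe = pipelined_links({"fib": 8, "encap": 8})
print(rtc, pipe)  # 32 links versus 16 links for the same 16 cores
```

The gap widens as more distinct memories are added, which is the silicon area and power cost the quoted paragraph refers to.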
That's an interesting answer: the cost of giving cores access to memory versus having a pipeline of cores that is more complex to program is a balanced tradeoff. But I don't think it applies to my specific question, though it may apply to the generic one. We can roughly think of FP as having a similar number of lines as Trio has PPEs; therefore a similar number of cores need access to memory, and possibly a higher number, as more than one core in a line will need memory access.
So the question is more: why use a lot of less performant cores, where performance is achieved by building a pipeline, rather than fewer, more performant cores, where each core works on a packet to completion, when the former has a similar number of core lines as the latter has cores?
> Packaging two cores on a single device is beneficial in that you only
> have one physical chip to work with instead of two. This often
> simplifies the board designers' job, and is often lower power than two
> separate chips. This starts to break down as you get to exceptionally
> large chips as you bump into the various physical/reticle limitations
> of how large a chip you can actually build. With newer packaging
> technology (2.5D chips, HBM and similar memories, chiplets down the
> road, etc) this becomes even more complicated, but the answer to "why
> would you put two XYZs on a package?" is that it's just cheaper and
> lower power from a system standpoint (and often also from a pure
> silicon standpoint...)
Thank you for this. It does confirm that the benefits perhaps aren't as revolutionary as the presentation discussed in this thread proposed. The presentation divided Trio's evolution into 3 phases, and putting multiple Trios on a package was presented as one of those big evolutions; perhaps some other division of generations would have been more communicative.
> Lots and lots of Smart People Time has gone into different memory designs that attempt to optimize this problem, and it's a major part of the intellectual property of various chip designs.
I choose to read this as 'where a lot of innovation happens, a lot of mistakes happen'. Hopefully we'll figure out a good answer here soon, as the answers vendors are ending up with are becoming increasingly visible compromises in the field. I suspect a large part of this is that cloudy shops represent, if not disproportionate revenue, disproportionate focus and their networks tend to be a lot more static in config and traffic than access/SP networks. And when you have that quality, you can make increasingly broad assumptions, assumptions which don't play as well in SP networks.