400G forwarding - how does it work?

ljwobker at gmail.com
Fri Aug 5 17:31:26 UTC 2022


Disclaimer:  I work for Cisco on a bunch of silicon.  I'm not intimately familiar with any of these devices, but I'm familiar with the high level tradeoffs.  There are also exceptions to almost EVERYTHING I'm about to say, especially once you get into the second- and third-order implementation details.  Your mileage will vary...   ;-)

If you have a model where one core/block does ALL of the processing, you generally benefit from lower latency, simpler programming, etc.  A major downside is that to do this, all of those cores have to have access to all of the different memories used to forward said packet.  Conversely, if you break the processing up into stages, you only need to connect the FIB lookup memory to the cores that are doing the FIB lookup, and only connect the encap memories to the cores/blocks that are doing the encapsulation work.  Those all-to-all interconnects take up silicon space, which equates to higher cost and power.  
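
To make that interconnect tradeoff concrete, here's a toy sketch.  The memory names, stage split, and counts are invented purely for illustration, not taken from any real NPU:

# Toy model of the interconnect tradeoff (memory names and stage split are
# invented for illustration, not any real NPU).

MEMORIES = ["fib", "encap", "acl", "counters"]

# Run-to-completion: any core may execute any part of the program, so every
# core needs a path to every memory: an N_cores x N_memories interconnect.
def run_to_completion_paths(n_cores):
    return n_cores * len(MEMORIES)

# Staged pipeline: each stage only runs part of the program, so it only needs
# paths to the memories that stage actually touches.
STAGE_MEMORIES = {
    "parse":   [],
    "lookup":  ["fib"],
    "rewrite": ["encap"],
    "qos":     ["acl", "counters"],
}
def staged_paths():
    return sum(len(mems) for mems in STAGE_MEMORIES.values())

print(run_to_completion_paths(n_cores=4))  # 16 memory access paths
print(staged_paths())                      # 4 paths: fewer wires, less area/power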

Packaging two cores on a single device is beneficial in that you only have one physical chip to work with instead of two.  This often simplifies the board designers' job, and is often lower power than two separate chips.  This starts to break down with exceptionally large chips, where you bump into the physical/reticle limitations of how large a die you can actually build.  With newer packaging technology (2.5D chips, HBM and similar memories, chiplets down the road, etc.) this becomes even more complicated, but the answer to "why would you put two XYZs on a package?" is that it's just cheaper and lower power from a system standpoint (and often also from a pure silicon standpoint...)

Buffer designs are *really* hard in modern high speed chips, and there are always lots and lots of tradeoffs.  The "ideal" answer is an extremely large block of memory that ALL of the forwarding/queueing elements have fair/equal access to... but this physically looks more or less like a full mesh between the memory/buffering subsystem and all the forwarding engines, which becomes really unwieldy (expensive!) from a design standpoint.  The amount of memory you can practically put on the main NPU die is on the order of 20-200 **mega** bytes, whereas a single stack of HBM memory comes in at 4GB -- roughly 100x the size.  Figuring out which side of this gigantic gulf you want to live on is a super important part of the basic architecture, and it drives lots of other decisions down the line.

Once you've decided how much buffering memory you're willing/able to put down, the next challenge is coming up with ways to provide access to that memory from all the different potential clients.  It's a LOT easier to wire up/design a chip where you have four separate pipelines/cores/whatever and each one of them accesses 1/4 of the buffer memory... but that also means that any given port only has access to 1/4 of the memory for burst absorption.  Lots and lots of Smart People Time has gone into different memory designs that attempt to optimize this problem, and it's a major part of the intellectual property of various chip designs.  
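
A quick back-of-the-envelope illustration of why that 1/4 partitioning matters for burst absorption.  The buffer sizes and port speed below are assumed round numbers for the sake of the arithmetic, not any particular chip:

# Back-of-the-envelope burst absorption (assumed round numbers, not a
# specific chip): how long a buffer can soak up a line-rate burst.

def burst_absorption_us(buffer_bytes, drain_rate_bps):
    return buffer_bytes * 8 / drain_rate_bps * 1e6   # microseconds

ON_DIE_BYTES = 64e6    # 64 MB on-die buffer (inside the 20-200 MB range above)
HBM_BYTES    = 4e9     # one 4 GB HBM stack
PORT_BPS     = 400e9   # a single 400G port bursting at line rate

print(burst_absorption_us(ON_DIE_BYTES, PORT_BPS))      # ~1280 us, shared on-die
print(burst_absorption_us(HBM_BYTES, PORT_BPS))         # ~80000 us, shared HBM
print(burst_absorption_us(ON_DIE_BYTES / 4, PORT_BPS))  # ~320 us if a port only
                                                        # sees 1/4 of the on-die buffer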

--lj

-----Original Message-----
From: NANOG <nanog-bounces+ljwobker=gmail.com at nanog.org> On Behalf Of Saku Ytti
Sent: Friday, August 5, 2022 3:16 AM
To: Jeff Tantsura <jefftant.ietf at gmail.com>
Cc: NANOG <nanog at nanog.org>; Jeff Doyle <jdoyle at juniper.net>
Subject: Re: 400G forwarding - how does it work?

Thank you for this.

I wish there had been a deeper dive into the lookup side. My open questions:

a) The Trio model, where a packet stays in a single PPE until done, vs. the FP model of a line of identical PPE cores. The advantages of the Trio model are clear to me, but I don't understand the advantages of the FP model. Obviously the FP model has to have some advantages; what are they?

b) What exactly are the gains of putting two Trios on-package in Trio6? There is no local switching between the WAN sides of the Trios in a package; as far as I can tell they are ships in the night, and packets between the Trios go via fabric, just as they would with separate Trios. I can understand the benefit of putting the Trio and HBM2 on the same package, to reduce distance so that wattage goes down or frequency goes up.

c) What evolution are they considering for the shallow ingress buffers in Trio6? The collateral-damage potential is significant, because the WAN port that asks for the most gets the most, instead of each getting its fair share; a potentially arbitrarily low-rate WAN ingress might therefore not get access to the ingress buffer at all, causing drops. Would it be practical, in terms of wattage/area, to add some sort of pre-QoS stage in front of the shallow ingress buffer, so that each WAN ingress has a fair guaranteed rate into the shallow buffer?
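
One way to picture the pre-QoS idea in (c) is a per-port token bucket that reserves each WAN ingress a guaranteed fill rate into the shallow buffer.  This is purely a hypothetical sketch of such a scheme, not anything Juniper has described:

import time

# Hypothetical pre-QoS admission stage: each WAN ingress gets a token bucket
# sized to its guaranteed share of the shallow buffer's fill rate.  Traffic
# within the guarantee is always admitted; anything beyond it has to contend
# for whatever shared headroom is left.
class IngressAdmission:
    def __init__(self, ports, guaranteed_bps, burst_bytes):
        self.rate = guaranteed_bps / 8.0                  # bytes/sec per port
        self.burst = float(burst_bytes)
        self.tokens = {p: float(burst_bytes) for p in ports}
        self.last = {p: time.monotonic() for p in ports}

    def admit(self, port, pkt_len):
        now = time.monotonic()
        self.tokens[port] = min(self.burst,
                                self.tokens[port] + (now - self.last[port]) * self.rate)
        self.last[port] = now
        if self.tokens[port] >= pkt_len:                  # within guaranteed share
            self.tokens[port] -= pkt_len
            return True
        return False                                      # overflow: contend or drop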

On Fri, 5 Aug 2022 at 02:18, Jeff Tantsura <jefftant.ietf at gmail.com> wrote:
>
> Apologies for garbage/HTMLed email, not sure what happened (thanks 
> Brian F for letting me know).
> Anyway, the podcast with Juniper (mostly around Trio/Express) was
> broadcast today and is available at
> https://www.youtube.com/watch?v=1he8GjDBq9g
> Next in the pipeline are:
> Cisco SiliconOne
> Broadcom DNX (Jericho/Qumran/Ramon)
> For both, the guests are the main architects of the silicon
>
> Enjoy
>
>
> On Wed, Aug 3, 2022 at 5:06 PM Jeff Tantsura <jefftant.ietf at gmail.com> wrote:
> >
> > Hey,
> >
> >
> >
> > This is not an advertisement but an attempt to help folks better understand networking HW.
> >
> >
> >
> > Some of you might know (and love 😊) the “between 0x2 nerds” podcast that Jeff Doyle and I have been hosting for a couple of years.
> >
> >
> >
> > Following up on that discussion, we have decided to dedicate a number of upcoming podcasts to networking HW, a topic where more information and better education is very much needed (no, you won’t have to sign an NDA before joining 😊).  We have lined up a number of great guests, people who design and build ASICs and can talk firsthand about the evolution of networking HW, the complexity of the process, the differences between fixed and programmable pipelines, and memories and databases.  This Thursday (08/04) at 11:00 PST we are joined by the one and only Sharada Yeluri, Sr. Director ASIC at Juniper.  Other vendors will be joining in later episodes; the usual rules apply: no marketing, no BS.
> >
> > More to come, stay tuned.
> >
> > Live feed: https://lnkd.in/gk2x2ezZ
> >
> > Between 0x2 nerds playlist, videos will be published to:
> > https://www.youtube.com/playlist?list=PLMYH1xDLIabuZCr1Yeoo39enogPA2yJB7
> >
> >
> >
> > Cheers,
> >
> > Jeff
> >
> >
> >
> > From: James Bensley
> > Sent: Wednesday, July 27, 2022 12:53 PM
> > To: Lawrence Wobker; NANOG
> > Subject: Re: 400G forwarding - how does it work?
> >
> >
> >
> > On Tue, 26 Jul 2022 at 21:39, Lawrence Wobker <ljwobker at gmail.com> wrote:
> >
> > > So if this pipeline can do 1.25 billion PPS and I want to be able to forward 10BPPS, I can build a chip that has 8 of these pipelines and get my performance target that way.  I could also build a "pipeline" that processes multiple packets per clock, if I have one that does 2 packets/clock then I only need 4 of said pipelines... and so on and so forth.
> >
> >
> >
> > Thanks for the response Lawrence.
> >
> >
> >
> > The Broadcom BCM16K KBP has a clock speed of 1.2Ghz, so I expect the
> > J2 to have something similar (as someone already mentioned, most chips
> > I've seen are in the 1-1.5Ghz range), so in this case "only" 2
> > pipelines would be needed to maintain the headline 2Bpps rate of the
> > J2, or even just 1 if they have managed to squeeze out two packets per
> > cycle through parallelisation within the pipeline.
> >
> >
> >
> > Cheers,
> >
> > James.
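
As a quick sanity check of the pipeline arithmetic in the quoted exchange above (the clock rates and packet rates are the figures mentioned in the thread, treated as assumptions rather than confirmed specs):

import math

# Pipelines needed to hit a target packet rate, given clock speed and
# packets processed per clock.
def pipelines_needed(target_pps, clock_hz, packets_per_clock=1):
    return math.ceil(target_pps / (clock_hz * packets_per_clock))

print(pipelines_needed(10e9, 1.25e9))     # 8 pipelines at 1 pkt/clock
print(pipelines_needed(10e9, 1.25e9, 2))  # 4 pipelines at 2 pkts/clock
print(pipelines_needed(2e9, 1.2e9))       # 2 pipelines for ~2 Bpps at 1.2 GHz
print(pipelines_needed(2e9, 1.2e9, 2))    # 1 pipeline at 2 pkts/clock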
> >
> >



--
  ++ytti


