400G forwarding - how does it work?

Tom Beecher beecher at beecher.cc
Mon Jul 25 23:49:19 UTC 2022


>
> It wasn't a CPU analysis because switching ASICs != CPUs.
>
> I am aware of the x86 architecture, but know little of network ASICs,
> so I was deliberately trying not to apply my x86 knowledge here, in
> case it sent me down the wrong path. You made references to typical
> CPU features;
>

A CPU is a 'jack of all trades, master of none'; an ASIC is a 'master of
one specific thing'.

If a given feature or design paradigm found in a CPU fits with the use case
the ASIC is being designed for, there's no reason it cannot be used.


On Mon, Jul 25, 2022 at 2:52 PM James Bensley <jwbensley+nanog at gmail.com>
wrote:

> Thanks for the responses, Chris, Saku…
>
> On Mon, 25 Jul 2022 at 15:17, Chris Adams <cma at cmadams.net> wrote:
> >
> > Once upon a time, James Bensley <jwbensley+nanog at gmail.com> said:
> > > The obvious answer is that it's not magic and my understanding is
> > > fundamentally flawed, so please enlighten me.
> >
> > So I can't answer your specific question, but I just wanted to say
> > that your CPU analysis is simplistic and doesn't really match how
> > CPUs work now.
>
> It wasn't a CPU analysis because switching ASICs != CPUs.
>
> I am aware of the x86 architecture, but know little of network ASICs,
> so I was deliberately trying not to apply my x86 knowledge here, in
> case it sent me down the wrong path. You made references to typical
> CPU features;
>
> > For example, it might take 4 times as long to process the first packet,
> > but as long as the hardware can handle 4 packets in a queue, you'll get
> > a packet result every cycle after that, without dropping anything.  So
> > maybe the first result takes 12 cycles, but then you can keep getting a
> > result every 3 cycles as long as the pipeline is kept full.
>
> Yes, in the x86/x64 CPU world, keeping the instruction cache and data
> cache hot indeed results in optimal performance, and as you say modern
> CPUs use parallel pipelines amongst other techniques like branch
> prediction, SIMD, (N)UMA, and so on. But I would assume (because I
> don't know) that not all of the x86 feature set maps nicely to packet
> processing in ASICs. (VPP uses these techniques on COTS CPUs to
> emulate a fixed pipeline, rather than a run-to-completion model.)
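>
> A toy model of Chris's pipelining point above (the 12-cycle fill and
> 3-cycle interval are his illustrative numbers, not any real ASIC; a
> sketch in Python):
>
>     # First result after FILL cycles; one more result every INTERVAL
>     # cycles after that, as long as the pipeline is kept full.
>     FILL, INTERVAL = 12, 3
>
>     def completion_cycle(n: int) -> int:
>         """Cycle on which the n-th packet's result emerges."""
>         return FILL + (n - 1) * INTERVAL
>
>     for n in (1, 2, 3, 1000):
>         print(n, completion_cycle(n))
>     # Per-packet latency is always 12 cycles, but sustained throughput
>     # approaches one result every 3 cycles (packet 1000 -> cycle 3009).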
>
> You and Saku both suggest that heavy parallelism is the magic sauce;
>
> > Something can be "line rate" but not push the first packet
> > through in the shortest time.
>
> On Mon, 25 Jul 2022 at 15:16, Saku Ytti <saku at ytti.fi> wrote:
> > I.e. say JNPR Trio PPE has many threads, and only one thread is
> > running, rest of the threads are waiting for answers from memory. That
> > is, once we start pushing packets through the device, it takes a long
> > ass time (like single digit microseconds) before we see any packets
> > out. 1000x longer than your calculated single digit nanoseconds.
>
> In principle I accept this idea. But let's try to do the maths, as I'd
> like to properly understand;
>
> The non-drop rate of the J2 is 2Bpps @ 284 bytes == 4.8Tbps, and my
> example scenario was a single J2 chip in a 12x400G device. If each
> port is receiving 400G @ 284 bytes (164,473,684 pps), that's one
> packet every 6.08 nanoseconds coming in. What kind of parallelism is
> required to avoid dropping on ingress?
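>
> (For anyone checking my arithmetic, here is the per-port calculation
> as a Python sketch; the 20B is the usual per-packet preamble + IFG
> overhead on the wire, which is what makes the numbers line up:)
>
>     PORT_BPS = 400e9
>     FRAME = 284       # bytes, the J2 NDR packet size quoted above
>     OVERHEAD = 20     # preamble (8B) + inter-frame gap (12B)
>
>     pps = PORT_BPS / ((FRAME + OVERHEAD) * 8)   # ~164,473,684 pps
>     ns_per_packet = 1e9 / pps                   # ~6.08 ns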
>
> It takes, say, 5 microseconds to process and forward a packet (which
> seems reasonable looking at some Arista data sheets which use J2
> variants), which means we need to be operating on 5,000ns / 6.08ns ==
> 822 packets per port simultaneously, i.e. roughly 9,868 packets being
> processed across all 12 ports simultaneously, to avoid ingress drops
> on all interfaces.
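>
> (That is just Little's law, L = λW: packets in flight equals arrival
> rate times time in the system. As a sketch, reusing the numbers
> above:)
>
>     pps_per_port = 164_473_684   # from the 400G / 284B calc above
>     latency = 5e-6               # assumed 5 us to process a packet
>     PORTS = 12
>
>     in_flight_per_port = pps_per_port * latency      # ~822 packets
>     in_flight_per_chip = in_flight_per_port * PORTS  # ~9,868 packets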
>
> I think the latest generation Trio has 160 PPEs per PFE, but I'm not
> sure how many threads per PPE. Older generations had 20
> threads/contexts per PPE, so if that hasn't increased it would make
> for 3,200 threads in total. That is a 1.6Tbps FD chip though, so not
> apples to apples of course, and Trio is run-to-completion too.
>
> The Nokia FP5 has 1,200 cores (I have no idea how many threads per
> core) and is rated for 4.8Tbps FD. Again it is doing something quite
> different to a J2 chip, and again it's RTC.
>
> J2, if I have understood correctly, is a partially fixed but slightly
> programmable pipeline, definitely at the other end of the spectrum
> from RTC. So are we to surmise that a J2 chip has circa 10k parallel
> pipelines, in order to process 9,868 packets in parallel?
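>
> Putting those numbers side by side (a back-of-envelope sketch only;
> it assumes contexts scale linearly with bandwidth, which real designs
> may well not):
>
>     required = 9_868           # in-flight estimate for a J2 @ 4.8Tbps
>     trio_contexts = 160 * 20   # 3,200 per 1.6Tbps PFE, if unchanged
>     trio_scaled = trio_contexts * (4.8 / 1.6)   # ~9,600 at 4.8Tbps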
>
> I have no frame of reference here, but in comparison to Gen 6 Trio or
> FP5, that seems very high to me (to the point where I assume I am
> wrong).
>
> Cheers,
> James.
>