400G forwarding - how does it work?

Mon Jul 25 18:51:07 UTC 2022

Thanks for the responses Chris, Saku…

On Mon, 25 Jul 2022 at 15:17, Chris Adams <cma at cmadams.net> wrote:
>
> Once upon a time, James Bensley <jwbensley+nanog at gmail.com> said:
> > The obvious answer is that it's not magic and my understanding is
> > fundamentally flawed, so please enlighten me.
>
> So I can't answer to your specific question, but I just wanted to say
> that your CPU analysis is simplistic and doesn't really match how CPUs
> work now.

It wasn't a CPU analysis because switching ASICs != CPUs.

I am aware of the x86 architecture, but know little of network ASICs,
so I was deliberately trying to not apply my x86 knowledge here, in
case it sent me down the wrong path. You made reference to typical
CPU features:

> For example, it might take 4 times as long to process the first packet,
> but as long as the hardware can handle 4 packets in a queue, you'll get
> a packet result every cycle after that, without dropping anything.  So
> maybe the first result takes 12 cycles, but then you can keep getting a
> result every 3 cycles as long as the pipeline is kept full.

Yes, in the x86/x64 CPU world keeping the instruction cache and data
cache hot indeed results in optimal performance, and as you say modern
CPUs use parallel pipelines amongst other techniques like branch
prediction, SIMD, (N)UMA, and so on, but I would assume (because I
don’t know) that not all of the x86 feature set maps nicely to packet
processing in ASICs (VPP uses these techniques on COTS CPUs to
emulate a fixed pipeline, rather than a run-to-completion model).

You and Saku both suggest that heavy parallelism is the magic sauce:

> Something can be "line rate" but not push the first packet
> through in the shortest time.

On Mon, 25 Jul 2022 at 15:16, Saku Ytti <saku at ytti.fi> wrote:
> I.e. say JNPR Trio PPE has many threads, and only one thread is
> running, rest of the threads are waiting for answers from memory. That
> is, once we start pushing packets through the device, it takes a long
> ass time (like single digit microseconds) before we see any packets
> out. 1000x longer than your calculated single digit nanoseconds.

In principle I accept this idea. But let's try and do the maths, I'd
like to properly understand:

The non-drop rate of the J2 is 2Bpps @ 284 bytes == 4.8Tbps, my
example scenario was a single J2 chip in a 12x400G device. If each
port is receiving 400G @ 284 bytes (164,473,684 pps), that’s one
packet every 6.08 nanoseconds coming in on each port. What kind of
parallelism is required to avoid dropping on ingress?
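
As a sanity check on those numbers, here is a minimal Python sketch. It
assumes the 164,473,684 pps figure includes 20 bytes of per-frame
Ethernet overhead (preamble, SFD and inter-frame gap), which is my
assumption because it reproduces the figure exactly; the function and
constant names are only illustrative.

```python
# Assumption: 7B preamble + 1B SFD + 12B inter-frame gap per frame on the wire.
WIRE_OVERHEAD_BYTES = 20


def packets_per_second(line_rate_bps, frame_bytes, overhead_bytes=WIRE_OVERHEAD_BYTES):
    """Packets per second at line rate for a given frame size."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return line_rate_bps / bits_on_wire


pps = packets_per_second(400e9, 284)   # one 400G port, 284 byte frames
interarrival_ns = 1e9 / pps            # time between packet arrivals

print(f"{pps:,.0f} pps, one packet every {interarrival_ns:.2f} ns")
```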

Say it takes 5 microseconds to process and forward a packet (seems
reasonable looking at some Arista data sheets which use J2 variants).
That means we need to be operating on 5,000ns / 6.08ns == 822 packets
per port simultaneously, so 9868 packets are being processed across
all 12 ports simultaneously, to avoid ingress drops on all
interfaces.
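
The same calculation as a small Python sketch: packets in flight equal
the forwarding latency divided by the packet inter-arrival time. The
5 microsecond latency is the guess from above, not a measured value,
and the function name is only illustrative.

```python
def packets_in_flight(latency_ns, interarrival_ns):
    """Packets that must be in progress at once to keep up with arrivals."""
    return latency_ns / interarrival_ns


LATENCY_NS = 5000.0      # assumed forwarding latency (5 microseconds)
INTERARRIVAL_NS = 6.08   # 400G @ 284 byte frames, per port
PORTS = 12

per_port = packets_in_flight(LATENCY_NS, INTERARRIVAL_NS)
total = per_port * PORTS

print(f"{per_port:.0f} per port, {total:.0f} across {PORTS} ports")
```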

I think the latest generation Trio has 160 PPEs per PFE, but I’m not
sure how many threads per PPE. Older generations had 20
threads/contexts per PPE, so if it hasn’t increased that would make
for 3200 threads in total. That is a 1.6Tbps FD chip, although it's
not apples to apples of course, as Trio is run to completion.
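
To put the two figures side by side, here is a rough Python sketch
that normalises packet contexts per Tbps. It assumes 20 threads per
PPE still holds for the latest Trio and that the 9868 figure (which
depends on my 5 microsecond guess) is right for J2, so treat it as a
back-of-the-envelope comparison only.

```python
def contexts_per_tbps(contexts, capacity_tbps):
    """Concurrent packet contexts per Tbps of forwarding capacity."""
    return contexts / capacity_tbps


# Assumption: 160 PPEs x 20 threads each (thread count not confirmed).
trio_threads = 160 * 20
trio_ratio = contexts_per_tbps(trio_threads, 1.6)

# Assumption: 9868 packets in flight on a 4.8Tbps J2 at 5us latency.
j2_ratio = contexts_per_tbps(9868, 4.8)

print(f"Trio: {trio_threads} threads, {trio_ratio:.0f} per Tbps")
print(f"J2:   9868 in flight, {j2_ratio:.0f} per Tbps")
```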

The Nokia FP5 has 1,200 cores (I have no idea how many threads per
core) and is rated for 4.8Tbps FD. Again it is doing something quite
different to a J2 chip, and again it's RTC.

J2 is a partially-fixed pipeline but slightly programmable if I have
understood correctly, but definitely at the other end of the spectrum
compared to RTC. So are we to surmise that a J2 chip has circa 10k
parallel pipelines, in order to process 9868 packets in parallel?

I have no frame of reference here, but in comparison to Gen 6 Trio or
FP5, that seems very high to me (to the point where I assume I am
wrong).

Cheers,
James.