<div dir="ltr"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">It wasn't a CPU analysis because switching ASICs != CPUs.<br><br>I am aware of the x86 architecture, but know little of network ASICs,<br>so I was deliberately trying to not apply my x86 knowledge here, in<br>case it sent me down the wrong path. You made references towards<br>typical CPU features;<br></blockquote><div><br></div><div>A CPU is 'jack of all trades, master of none'. An ASIC is 'master of one specific thing'. </div><div><br></div><div>If a given feature or design paradigm found in a CPU fits with the use case the ASIC is being designed for, there's no reason it cannot be used. </div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jul 25, 2022 at 2:52 PM James Bensley <<a href="mailto:jwbensley%2Bnanog@gmail.com" target="_blank">jwbensley+nanog@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Thanks for the responses Chris, Saku…<br>

<br>

On Mon, 25 Jul 2022 at 15:17, Chris Adams <<a href="mailto:cma@cmadams.net" target="_blank">cma@cmadams.net</a>> wrote:<br>

><br>

> Once upon a time, James Bensley <<a href="mailto:jwbensley%2Bnanog@gmail.com" target="_blank">jwbensley+nanog@gmail.com</a>> said:<br>

> > The obvious answer is that it's not magic and my understanding is<br>

> > fundamentally flawed, so please enlighten me.<br>

><br>

> So I can't answer to your specific question, but I just wanted to say<br>

> that your CPU analysis is simplistic and doesn't really match how CPUs<br>

> work now.<br>

<br>

It wasn't a CPU analysis because switching ASICs != CPUs.<br>

<br>

I am aware of the x86 architecture, but know little of network ASICs,<br>

so I was deliberately trying to not apply my x86 knowledge here, in<br>

case it sent me down the wrong path. You made references towards<br>

typical CPU features;<br>

<br>

> For example, it might take 4 times as long to process the first packet,<br>

> but as long as the hardware can handle 4 packets in a queue, you'll get<br>

> a packet result every cycle after that, without dropping anything.  So<br>

> maybe the first result takes 12 cycles, but then you can keep getting a<br>

> result every 3 cycles as long as the pipeline is kept full.<br>

<br>

Yes, in the x86/x64 CPU world keeping the instruction cache and data<br>

cache hot indeed results in optimal performance, and as you say modern<br>

CPUs use parallel pipelines amongst other techniques like branch<br>

prediction, SIMD, (N)UMA, and so on, but I would assume (because I<br>

don’t know) that not all of the x86 feature set maps nicely to packet<br>

processing in ASICs (VPP uses these techniques on COTS CPUs, to<br>

emulate a fixed pipeline, rather than run to completion model).<br>

<br>

You and Saku both suggest that heavy parallelism is the magic sauce;<br>

<br>

> Something can be "line rate" but not push the first packet<br>

> through in the shortest time.<br>

<br>

On Mon, 25 Jul 2022 at 15:16, Saku Ytti <<a href="mailto:saku@ytti.fi" target="_blank">saku@ytti.fi</a>> wrote:<br>

> I.e. say JNPR Trio PPE has many threads, and only one thread is<br>

> running, rest of the threads are waiting for answers from memory. That<br>

> is, once we start pushing packets through the device, it takes a long<br>

> ass time (like single digit microseconds) before we see any packets<br>

> out. 1000x longer than your calculated single digit nanoseconds.<br>

<br>

In principle I accept this idea. But let's try and do the maths, I'd<br>

like to properly understand;<br>

<br>

The non-drop rate of the J2 is 2Bpps @ 284 bytes == 4.8Tbps, my<br>

example scenario was a single J2 chip in a 12x400G device. If each<br>

port is receiving 400G @ 284 bytes (164,473,684 pps), that’s one every<br>

6.08 nanoseconds coming in. What kind of parallelism is required to<br>

stop from ingress dropping?<br>

<br>

It takes say 5 microseconds to process and forward a packet (seems<br>

reasonable looking at some Arista data sheets which use J2 variants),<br>

which means we need to be operating on 5,000ns / 6.08ns == 822 packets<br>

per port simultaneously, so 9868 packets are being processed across<br>

all 12 ports simultaneously, to stop ingress dropping on all<br>

interfaces.<br>
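
<br>

The arithmetic above can be sketched in a few lines of Python. This is only a rough check, not a model of the J2: it assumes 20 bytes of per-frame Ethernet overhead (preamble, SFD and inter-frame gap), which is what makes 284-byte frames at 400G come out to 164,473,684 pps, and it treats the 5 microsecond latency as a constant. The function names are made up for illustration.<br>

```python
# Rough check of the figures above. Assumptions (not from any datasheet):
# 20 bytes of on-wire overhead per frame and a constant 5 us latency.

ETH_OVERHEAD_BYTES = 20  # assumed: preamble (7) + SFD (1) + inter-frame gap (12)


def packets_per_second(link_bps, frame_bytes, overhead_bytes=ETH_OVERHEAD_BYTES):
    """Packet rate of a fully loaded link for a fixed frame size."""
    return link_bps / ((frame_bytes + overhead_bytes) * 8)


def packets_in_flight(pps, latency_s):
    """Little's law: items in the system = arrival rate * time in the system."""
    return pps * latency_s


PORTS = 12
pps = packets_per_second(400e9, 284)       # per-port arrival rate
interarrival_ns = 1e9 / pps                # gap between packets on one port
per_port = packets_in_flight(pps, 5e-6)    # packets being processed per port
total = per_port * PORTS                   # packets being processed chip-wide

print(f"{pps:,.0f} pps per port, one packet every {interarrival_ns:.2f} ns")
print(f"{per_port:.0f} packets in flight per port, {total:.0f} across {PORTS} ports")
```

Note that this only says how many packets must be somewhere inside the chip at once; it says nothing about whether they sit in parallel pipelines or in successive stages of one deep pipeline.<br>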

<br>

I think the latest generation Trio has 160 PPEs per PFE, but I’m not<br>

sure how many threads per PPE. Older generations had 20<br>

threads/contexts per PPE, so if it hasn’t increased that would make<br>

for 3200 threads in total. That is a 1.6Tbps FD chip, although not<br>

apples to apples of course, Trio is run to completion too.<br>

<br>

The Nokia FP5 has 1,200 cores (I have no idea how many threads per<br>

core) and is rated for 4.8Tbps FD. Again doing something quite<br>

different to a J2 chip, again it's RTC.<br>

<br>

J2 is a partially-fixed pipeline but slightly programmable if I have<br>

understood correctly, but definitely at the other end of the spectrum<br>

compared to RTC. So are we to surmise that a J2 chip has circa 10k<br>

parallel pipelines, in order to process 9868 packets in parallel?<br>

<br>

I have no frame of reference here, but in comparison to Gen 6 Trio or<br>

FP5, that seems very high to me (to the point where I assume I am<br>

wrong).<br>

<br>

Cheers,<br>

James.<br>

</blockquote></div>