400G forwarding - how does it work?

Tue Jul 26 19:39:23 UTC 2022

>
>
> "Pipeline" in the context of networking chips is not a terribly
> well-defined term.  In some chips, you'll have a pipeline that is built
> from very rigid hardware logic blocks -- the first block does exactly one
> part of the packet forwarding, then hands the packet (or just the header
> and metadata) to the second block, which does another portion of the
> forwarding.  You build the pipeline out of as many blocks as you need to
> solve your particular networking problem, and voila!

"Pipeline", in the context of networking chips, is not a terribly
well-defined term!  In some chips, you'll have an almost-literal pipeline
that is built from very rigid hardware logic blocks.  The first block does
exactly one part of the packet forwarding, then hands the packet (or just
the header and metadata) to the second block, which does another portion of
the forwarding.  You build the pipeline out of as many blocks as you need
to solve your particular networking problem, and voila!
The advantage here is that you can make things very fast and power
efficient, but they aren't all that flexible, and deity help you if you
ever need to do something in a different order than your pipeline!

You can also build a "pipeline" out of software functions - write up some
Python code (because everyone loves Python, right?) where function A calls
function B and so on.  At some level, you've just built a pipeline out of
different software functions.  This is going to be a lot slower (C code
will be faster but nowhere near as fast as dedicated hardware) but it's WAY
more flexible.  You can more or less dynamically build your "pipeline" on a
packet-by-packet basis, depending on what features and packet data you're
dealing with.
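
As a rough sketch of that idea (the stage functions, packet fields, and
addresses below are all made up for illustration, not taken from any real
forwarding code), a per-packet software "pipeline" might look like this:

```python
# Hypothetical stages: each takes a packet (a dict here) and returns it.
def parse(pkt):
    pkt["parsed"] = True
    return pkt

def acl_check(pkt):
    # Toy rule, assumed for the example: deny telnet, permit the rest.
    pkt["permitted"] = pkt.get("dst_port") != 23
    return pkt

def route_lookup(pkt):
    # Toy "routing table": one prefix match plus a default next hop.
    in_prefix = pkt["dst"].startswith("192.0.2.")
    pkt["next_hop"] = "10.0.0.1" if in_prefix else "10.0.0.254"
    return pkt

def mpls_push(pkt):
    # Push a made-up label onto the front of the label stack.
    pkt["labels"] = [100] + pkt.get("labels", [])
    return pkt

def build_pipeline(pkt):
    """Choose the list of stages for this particular packet."""
    stages = [parse]
    if pkt.get("acl_enabled"):
        stages.append(acl_check)
    stages.append(route_lookup)
    if pkt.get("needs_mpls"):
        stages.append(mpls_push)
    return stages

def forward(pkt):
    """Run the packet through whatever pipeline was built for it."""
    for stage in build_pipeline(pkt):
        pkt = stage(pkt)
    return pkt
```

Two packets with different feature flags go through different lists of
stages, which is exactly the flexibility (and the per-packet overhead) that
a rigid hardware pipeline doesn't have.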

"Microcode" is really just a term we use for something like "really
optimized and limited instruction sets for packet forwarding".  Just like
an x86 or an ARM has some finite set of instructions that it can execute,
so do current networking chips.  The larger that instruction space is and
the more combinations of those instructions you can store, the more
flexible your code is.  Of course, you can't make that part of the chip
bigger without making something else smaller, so there's another tradeoff.
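
To make the "limited instruction set" idea a bit more concrete, here is a
toy sketch.  Everything in it is an assumption for illustration: the three
opcodes, the packet fields, and the four-instruction program limit are
invented and don't correspond to any real chip's microcode.

```python
# Hypothetical opcodes: each one mutates the packet (a dict) in place.
def op_dec_ttl(pkt, arg):
    pkt["ttl"] -= 1

def op_set_field(pkt, arg):
    field, value = arg
    pkt[field] = value

def op_push_label(pkt, arg):
    pkt.setdefault("labels", []).insert(0, arg)

# The finite instruction set: anything not listed here can't be done.
OPCODES = {
    "dec_ttl": op_dec_ttl,
    "set_field": op_set_field,
    "push_label": op_push_label,
}

# Stand-in for a finite instruction store (made-up size).
MAX_PROGRAM_LEN = 4

def load_program(program):
    """Reject programs that don't fit the opcode set or the store."""
    if len(program) > MAX_PROGRAM_LEN:
        raise ValueError("program does not fit in instruction store")
    for opcode, _ in program:
        if opcode not in OPCODES:
            raise ValueError(f"unsupported opcode: {opcode}")
    return list(program)

def run(program, pkt):
    """Execute each (opcode, argument) pair against the packet, in order."""
    for opcode, arg in program:
        OPCODES[opcode](pkt, arg)
    return pkt
```

Adding an opcode or raising MAX_PROGRAM_LEN makes more features
expressible, which is the software analogue of spending more silicon on
instruction space.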

MOST current chips are really a hybrid/combination of these two extremes.
You have some set of fixed logic blocks that do exactly One Set Of Things,
and you have some other logic blocks that can be reconfigured to do A Few
Different Things.  The degree to which the programmable stuff is
programmable is a major input to how many different features you can do on
the chip, and at what speeds.  Sometimes you can use the same hardware
block to do multiple things on a packet if you're willing to sacrifice some
packet rate and/or bandwidth.  The constant "law of physics" is that you
can always do a given function in less power/space/cost if you're willing
to optimize for that specific thing -- but you're sacrificing flexibility
to do it.  The more flexibility ("programmability") you want to add to a
chip, the more logic and memory you need to add.

From a performance standpoint, on current "fast" chips, many (but certainly
not all) of the "pipelines" are designed to forward one packet per clock
cycle for "normal" use cases.  (Of course we sneaky vendors get to decide
what is normal and what's not, but that's a separate issue...)  So if I
have a chip that has one pipeline and it's clocked at 1.25 GHz, that means
that it can forward 1.25 billion packets per second.  Note that this does
NOT mean that I can forward a packet in "a one-point-two-five-billionth of
a second" -- but it does mean that every clock cycle I can start on a new
packet and finish another one.  The length of the pipeline impacts the
latency of the chip, although this part of the latency is often a rounding
error compared to the number of times I have to read and write the packet
into different memories as it goes through the system.
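
To put rough numbers on throughput versus latency, here's a small sketch.
The 1.25 GHz clock is the one from the example above; the 100-stage depth
is purely an assumption picked to make the arithmetic visible, not a figure
from any real chip.

```python
CLOCK_HZ = 1.25e9        # clock rate from the example above
PIPELINE_STAGES = 100    # ASSUMED depth, for illustration only

def packets_per_second(clock_hz, packets_per_clock=1):
    """Throughput: how many packets finish per second."""
    return clock_hz * packets_per_clock

def pipeline_latency_ns(clock_hz, stages):
    """Latency: time for one packet to traverse every stage, in ns."""
    return stages / clock_hz * 1e9

if __name__ == "__main__":
    print(packets_per_second(CLOCK_HZ))                    # packets/second
    print(pipeline_latency_ns(CLOCK_HZ, PIPELINE_STAGES))  # nanoseconds
```

With those assumed numbers, each packet spends roughly 80 ns in the
pipeline even though a new one completes every 0.8 ns: throughput is set by
the clock, latency by the depth.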

So if this pipeline can do 1.25 billion PPS and I want to be able to
forward 10 billion PPS, I can build a chip that has 8 of these pipelines
and get my performance target that way.  I could also build a "pipeline"
that processes multiple packets per clock; if I have one that does 2
packets/clock then I only need 4 of said pipelines... and so on and so
forth.  The exact details of how the pipelines are constructed and how much
parallelism I built INSIDE a pipeline as opposed to replicating pipelines
are sort of Gooky Implementation Details, but they're a very very important
part of doing the chip level architecture as those sorts of decisions drive
lots of Other Important Decisions in the silicon design...
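
The pipeline-count arithmetic above boils down to one division.  A minimal
sketch (the function name and parameters are just illustrative):

```python
def pipelines_needed(target_pps, clock_hz, packets_per_clock=1):
    """How many identical pipelines it takes to reach target_pps.

    All arguments are integers; the result is rounded up, since you
    can't build a fraction of a pipeline.
    """
    per_pipeline_pps = clock_hz * packets_per_clock
    return -(-target_pps // per_pipeline_pps)  # ceiling division

if __name__ == "__main__":
    target = 10_000_000_000   # 10 billion PPS
    clock = 1_250_000_000     # 1.25 GHz
    print(pipelines_needed(target, clock))                       # 8
    print(pipelines_needed(target, clock, packets_per_clock=2))  # 4
```

This deliberately ignores everything that makes the real decision hard
(area, power, memory bandwidth, how the pipelines share tables); it only
reproduces the 8-versus-4 numbers from the text.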

--lj