400G forwarding - how does it work?
diptanshu.singh at gmail.com
Tue Jul 26 19:54:22 UTC 2022
mandatory slide of laundry analogy for pipelining
On Tue, 26 Jul 2022 at 12:41, Lawrence Wobker <ljwobker at gmail.com> wrote:
> "Pipeline", in the context of networking chips, is not a terribly
> well-defined term! In some chips, you'll have an almost-literal pipeline
> that is built from very rigid hardware logic blocks. The first block does
> exactly one part of the packet forwarding, then hands the packet (or just
> the header and metadata) to the second block, which does another portion of
> the forwarding. You build the pipeline out of as many blocks as you need
> to solve your particular networking problem, and voila!
> The advantage here is that you can make things very fast and power
> efficient, but they aren't all that flexible, and deity help you if you
> ever need to do something in a different order than your pipeline!
> You can also build a "pipeline" out of software functions - write up some
> Python code (because everyone loves Python, right?) where function A calls
> function B and so on. At some level, you've just built a pipeline out of
> different software functions. This is going to be a lot slower (C code
> will be faster but nowhere near as fast as dedicated hardware) but it's WAY
> more flexible. You can more or less dynamically build your "pipeline" on a
> packet-by-packet basis, depending on what features and packet data you're
> dealing with.
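
To make the software-functions idea concrete, here is a minimal sketch in Python. Every stage name, packet field, and the toy FIB below is made up for illustration and is not from any real forwarding stack. It only shows the stage list being chosen per packet, based on what the packet carries.

```python
# Hypothetical per-packet software "pipeline". All stage names, packet
# fields, and FIB entries are invented for illustration.

TOY_FIB = {"192.0.2.1": 3}  # destination -> egress port (assumed values)
DEFAULT_PORT = 0


def parse_ethernet(pkt):
    pkt["trace"].append("parse_ethernet")
    return pkt


def strip_vlan(pkt):
    pkt["trace"].append("strip_vlan")
    pkt.pop("vlan", None)
    return pkt


def lookup_ipv4(pkt):
    pkt["trace"].append("lookup_ipv4")
    # Exact-match dictionary lookup standing in for a real route lookup.
    pkt["egress_port"] = TOY_FIB.get(pkt["dst"], DEFAULT_PORT)
    return pkt


def decrement_ttl(pkt):
    pkt["trace"].append("decrement_ttl")
    pkt["ttl"] -= 1
    return pkt


def build_pipeline(pkt):
    """Pick the stages for this one packet based on what it carries."""
    stages = [parse_ethernet]
    if "vlan" in pkt:
        stages.append(strip_vlan)
    stages += [lookup_ipv4, decrement_ttl]
    return stages


def forward(pkt):
    pkt = dict(pkt, trace=[])  # work on a copy, record which stages ran
    for stage in build_pipeline(pkt):
        pkt = stage(pkt)
    return pkt


tagged = forward({"dst": "192.0.2.1", "ttl": 64, "vlan": 100})
untagged = forward({"dst": "198.51.100.7", "ttl": 64})
print(tagged["trace"])
print(untagged["trace"])
```

The tagged packet goes through four stages and the untagged one through three, which is the packet-by-packet flexibility described above. The price is a function call per stage per packet.
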
> "Microcode" is really just a term we use for something like "really
> optimized and limited instruction sets for packet forwarding". Just like
> an x86 or an ARM has some finite set of instructions that it can execute,
> so do current networking chips. The larger that instruction space is and
> the more combinations of those instructions you can store, the more
> flexible your code is. Of course, you can't make that part of the chip
> bigger without making something else smaller, so there's another tradeoff.
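
A toy way to picture the "limited instruction set" point, again in Python. The three instruction names, the program-length limit, and the packet fields are hypothetical and do not correspond to any vendor's microcode. The sketch only shows that a program can use nothing outside the instruction table and has to fit in a fixed-size store.

```python
# Toy "microcode" interpreter. Instruction names, the store limit, and
# packet fields are all hypothetical.

MAX_PROGRAM_LEN = 4  # stand-in for finite on-chip instruction memory


def op_dec_ttl(pkt, _arg):
    pkt["ttl"] -= 1


def op_drop_if_ttl_zero(pkt, _arg):
    if pkt["ttl"] <= 0:
        pkt["drop"] = True


def op_set_port(pkt, arg):
    pkt["egress_port"] = arg


# The whole "instruction set": anything not listed here cannot be done.
INSTRUCTIONS = {
    "DEC_TTL": op_dec_ttl,
    "DROP_IF_TTL_ZERO": op_drop_if_ttl_zero,
    "SET_PORT": op_set_port,
}


def run(program, pkt):
    if len(program) > MAX_PROGRAM_LEN:
        raise ValueError("program does not fit in instruction memory")
    pkt = dict(pkt, drop=False)  # work on a copy
    for opcode, arg in program:
        if opcode not in INSTRUCTIONS:
            raise ValueError(f"unsupported instruction: {opcode}")
        INSTRUCTIONS[opcode](pkt, arg)
    return pkt


program = [("DEC_TTL", None), ("DROP_IF_TTL_ZERO", None), ("SET_PORT", 7)]
result = run(program, {"ttl": 1})
print(result)
```

Adding entries to `INSTRUCTIONS` or raising `MAX_PROGRAM_LEN` makes the toy more flexible. On a real chip the equivalent change costs area that has to come from somewhere else.
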
> MOST current chips are really a hybrid/combination of these two extremes.
> You have some set of fixed logic blocks that do exactly One Set Of Things,
> and you have some other logic blocks that can be reconfigured to do A Few
> Different Things. The degree to which the programmable stuff is
> programmable is a major input to how many different features you can do on
> the chip, and at what speeds. Sometimes you can use the same hardware
> block to do multiple things on a packet if you're willing to sacrifice some
> packet rate and/or bandwidth. The constant "law of physics" is that you
> can always do a given function in less power/space/cost if you're willing
> to optimize for that specific thing -- but you're sacrificing flexibility
> to do it. The more flexibility ("programmability") you want to add to a
> chip, the more logic and memory you need to add.
> From a performance standpoint, on current "fast" chips, many (but
> certainly not all) of the "pipelines" are designed to forward one packet
> per clock cycle for "normal" use cases. (Of course we sneaky vendors get
> to decide what is normal and what's not, but that's a separate issue...)
> So if I have a chip that has one pipeline and it's clocked at 1.25 GHz, that
> means that it can forward 1.25 billion packets per second. Note that this
> does NOT mean that I can forward a packet in "a
> one-point-two-five-billionth of a second" -- but it does mean that every
> clock cycle I can start on a new packet and finish another one. The length
> of the pipeline impacts the latency of the chip, although this part of the
> latency is often a rounding error compared to the number of times I have to
> read and write the packet into different memories as it goes through the
> So if this pipeline can do 1.25 billion PPS and I want to be able to
> forward 10 billion PPS, I can build a chip that has 8 of these pipelines
> and get my performance target that way. I could also build a "pipeline"
> that processes multiple packets per clock. If I have one that does 2
> packets/clock then I only need 4 of said pipelines... and so on and so
> forth. The exact details of how the pipelines are constructed and how much
> parallelism I build INSIDE a pipeline as opposed to replicating pipelines
> is sort of Gooky Implementation Details, but it's a very very important
> part of doing the chip level architecture as those sorts of decisions drive
> lots of Other Important Decisions in the silicon design...
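
The arithmetic above, as a small Python sketch. The clock rate and the 10 billion PPS target come from the quoted text. The 100-stage pipeline depth is purely an assumed number, there only to show that throughput and latency are separate quantities.

```python
# Throughput vs. latency for a one-packet-per-clock pipeline.
# CLOCK_HZ and TARGET_PPS are from the quoted mail; ASSUMED_STAGES is an
# illustrative guess, not a figure for any real chip.

CLOCK_HZ = 1_250_000_000      # 1.25 GHz
TARGET_PPS = 10_000_000_000   # 10 billion packets per second
ASSUMED_STAGES = 100          # hypothetical pipeline depth


def packets_per_second(clock_hz, packets_per_clock=1):
    """A pipeline that accepts N packets every clock forwards N * clock PPS."""
    return clock_hz * packets_per_clock


def pipelines_needed(target_pps, clock_hz, packets_per_clock=1):
    """Pipelines needed to reach target_pps, rounded up."""
    rate = packets_per_second(clock_hz, packets_per_clock)
    return (target_pps + rate - 1) // rate


def pipeline_latency_ns(stages, clock_hz):
    """Time for one packet to traverse all stages, one stage per clock."""
    return stages / clock_hz * 1e9


one_per_clock = pipelines_needed(TARGET_PPS, CLOCK_HZ)
two_per_clock = pipelines_needed(TARGET_PPS, CLOCK_HZ, packets_per_clock=2)
latency_ns = pipeline_latency_ns(ASSUMED_STAGES, CLOCK_HZ)

print(one_per_clock, two_per_clock, latency_ns)
```

With these inputs you get 8 pipelines at one packet per clock and 4 at two packets per clock, matching the numbers in the mail. Under the assumed depth, a single packet spends about 80 ns in the pipeline even though a new one starts every 0.8 ns.
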