400G forwarding - how does it work?

Sun Aug 7 14:37:29 UTC 2022

Buffering is a near-religious topic across a large swath of the network industry, but here are some opinions of mine:

A LOT of operators/providers need more buffering than you can realistically put directly onto the ASIC die.  Fast chips without external buffers measure their buffer capacity in tens of microseconds of traffic, which is nowhere near enough for a lot of the market.  We can (and do) argue about exactly which network roles can be met by this amount of buffering, but it's absolutely not a large enough part of the market to abandon "big" external buffers entirely.
Once you "jump off the cliff" of needing something more than on-chip SRAM, you're in this weird area where nothing exists in the technology space that *really* solves the problem, because you need access rate and bandwidth more than you need capacity.  HBM is currently the best (or at least the most popular) combination of capacity, power, access rate, and bandwidth... but it's still nowhere near perfect.  A common HBM2 implementation gives you 8GB of buffer space, about 2Tb/s of raw bandwidth, and a few hundred million IOPS.  (A lot of that gets gobbled up by various overheads...)
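
To put rough numbers on the two points above, here is a back-of-the-envelope sketch in Python.  The 12.8Tb/s chip rate and the 64MB on-chip buffer are illustrative assumptions of mine, not figures for any particular device, and the function names are made up for this example; only the 8GB HBM2 capacity comes from the text.

```python
# Back-of-the-envelope sketch. CHIP_RATE_BPS and ON_CHIP_BYTES are
# illustrative assumptions, not figures for any specific device.

def buffer_depth_us(buffer_bytes: float, drain_rate_bps: float) -> float:
    """Microseconds of traffic a buffer holds at the given drain rate."""
    return buffer_bytes * 8 / drain_rate_bps * 1e6

def packets_per_second(line_rate_bps: float, frame_bytes: int = 64) -> float:
    """Ethernet packet rate, counting 20 bytes of preamble plus inter-frame gap."""
    return line_rate_bps / ((frame_bytes + 20) * 8)

CHIP_RATE_BPS = 12.8e12  # assumed aggregate forwarding rate
ON_CHIP_BYTES = 64e6     # assumed on-die SRAM packet buffer
HBM_BYTES = 8e9          # the 8GB HBM2 figure quoted above

print(f"on-chip: {buffer_depth_us(ON_CHIP_BYTES, CHIP_RATE_BPS):.0f} us")      # 40 us
print(f"HBM:     {buffer_depth_us(HBM_BYTES, CHIP_RATE_BPS) / 1000:.0f} ms")   # 5 ms
print(f"400G at 64B frames: {packets_per_second(400e9) / 1e6:.1f} Mpps")       # 595.2 Mpps
```

Under those assumptions the on-chip buffer holds tens of microseconds of traffic while the HBM holds milliseconds.  A single 400G port of minimum-size frames is already about 595 Mpps, which is the same order of magnitude as the "few hundred million IOPS" above, and that is why access rate hurts more than capacity.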

These values are a function of two things:
1) memory physics - I don't know enough about how these things are Like Really Actually Built to talk about this part.
2) market forces... the market for this stuff is really GPUs, ML/AI applications, etc.  The networking silicon market is a drop in the ocean compared to the rest of compute, so the specific needs of my router are never going to drive enough volume to get big memory makers to do exactly what **I** want.  I'm at the mercy of what they build for the gigantic players in the rest of the market.

If you told me that someone had a memory technology that was something like "one-fourth the capacity of HBM, but four times the bandwidth and four times the access rate" I would do backflips and buy a lot of it, because it's a way better fit for the specific performance dimensions I need for A Really Fast Router.  But nothing remotely along these lines exists... so like a lot of other people I just have to order off the menu.   ;-)

--lj

-----Original Message-----
From: NANOG <nanog-bounces+ljwobker=gmail.com at nanog.org> On Behalf Of Masataka Ohta
Sent: Sunday, August 7, 2022 5:13 AM
To: nanog at nanog.org
Subject: Re: 400G forwarding - how does it work?

ljwobker at gmail.com wrote:

> Buffer designs are *really* hard in modern high speed chips, and there 
> are always lots and lots of tradeoffs.  The "ideal" answer is an 
> extremely large block of memory that ALL of the forwarding/queueing 
> elements have fair/equal access to... but this physically looks more 
> or less like a full mesh between the memory/buffering subsystem and 
> all the forwarding engines, which becomes really unwieldy 
> (expensive!) from a design standpoint.  The amount of memory you can 
> practically put on the main NPU die is on the order of 20-200 **mega** 
> bytes, where a single stack of HBM memory comes in at 4GB -- it's 
> literally 100x the size.

I'm afraid you are implying far too much buffering (buffer bloat), which only causes unnecessary and unpleasant delay.

With an M/M/1 queue at 99% load, a buffer of 500 packets (750kB at a 1500B MTU) is enough to keep the packet drop probability below 1%. At 98% load, the probability is 0.0041%.
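
These figures are consistent with the standard M/M/1 tail probability: for an infinite M/M/1 queue at utilization rho, the probability that k or more packets are in the system is rho^k, which approximates the drop probability of a k-packet buffer. A minimal sketch in Python, assuming that is the approximation intended (the function name is made up for this example):

```python
def mm1_tail(rho: float, k: int) -> float:
    """P(N >= k) for an infinite M/M/1 queue at utilization rho: rho**k."""
    return rho ** k

BUFFER_PACKETS = 500
MTU_BYTES = 1500

print(f"buffer size: {BUFFER_PACKETS * MTU_BYTES / 1000:.0f} kB")    # 750 kB
print(f"99% load: {mm1_tail(0.99, BUFFER_PACKETS) * 100:.2f}%")      # about 0.66%
print(f"98% load: {mm1_tail(0.98, BUFFER_PACKETS) * 100:.4f}%")      # about 0.0041%
```

Note that this assumes Poisson arrivals and exponentially distributed service times, as M/M/1 does.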

But many router engineers believe that, with a bloated buffer, the packet drop probability can be zero, which is wrong.

For example,

	https://www.broadcom.com/products/ethernet-connectivity/switching/stratadnx/bcm88690
	Jericho2 delivers a complete set of advanced features for
	the most demanding carrier, campus and cloud environments.
	The device supports low power, high bandwidth HBM packet
	memory offering up to 160X more traffic buffering compared
	with on-chip memory, enabling zero-packet-loss in heavily
	congested networks.

					Masataka Ohta