400G forwarding - how does it work?

Mon Jul 25 12:53:03 UTC 2022

Hi All,

I've been trying to understand how forwarding at 400G is possible,
specifically in this example, in relation to the Broadcom J2 chips,
but I don't the mystery is anything specific to them...

According to the Broadcom Jericho2 BCM88690 data sheet it provides
4.8Tbps of traffic processing and supports packet forwarding at 2Bpps.
According to my maths that means it requires packet sizes of 300Bs to
reach line rate across all ports. The data sheet says packet sizes
above 284B, so I guess this is excluding some headers like the
inter-frame gap and CRC (nothing after the PHY/MAC needs to know about
them if the CRC is valid)? As I interpret the data sheet, J2 should
supports chassis with 12x 400Gbps ports at line rate with 284B packets
then.

Jericho2 can be linked to a BCM16K for expanded packet forwarding
tables and lookup processing (i.e. to hold the full global routing
table, in such a case, forwarding lookups are offloaded to the
BCM16K). The BCM16K documentation suggests that it uses TCAM for exact
matching (e.g.,for ACLs) in something called the "Database Array"
(with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in
something called the "User Data Array" (with 16M 32b entries?).

A BCM16K supports 16 parallel searches, which means that each of the
12x 400G ports on a Jericho2 could perform an forwarding lookup at
same time. This means that the BCM16K "only" needs to perform
forwarding look-ups at a linear rate of 1x 400Gbps, not 4.8Tbps, and
"only" for packets larger than 284 bytes, because that is the Jericho2
line-rate Pps rate. This means that each of the 16 parallel searches
in the BCM16K, they need to support a rate of 164Mpps (164,473,684) to
reach 400Gbps. This is much more in the realm of feasible, but still
pretty extreme...

1 second / 164473684 packets = 1 packet every 6.08 nanoseconds, which
is within the access time of TCAM and SRAM but this needs to include
some computing time too e.g. generating a key for a lookup and passing
the results along the pipeline etc. The BCM16K has a clock speed of
1Ghz (1,000,000,000, cycles per second, or cycle every 1 nano second)
and supports an SRAM memory access in a single clock cycle (according
to the data sheet). If one cycle is required for an SRAM lookup, the
BCM16K only has 5 cycles to perform other computation tasks, and the
J2 chip needs to do the various header re-writes and various counter
updates etc., so how is magic this happening?!?

The obvious answer is that it's not magic and my understanding is
fundamentally flawed, so please enlighten me.

Cheers,
James.