400G forwarding - how does it work?

ljwobker at gmail.com
Wed Jul 27 13:56:55 UTC 2022

The Broadcom KBP -- often called an "external TCAM" -- is really closer to a completely separate NPU than just an external TCAM.  "Back in the day" we used external TCAMs to store forwarding state (FIB tables, ACL tables, whatever) on devices that were pretty much just a bunch of TCAM memory and an interface for the "main" NPU to ask for a lookup.  Today the modern KBP devices have WAY more functionality: they have lots of different databases and tables available, which can be sliced and diced into different widths and depths.  They can store lots of different kinds of state, from counters to LPM prefixes and ACLs.  At risk of correcting Ohta-san, note that most ACLs are implemented using TCAMs with wildcard/masking support, as opposed to an exact match lookup.  Exact match lookups are generally used for things that do not require masking or wildcard bits: MAC addresses and MPLS label values are the canonical examples here.

The SRAM memories used in fast networking chips are almost always built such that they provide one lookup per clock, although hardware designers often use multiple banks of these to increase the number of *effective* lookups per clock.  TCAMs are also generally built such that they provide one lookup/result per clock, but again you can stack up multiple devices to increase this.
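A quick back-of-the-envelope model of the banking trick described above (my own illustration, with made-up numbers):

```python
# If each SRAM bank delivers one lookup per clock, N banks give at most
# N lookups per clock -- an upper bound that assumes requests spread
# evenly across banks with no conflicts.

def effective_lookups_per_sec(clock_hz: float, banks: int) -> float:
    """Upper bound on lookups/sec for a banked memory, 1 lookup/clock/bank."""
    return clock_hz * banks

# e.g. a 1 GHz memory with 4 banks caps out at 4e9 lookups/sec.
peak = effective_lookups_per_sec(1e9, 4)
```

In a real design, bank conflicts (two lookups hashing to the same bank in the same clock) push the achieved rate below this bound.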

Many hardware designs also allow for more flexibility in how the various memories are utilized by the software -- almost everyone is familiar with the idea of "I can have a million entries of X bits, or half a million entries of 2*X bits".  If the hardware and software complexity were free, we'd design memories that could be arbitrarily chopped into exactly the sizes we need, but that complexity is Absolutely Not Free.... so we end up picking a few discrete sizes and the software/forwarding code has to figure out how to use those bits efficiently.  And you can bet your life that as soon as you have a memory that can function using either 80b or 160b entries, you will immediately come across a use case that really really needs to use entries of 81b.
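The discrete-width penalty is easy to see in numbers (an illustrative sketch -- the 80b/160b widths come from the example above, everything else is made up):

```python
# When hardware only supports a few discrete entry widths, an entry is
# stored in the smallest supported width that holds it -- so that 81b
# entry burns a whole 160b slot.

def entries_available(total_bits: int, entry_bits: int,
                      widths: tuple[int, ...] = (80, 160)) -> int:
    """How many entries of entry_bits fit in a memory of total_bits."""
    fitting = [w for w in widths if w >= entry_bits]
    if not fitting:
        raise ValueError("entry wider than any supported width")
    return total_bits // min(fitting)

# A (hypothetical) 160,000-bit memory holds 2000 entries at 80b each,
# but only 1000 entries the moment you need 81b per entry.
```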

FYI: There's nothing particularly magical about 40b memory widths.  When building these chips you can (more or less) pick whatever width of SRAM you want to build, and the memory libraries that you use spit out the corresponding physical design.

Ohta-san correctly mentions that a critical part of the performance analysis is how fast the different parts of the pipeline can talk to each other.  Note that this concept applies whether we're talking about the connection between very small blocks within the ASIC/NPU, or the interface between the NPU and an external KBP/TCAM, or for that matter between multiple NPUs/fabric chips within a system.  At some point you'll always be constrained by whatever the slowest link in the pipeline is, so balancing all that stuff out is Yet One More Thing for the system designer to deal with.
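The "slowest link in the pipeline" point reduces to a one-liner; the stage names and rates below are hypothetical, just to show the shape of the analysis:

```python
# A forwarding pipeline runs no faster than its slowest stage or
# inter-stage link, whether that's an on-die block, the NPU<->KBP
# interface, or the fabric between NPUs.

def pipeline_pps(stage_rates_pps: dict[str, float]) -> tuple[str, float]:
    """Return the bottleneck stage and its packet rate."""
    stage = min(stage_rates_pps, key=stage_rates_pps.get)
    return stage, stage_rates_pps[stage]

# Hypothetical budget: the external-TCAM interface is the bottleneck.
rates = {
    "npu_core":      1000e6,  # 1000 Mpps
    "npu_kbp_link":   600e6,  #  600 Mpps
    "fabric":         800e6,  #  800 Mpps
}
```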


-----Original Message-----
From: NANOG <nanog-bounces+ljwobker=gmail.com at nanog.org> On Behalf Of Masataka Ohta
Sent: Wednesday, July 27, 2022 9:09 AM
To: nanog at nanog.org
Subject: Re: 400G forwarding - how does it work?

James Bensley wrote:

> The BCM16K documentation suggests that it uses TCAM for exact matching 
> (e.g.,for ACLs) in something called the "Database Array"
> (with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in 
> something called the "User Data Array" (with 16M 32b entries?).

Which documentation?

According to:


figure 1 and related explanations:

	Database records 40b: 2048k/1024k.
	Table width configurable as 80/160/320/480/640 bits.
	User Data Array for associated data, width configurable as
	32/64/128/256 bits.

This means that the header extracted by the 88690 is analyzed by the 16K, finally resulting in 40b of information (a lot shorter than an IPv6 address, but perhaps still enough for an IPv6 backbone to identify sites) from the "database"
lookup -- obviously done by CAM, because 40b is painful for SRAM -- which is then converted to "32/64/128/256 bits" of associated data.

> 1 second / 164473684 packets = 1 packet every 6.08 nanoseconds, which 
> is within the access time of TCAM and SRAM

As high-speed TCAM and SRAM should be pipelined, what matters is the cycle time, which is shorter than the access time.

Finally, it should be pointed out that most, if not all, performance figures such as MIPS and FLOPS are upper bounds: they are merely guaranteed not to be exceeded.

In this case, if deep packet inspection of lengthy headers is required, whether for some complicated routing scheme or to satisfy NSA requirements, the communication speed between the 88690 and the 16K will be the limiting factor for PPS, resulting in a lot less than the maximum possible PPS.

						Masataka Ohta
