Broadcom vs Mellanox based platforms

Mon Jun 4 08:33:32 UTC 2018

Hi Kim,

I'll share key learnings from my work on high-speed software 
networking, which I started in 2006, when everyone was laughing at me 
because I claimed to achieve 10Gbps networking with a CPU.

CPU is less important than memory/QPI
On x86, the memory subsystem includes things like Cache Boxes, Home 
Agent, DRAM controllers... The Home Agent is responsible for knowing on 
which CPU node a cacheline resides, so it can become a centralized 
bottleneck. DRAM controllers have a queue of pending DRAM requests 
(instruction pipeline, data prefetch, data...). QPI routing may also 
severely impact performance. I remember using a 4-socket system that had 
half the performance of a 2-socket system because of either bad QPI 
routing programming by the BIOS or a hardware issue.
An order of magnitude to keep in mind is that at 100Gbps, each 64-byte 
packet and each associated 64-byte metadata cacheline in use is 
consuming roughly a full DRAM channel. As an example, and not counting 
the application data to be used (FIB, DNS database...), a 100Gbps DPDK 
bridging application requires 3 memory channels per port (to reach line 
rate, if the IO allows it)... There is a lot more to say but I'll let 
you do your own research ;-)
BTW, why would you want to do 100Gbps line rate (or very close to it)? 
To ensure that each node has the capacity to resist a DDoS attack 
powered by DPDK/ODP/native "applications".
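
To put rough numbers on the 100Gbps order of magnitude above, here is a 
minimal Python sketch. The function names and the number of cachelines 
touched per packet are my own assumptions for illustration, not 
measurements. For comparison, a DDR4-2400 channel peaks at 19.2 GB/s in 
theory, and sustained figures are lower.

```python
# Back-of-envelope sketch: packet rate and cacheline traffic at line rate.
# cachelines_per_packet is an assumed parameter; the real count depends on
# the driver, the metadata layout and the application.

ETHERNET_OVERHEAD_BYTES = 20  # preamble (7) + SFD (1) + inter-frame gap (12)
CACHELINE_BYTES = 64


def line_rate_pps(link_bps, frame_bytes):
    """Packets per second at line rate for a given Ethernet frame size."""
    return link_bps / ((frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8)


def cacheline_traffic_gbytes_per_s(link_bps, frame_bytes, cachelines_per_packet):
    """Memory traffic in GB/s if every packet touches that many cachelines."""
    pps = line_rate_pps(link_bps, frame_bytes)
    return pps * cachelines_per_packet * CACHELINE_BYTES / 1e9


if __name__ == "__main__":
    pps = line_rate_pps(100e9, 64)
    print(f"{pps / 1e6:.1f} Mpps, {1e9 / pps:.2f} ns per packet")
    for touched in (1, 2, 4):
        traffic = cacheline_traffic_gbytes_per_s(100e9, 64, touched)
        print(f"{touched} cacheline(s) per packet: {traffic:.1f} GB/s")
```

Each DMA write, CPU read or metadata update of a cacheline counts as one 
touch here, so the figure grows quickly once RX, processing and TX are 
all included.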

PCI is your enemy (or not that good a friend)
PCI chipset behavior is complex. The typical payload on x86 is 256 
bytes. So I assumed that using a 1KB max payload to support the average 
670-byte internet packet size would give better results... But no: early 
DMA transaction acknowledgement is disabled if the payload is not 256 
bytes, so performance dropped significantly.
You may have an embedded switch on the NIC, so you think that offloading 
will give you a benefit. Yes at low speed, but you can't build a 50Gbps 
service chain because most NICs are on PCI x8 Gen3 slots, which are 
limited to 50Gbps of bandwidth.
So the conclusion is: don't try to understand those limits, create a 
testbed that really mimics the target "size" and topology of your use 
case and measure.
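
As a rough companion to the PCI limits above (not a substitute for 
measuring), here is a minimal Python sketch of the PCIe Gen3 x8 
arithmetic. The 24-byte per-TLP overhead is an assumed figure (framing, 
header, LCRC), and the function names are mine. The sketch ignores 
DLLPs, flow control and descriptor traffic, so real throughput will be 
lower than this estimate.

```python
# Back-of-envelope sketch of PCIe payload bandwidth.
# tlp_overhead_bytes is an assumption; check your platform's actual values.


def pcie_raw_gbps(gt_per_s, lanes, encoding=128 / 130):
    """Raw link bandwidth in Gbps after line encoding (128b/130b on Gen3)."""
    return gt_per_s * lanes * encoding


def pcie_payload_gbps(raw_gbps, payload_bytes, tlp_overhead_bytes=24):
    """Payload bandwidth when every TLP pays a fixed per-packet overhead."""
    return raw_gbps * payload_bytes / (payload_bytes + tlp_overhead_bytes)


if __name__ == "__main__":
    raw = pcie_raw_gbps(8.0, 8)  # Gen3: 8 GT/s per lane, 8 lanes
    print(f"raw: {raw:.1f} Gbps")
    for payload in (128, 256, 512):
        print(f"payload {payload} bytes: {pcie_payload_gbps(raw, payload):.1f} Gbps")
```

A service chain crosses the slot several times per packet, so the budget 
per hop is a fraction of whatever the slot actually sustains.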

Don't do tests at 10Gbps if your target is 100Gbps.
Starting at 50Gbps you will bump into the PCI DMA transaction rate 
barrier. Unless you have a smart IO model (multiple packets per DMA 
transaction - see Netcope for instance) supported in zero-copy by the 
SDK architecture, you won't reach line rate or leave room for an 
application (zero-copy of data or metadata reduction can save a DRAM 
channel for the application at this "speed"). I think (but am not sure) 
you can squeeze two packets in a buffer with Mellanox cards: that can be 
instrumental in reaching 50Gbps line rate, but I don't know if DPDK 
supports this feature.
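
To illustrate why packing several packets into one DMA transaction 
matters, here is a minimal Python sketch. It assumes one DMA transaction 
per batch of packets and ignores descriptor fetches and write-backs. The 
function name and the batch sizes are hypothetical, not taken from any 
NIC datasheet.

```python
# Sketch: DMA transaction rate needed to sustain line rate, as a function
# of how many packets share one transaction (an assumed, simplified model).

ETHERNET_OVERHEAD_BYTES = 20  # preamble + SFD + inter-frame gap


def dma_transactions_per_s(link_bps, frame_bytes, packets_per_transaction=1):
    """Transactions per second if each one carries a fixed number of packets."""
    pps = link_bps / ((frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8)
    return pps / packets_per_transaction


if __name__ == "__main__":
    for batch in (1, 2, 4):
        rate = dma_transactions_per_s(50e9, 64, batch)
        print(f"{batch} packet(s) per transaction: {rate / 1e6:.1f} M transactions/s")
```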

Don't measure pps at the switch level if your target is fast VM 
application behavior.
Measuring that a software switch can do 10Gbps line rate with 64-byte 
packets does not help at all to predict TCP application performance in a 
VM. Factors such as GRO/GSO support are more important, as the limiting 
factor is TCP window opening.
I measured web traffic over IPsec links between VMs. The key performance 
factor was the latency of the switching/IPsec combo: if latency is above 
a certain level, the TCP window of the endpoints does not open and the 
in-between software switches become under-utilized.
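
The link between latency and TCP throughput described above can be 
sketched with the usual window/RTT bound for a single flow. This is a 
simplified model with function names of my own choosing. It ignores 
slow start, loss and congestion control, so treat it as an upper bound 
rather than a prediction.

```python
# Sketch: single-flow TCP throughput ceiling imposed by window size and RTT.


def max_tcp_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound in bits per second: at most one window per round trip."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(target_bps, rtt_seconds):
    """Bandwidth-delay product: window needed to fill the pipe at target_bps."""
    return target_bps * rtt_seconds / 8


if __name__ == "__main__":
    for rtt_ms in (0.1, 0.5, 1.0, 5.0):
        bound = max_tcp_throughput_bps(65535, rtt_ms / 1000)
        print(f"64KB window, RTT {rtt_ms} ms: at most {bound / 1e9:.2f} Gbps")
    print(f"window for 10Gbps at 1 ms: {window_needed_bytes(10e9, 0.001):.0f} bytes")
```

Every bit of latency added by the switching/IPsec path raises the window 
the endpoints need before they can load the switches in between.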

My view is that if you use a hardware-specific SDK to build your 
hardware-specific application, you will get the best out of the 
hardware. The gains can range from 30% to 100% depending on the HW, so 
it is not negligible (you may have to prove this assertion ;-). One 
major reason is the ability to use exactly the software metadata you 
need, which may shrink to a single cache line or even no software 
metadata at all, as you could leverage the hardware descriptor directly. 
The other reason is to leverage the native IO model of the device, which 
DPDK may not support. The price to pay is hardware or vendor dependence.

FF

PS1: You may want to clarify your search: you haven't stated if your 
interest is L2 switch or L3 switch, if you consider baremetal switching, 
container or VM switching.
If you want L3 then you probably want to focus on VPP, Contrail or Snabb 
rather than the low-level packet IO frameworks. With the latest Intel 
AVF technology, DPDK is almost irrelevant for VPP and actually slows 
things down with the same hardware (Intel XL710 card).
Additionally, the kernel community is working on AF_XDP, which may be 
relevant for your case.

PS2: I am not sure NANOG is the best list to discuss the technical 
details you want. That said, it may be the best place to discuss the use 
cases or realistic testbed setup.

On 04.06.2018 07:41, Kasper Adel wrote:
> Hello
> 
> I’m asked to evaluate switching platforms that has different forwarding
> chips but the same OS.
> 
> Assuming these vendors give the same SDK and similar
> documentation/support, then what would be comparison points to
> consider, other than the obvious (price, features, bps, pps).
> 
> I’m thinking, how do i validate their claims about capability to do
> leaf/spine arch, ToR/Gateways, telemetry, serviceability, facilities to
> troubleshoot packet drops or FIB programming misses, hidden tools...etc
> 
> It would be great if anyone can give some thoughts around it, specially
> if you have tried one or both.
> 
> Thanks
> Kim