scaling linux-based router hardware recommendations

Jim Shankland nanog at shankland.org
Tue Jan 27 16:31:09 UTC 2015


On 1/26/15 11:33 PM, Pavel Odintsov wrote:
> Hello!
>
> Looks like somebody wants to build a Linux soft router!) Nice idea for
> routing 10-30 Gbps. I route about 5+ Gbps on a Xeon E5-2620v2 with 4
> 10GE Intel 82599 cards and Debian Wheezy 3.2 (but that's a really
> terrible kernel; everyone should use a modern kernel, 3.16 or later,
> because of the buggy Linux route cache). My current processor load on
> the server is about 15%, thus I could route about 15 GE on my Linux server.
>
>
I looked into the promise and limits of this approach pretty intensively 
a few years back before abandoning the effort abruptly due to other 
constraints. Underscoring what others have said: it's all about pps, not 
aggregate throughput. Modern NICs can inject packets at line rate into 
the kernel, and distribute them across per-processor queues, etc. 
Payloads end up getting DMA-ed from NIC to RAM to NIC. There's really no 
reason you shouldn't be able to push 80 Gb/s of traffic, or more, 
through these boxes. As for routing protocol performance (BGP 
convergence time, ability to handle multiple full tables, etc.): that's 
just CPU and RAM.

The part that's hard (as in "can't be fixed without rethinking this 
approach") is the per-packet routing overhead: the cost of reading the 
packet header, looking up the destination in the routing table, 
decrementing the TTL, and enqueueing the packet on the correct outbound 
interface. At the time, I was able to convince myself that being able to 
do this in 4 us, average, in the Linux kernel, was within reach. That's 
not really very much time: you start asking things like "will the entire 
routing table fit into the L2 cache?"
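That per-packet fast path can be sketched in a few lines of C. Everything below is a stand-in, not kernel code: the FIB lookup is a stub where a real router walks a trie, enqueueing is a no-op, and the checksum is recomputed from scratch where real kernels update it incrementally (RFC 1624).

```c
/* Sketch of the per-packet fast path described above: read the header,
 * look up the destination, decrement the TTL, enqueue.  All helpers are
 * hypothetical stand-ins, not Linux kernel APIs. */
#include <stddef.h>
#include <stdint.h>

static int last_queue = -1;   /* records where the last packet went */

/* One's-complement checksum over an IPv4 header (big-endian 16-bit words). */
static uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Stub longest-prefix match: always picks outbound queue 0. */
static int fib_lookup(uint32_t daddr) { (void)daddr; return 0; }

static void enqueue(int queue, uint8_t *pkt, size_t len)
{
    (void)pkt; (void)len;
    last_queue = queue;
}

/* Forward one packet; ip points at a 20-byte IPv4 header. */
static int forward(uint8_t *ip, size_t len)
{
    if (ip[8] <= 1)
        return -1;            /* TTL expired: would send ICMP Time Exceeded */
    ip[8]--;                  /* decrement TTL (header offset 8) */
    ip[10] = ip[11] = 0;      /* zero, then recompute, the header checksum */
    uint16_t c = ip_checksum(ip, 20);
    ip[10] = c >> 8;
    ip[11] = c & 0xff;
    uint32_t daddr = ((uint32_t)ip[16] << 24) | ((uint32_t)ip[17] << 16) |
                     ((uint32_t)ip[18] << 8)  |  (uint32_t)ip[19];
    enqueue(fib_lookup(daddr), ip, len);
    return 0;
}
```

Note that every step touches memory; whether the routing table lookup hits L2 cache or goes to main memory is what makes or breaks the 4 us budget.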

4 us to "think about" each packet comes out to 250Kpps per processor; 
with 24 processors, it's 6Mpps (assuming zero concurrency/locking 
overhead, which might be a little bit of an ... assumption). With 
1500-byte packets, 6Mpps is 72 Gb/s of throughput -- not too shabby. But 
with 40-byte packets, it's less than 2 Gb/s. Which means that your Xeon 
E5-2620v2 will not cope well with a DDoS of 40-byte packets. That's not 
necessarily a reason not to use this approach, depending on your 
situation; but it's something to be aware of.

I ended up convincing myself that OpenFlow was the right general idea: 
marry fast, dumb, and cheap switching hardware with fast, smart, and 
cheap generic CPU for the complicated stuff.

My expertise, such as it ever was, is a bit stale at this point, and my 
figures might be a little off. But I think the general principle 
applies: think about the minimum number of x86 instructions, and the 
minimum number of main memory accesses, to inspect a packet header, do a 
routing table lookup, and enqueue the packet on an outbound interface. I 
can't see that ever getting reduced to the point where a generic server 
can handle 40-byte packets at line rate (for that matter, "line rate" is 
increasing a lot faster than "speed of generic server" these days).

Jim
