scaling linux-based router hardware recommendations
Jim Shankland
nanog at shankland.org
Tue Jan 27 16:31:09 UTC 2015
On 1/26/15 11:33 PM, Pavel Odintsov wrote:
> Hello!
>
> Looks like somebody wants to build a Linux soft router!) Nice idea for
> routing 10-30 Gbps. I route 5+ Gbps on a Xeon E5-2620v2 with 4
> Intel 82599 10GE cards and Debian Wheezy with kernel 3.2 (but it's a
> really terrible kernel; everyone should use a modern kernel, 3.16 or
> later, because of the buggy Linux route cache). My current processor
> load on the server is about 15%, so I could route about 15 GE on this
> Linux server.
>
>
I looked into the promise and limits of this approach pretty intensively
a few years back before abandoning the effort abruptly due to other
constraints. Underscoring what others have said: it's all about pps, not
aggregate throughput. Modern NICs can inject packets at line rate into
the kernel, and distribute them across per-processor queues, etc.
Payloads end up getting DMA-ed from NIC to RAM to NIC. There's really no
reason you shouldn't be able to push 80 Gb/s of traffic, or more,
through these boxes. As for routing protocol performance (BGP
convergence time, ability to handle multiple full tables, etc.): that's
just CPU and RAM.
The part that's hard (as in "can't be fixed without rethinking this
approach") is the per-packet routing overhead: the cost of reading the
packet header, looking up the destination in the routing table,
decrementing the TTL, and enqueueing the packet on the correct outbound
interface. At the time, I was able to convince myself that being able to
do this in 4 us, average, in the Linux kernel, was within reach. That's
not really very much time: you start asking things like "will the entire
routing table fit into the L2 cache?"
4 us to "think about" each packet comes out to 250Kpps per processor;
with 24 processors, it's 6Mpps (assuming zero concurrency/locking
overhead, which might be a little bit of an ... assumption). With
1500-byte packets, 6Mpps is 72 Gb/s of throughput -- not too shabby. But
with 40-byte packets, it's less than 2 Gb/s. Which means that your Xeon
E5-2620v2 will not cope well with a DDoS of 40-byte packets. That's not
necessarily a reason not to use this approach, depending on your
situation; but it's something to be aware of.
I ended up convincing myself that OpenFlow was the right general idea:
marry fast, dumb, and cheap switching hardware with fast, smart, and
cheap generic CPU for the complicated stuff.
My expertise, such as it ever was, is a bit stale at this point, and my
figures might be a little off. But I think the general principle
applies: think about the minimum number of x86 instructions, and the
minimum number of main memory accesses, to inspect a packet header, do a
routing table lookup, and enqueue the packet on an outbound interface. I
can't see that ever getting reduced to the point where a generic server
can handle 40-byte packets at line rate (for that matter, "line rate" is
increasing a lot faster than "speed of generic server" these days).
Jim