NANOG 40 agenda posted

Colm MacCarthaigh colm at
Mon Jun 4 07:53:42 UTC 2007

On Mon, Jun 04, 2007 at 07:29:03AM +0000, Paul Vixie wrote:
> > If you're load-balancing N nodes, and 1 node dies, the distribution hash
> > is re-calced and TCP sessions to all N are terminated simultaneously. 
> i could just say that since i'm serving mostly UDP i don't care about this,
> but then i wouldn't have a chance to say that paying the complexity and bug
> and training cost of an extra in-path powered box 24x365.24 doesn't weigh
> well against the failure rate of the load balanced servers.  somebody could
> drop an anvil on one of my servers twice a day (so, 730 times per year) and
> i would still come out ahead, given that most TCP traffic comes from web
> browsers and many users will click "Reload" before giving up.

It depends on the lifetime of those TCP connections. If you were
load-balancing the increasingly common video-over-HTTP, it would be
entirely unacceptable. You also ignore the "thundering herd" problem
that arises when all of your active clients suddenly re-request within
a very short time-window like that.

If I have 1000 active flows that last 10 seconds each, the steady-state
arrival rate is about 100 new flows per second, so I can expect a peak
rate of about 200. Kill them all in one go and I can expect a peak rate
of 5 times that. That's a significant difference to plan for, and very
different from the load you expect after an extended outage or initial
switch-on. The problem also gets worse the longer the TCP sockets live.
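To make that concrete, here's a back-of-the-envelope sketch of the
reconnect spike versus the steady-state arrival rate. The 1-second
retry window is an assumption for illustration, not a measurement:

```python
# Back-of-the-envelope sketch of the "thundering herd" spike described
# above. With 1000 flows of ~10 s each, roughly flows/lifetime = 100
# new flows arrive per second in steady state. If every active flow is
# killed at once and clients retry within ~1 second (assumed), the
# reconnect burst dwarfs the steady-state rate.

ACTIVE_FLOWS = 1000
FLOW_LIFETIME_S = 10
RETRY_WINDOW_S = 1  # assumed client retry window after a mass reset

steady_state_rate = ACTIVE_FLOWS / FLOW_LIFETIME_S   # 100 new flows/s
herd_rate = ACTIVE_FLOWS / RETRY_WINDOW_S            # 1000 new flows/s

print(f"steady state:     {steady_state_rate:.0f} new flows/s")
print(f"after mass reset: {herd_rate:.0f} new flows/s "
      f"({herd_rate / steady_state_rate:.0f}x steady state)")
```

Against the ~200/s peak above, that 1000/s burst is the 5x figure.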

>  then there's CEF which i think keeps existing flows stable even
>  through an OSPF recalc.

No CEF table I've used does that. Also, if you restrict yourself to CEF,
you have to accept a reduction in the number of nodes you can balance
versus something like Quagga on *nix. The limits are anywhere from just
6 ECMP routes to 32 (though of course you could do staggered
load-balancing using multiple CEF devices). I'm open to correction on
the 32, but it's the highest I've yet come across.

The routes get distributed across the slots of the CEF table as evenly
as possible, but when they disappear the hashing completely changes (at
least it does for me operationally, as far as I can tell from
"show ip cef").
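To illustrate the effect, here's a generic hash-mod-N sketch (not
Cisco's actual CEF hashing algorithm) of why withdrawing a single
next-hop disturbs almost every flow:

```python
# Illustrative sketch of why removing one next-hop from a
# hash-distributed table remaps nearly every flow: with a simple
# hash-mod-N scheme, changing N changes the bucket of most keys.
# This is NOT the real CEF algorithm, just the same failure mode.

import hashlib

def bucket(flow, nodes):
    """Map a flow 5-tuple string to one of `nodes` via hash mod N."""
    h = int(hashlib.sha256(flow.encode()).hexdigest(), 16)
    return h % nodes

flows = [f"10.0.0.{i}:5000->192.0.2.1:80" for i in range(256)]

before = {f: bucket(f, 8) for f in flows}   # 8 healthy next-hops
after  = {f: bucket(f, 7) for f in flows}   # one next-hop withdrawn

moved = sum(1 for f in flows if before[f] != after[f])
print(f"{moved}/{len(flows)} flows remapped to a different node")
```

With uniform hashing you'd expect roughly 7/8 of the flows to land on a
different node, i.e. nearly every established TCP session breaks, even
those that were going to nodes that never failed.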

Interestingly, there is a CEF table state that /could/ enable this
functionality: the "punt" state promises to have an unswitchable packet
punted out of the CEF table, falling back to higher-level software
switching. If the CEF slots occupied by a now-down node could be forced
into the punt state, then only traffic toward that node would be
affected. But despite questions to Cisco dev teams and much
experimentation, I can't see a reliable way to get a CEF table entry
into the punt state (unlike, say, the "glean" state, which isn't
suitable here).

>  finally, there's the fact that we see less than one server failure
>  per month among the 100 or so servers we've deployed behind OSPF
>  ECMP.

Failure rates can and should be low indeed, but that's not where
I see the primary utility of high-availability load-balancers. If
I have 20 web-servers in a load-balanced cluster and I need to 
upgrade them to the latest version of Apache for security reasons,
I want to do it one by one without losing a single HTTP session. 

This *is* possible with many load-balancers (plug: including Apache's
own load-balancing proxy), but with OSPF I'm forced to drop *all*
sessions to the cluster 20 times (or, yes, I could do 10 nodes at a
time, but you get the picture).
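For comparison, the drain-then-upgrade loop I'm describing looks
something like this. The Backend class and upgrade hook are
hypothetical stand-ins, not any real load-balancer's API:

```python
# Minimal sketch of a zero-drop rolling upgrade: drain one backend at
# a time (stop assigning it new sessions, let in-flight sessions
# finish), upgrade it, then put it back into rotation. The Backend
# class and timings are illustrative only.

import time

class Backend:
    def __init__(self, name):
        self.name = name
        self.accepting = True    # eligible for new sessions
        self.active_sessions = 0 # in-flight HTTP sessions

def rolling_upgrade(backends, upgrade, drain_poll_s=0.01):
    for b in backends:
        b.accepting = False               # drain: no new sessions
        while b.active_sessions > 0:      # wait for in-flight work
            time.sleep(drain_poll_s)
        upgrade(b)                        # e.g. upgrade Apache here
        b.accepting = True                # back into rotation

backends = [Backend(f"web{i}") for i in range(20)]
upgraded = []
rolling_upgrade(backends, lambda b: upgraded.append(b.name))
print(f"upgraded {len(upgraded)} nodes without dropping a session")
```

With hash-based ECMP there is no equivalent of the drain step: pulling
a route out of the table rehashes everyone, so every node's maintenance
window becomes everyone's reset.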

I *like* OSPF ECMP load-balancing, it's *great*, and I use it in
production, even load-balancing a tonne of HTTPS traffic, but in my
opinion you are over-stating its abilities. It is not close to the
capabilities of a good intelligent load-balancer. It is, however,
extremely cost-effective and good enough for a lot of uses, as long as
it's applied with some operational and engineering care.

Colm MacCárthaigh                        Public Key: colm+pgp at
