Server Redundancy

Joe Abley jabley at isc.org
Thu Aug 7 16:35:35 UTC 2003



On Thursday, 7 August 2003, at 07:28AM, Rob Pickering wrote:

> Then you've just got your BGP convergence time and unequal load 
> balancing effects to worry about.
>
> Whilst I'm not knocking Paul's solution in an application like running 
> a root NS for which it is perfect, I'm not so sure it's necessarily 
> best for every kind of service load balancing.

We're using the technique Paul used in local clusters with OSPF; the 
convergence time in a single OSPF area which contains only a small 
number of servers and a couple of routers is pretty small. 
There's no BGP convergence issue in this application (there's no BGP 
within the server cluster).

We're using another anycast technique in the wide area, using BGP to 
advertise covering supernets for services which are offered 
autonomously in multiple locations. BGP is involved in this one, but we 
are mitigating the potential for flap damage or transient convergence 
loops by offering service from remote nodes to a local community only, 
and not the whole Internet (i.e. the service supernet is offered as a 
peering route, with restricted propagation, and not for global 
transit).

The general approach we're taking with the wide-area, global service 
distribution technique is described here:

   http://www.isc.org/tn/isc-tn-2003-1.html
   http://www.isc.org/tn/isc-tn-2003-1.txt

> I've used both the route hack based and commercial NAT load balancers, 
> and they both have their place.

It's not really that much of a hack; it's just anycast over an IGP 
coupled with routers which can populate the FIB with multiple 
equal-cost routes with different next-hops, with some manner of flow 
hash to keep traffic from a single session pointing at the same 
server.
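
As a rough illustration of the flow-hash idea, here is a minimal sketch 
in Python. It is not how any particular router implements it (real 
hardware uses vendor-specific hash functions), and the function name, 
addresses and server list are all made up for the example. The point is 
only that hashing the flow's 5-tuple picks the same next-hop for every 
packet in a session, for as long as the set of equal-cost routes does 
not change.

```python
import hashlib

def pick_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    """Pick one of several equal-cost next-hops for a flow (hypothetical helper).

    The choice depends only on the 5-tuple, so every packet of the same
    session is forwarded to the same server.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the hash to an index into the list of equal-cost next-hops.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

# Assumed example: three servers all announcing the same service address.
servers = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

first = pick_next_hop("192.0.2.7", "198.51.100.53", 17, 40000, 53, servers)
again = pick_next_hop("192.0.2.7", "198.51.100.53", 17, 40000, 53, servers)
assert first == again  # same flow, same server
```

Note that with a simple modulo like this, adding or removing a next-hop 
can move existing flows to a different server, which matters more for 
long-lived TCP sessions than for single-packet DNS queries.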

> If you are running complex web services (think expensive per server sw 
> licences etc) then the investment in a pair of redundant load 
> balancers for the front end to give more consistent performance under 
> load as well as resilience can look very sane indeed.

I've deployed services behind Foundry 
layer-4/layer-7/content/SLB/buzzword-du-jour switches before, and they 
worked very well; from the brief time I spent with them, they seemed 
well-designed and feature rich.

However, the Foundry switches still suffered from the (near) single 
point of failure problem. It only takes one person to mess up the 
switch config whilst modifying a service or adding a new one, or a 
firmware upgrade that goes bad, and you lose all your services at once.

As Paul mentioned, the advantage of using local-scope anycast with an 
IGP to build a cluster is that there are no additional components, and 
hence no additional points of failure.


Joe



