ISC DHCP server failover

Sat Mar 20 00:10:04 UTC 2010

David W. Hankins wrote:
> On Wed, Mar 17, 2010 at 09:22:06AM -0500, Dan White wrote:
>   
>>   The servers stop balancing their addresses, and one server starts to
>> exhibit 'peer holds all free leases' in its logs, in which case we need to
>> restart the dhcpd process(es) to force a rebalance.
>>     
>
> If restarting one or both dhcpd processes corrects a pool balancing
> problem, then I suspect what you're looking at is a bug where the
> servers would fail to schedule a reconnection if the failover socket
> is lost in a particular way.  Because the protocol also uses a message
> exchange inside the TCP channel to determine if the socket is up
> (rather than just TCP keepalives) this can sometimes happen even
> without a network outage during load spikes or other brief hiccups on
>   
<long explanation snipped>

With all due respect, and with acknowledgment of the tremendous
contributions of ISC and of you yourself, Mr. Hankins, I have to say
that failover in isc-dhcp is broken by design: by your own lengthy
explanation, it requires handholding and operator judgment whenever a
failure occurs. Failure needs to be handled automatically, without any
intervention at all; otherwise you might as well not have failover, and
I think most network operators would agree.

I am certainly not prepared to develop proof-of-concept code or to go
the full route of developing such a server myself. However, I firmly
believe that DHCP failover could be designed, as a counterpoint to the
current implementation, to be reliable, simple and scalable, and to
require no special procedures once a 'break' occurs. The method used by
isc-dhcpd, I think, makes failover potentially unreliable because it is
not designed for the 'right' problem. There are existing designs, such
as VRRP/CARP, that could form the basis of a trustworthy DHCP failover
protocol.

The key issues are a) broadcast discovery packets, which every
listening host on the LAN segment (such as one or more slaves) can
easily respond to, and b) unicast frames from relay agents and others,
which could easily be handled by a group of slaves sharing a virtual
MAC and IP address. This means that redundancy across more than two
hosts is already possible. The last pieces are a protocol for servers
to join and leave the pool of hosts serving DHCP; a master election
protocol that pre-determines the order of slaves to fail over to, in
order to avoid the split-brain syndrome; a sanity-checking protocol to
ensure the elected master is sane and kicking (e.g. the slaves all hit
the master with, what else, DHCP requests); and a well-defined group
database update protocol over the network, so that leases hit some
fixed storage somewhere, sometime.
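
Purely to illustrate the pre-determined failover order (this is not a
DHCP server, and none of these names come from isc-dhcp), here is a
minimal Python sketch. It assumes each server carries a configured
priority where the higher value wins, loosely in the spirit of VRRP, and
that ties are broken by name; the Server class, failover_order() and
elect_master() are all hypothetical. The set of "alive" servers stands
in for whatever the sanity-checking protocol reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    priority: int  # assumption: higher value wins


def failover_order(servers):
    """Return the servers in the fixed order in which they take over.

    Sorting by (descending priority, name) gives every member the same
    answer from the same configuration, so no negotiation is needed.
    """
    return sorted(servers, key=lambda s: (-s.priority, s.name))


def elect_master(servers, alive):
    """Return the first server in the fixed order that is alive, else None."""
    for server in failover_order(servers):
        if server.name in alive:
            return server
    return None


if __name__ == "__main__":
    pool = [Server("dhcp-a", 100), Server("dhcp-b", 200), Server("dhcp-c", 150)]
    print([s.name for s in failover_order(pool)])
    # dhcp-b has failed its health check, so the next in line takes over.
    print(elect_master(pool, alive={"dhcp-a", "dhcp-c"}).name)
```

Note that a fixed order alone does not settle a network partition: each
side would still pick its own master from the servers it can see, which
is why the join/leave and sanity-checking pieces matter as well.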

Just my $0.02 worth.

Mike-