ultradns reachability

Fri Jul 2 14:22:09 UTC 2004

On 2 Jul 2004, at 00:18, Christopher L. Morrow wrote:

> So, I thought of it like this:
> 1) Rodney/Centergate/UltraDNS knows where all their 35000billion copies
> of the 2 .org TLD boxes are, what network pieces they are connected to
> at which bandwidths and the current utilization
> 2) Rodney/Centergate/UltraDNS knows which boxes in each location (there
> could be multiple inside each pod, right?) are running their dns
> process and answering at which rates
> 3) Rodney/Centergate/UltraDNS knows when processes die and locally stop
> pushing requests to said system inside the pod
> 4) Rodney/Centergate/UltraDNS knows when a pod is completely down (no
> systems responding inside the local pod) so they can stop routing the
> /24 from that pod's location
>
> So, Rodney/Centergate/UltraDNS should know almost exactly when they
> have a problem they can term 'critical'... I most probably left out
> some steps above, like wedged processes or loss of outbound routing to
> prefixes sending requests. I'm sure Paul/ISC has a fairly complete list
> of failure modes for anycast DNS services.
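
As a rough illustration of steps 3 and 4 in the quoted list, here is a 
minimal sketch. The function name and the health map are made up for 
this example; it is not a description of what UltraDNS actually runs.

```python
def pod_actions(pod_health):
    """Decide what a single pod should do, given probe results.

    pod_health maps a (hypothetical) server name to True if its DNS
    process currently answers probes, False otherwise.
    Returns (servers kept in the local pool, whether to announce the /24).
    """
    # Step 3: stop pushing requests to servers whose process has died.
    in_pool = sorted(name for name, ok in pod_health.items() if ok)
    # Step 4: if nothing in the pod answers, stop routing the /24 from here.
    announce = bool(in_pool)
    return in_pool, announce


print(pod_actions({"ns-a": True, "ns-b": False}))
print(pod_actions({"ns-a": False, "ns-b": False}))
```

Note that this only covers failures the pod can see from the inside; 
wedged processes that still pass a probe, or broken routing outside 
the pod, are not detected by a check like this.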

All the failure modes that ISC has seen with anycast nameserver 
instances can be avoided (for the authoritative DNS service as a whole) 
by including one or more non-anycast nameservers in the NS set.

This leaves the anycast servers providing all the optimisation that 
they are good for (local nameserver in topologically distant networks; 
distributed DDoS traffic sink; reduced transaction RTT) and provides a 
fall-back in case of effective reachability problems for the anycast 
nameservers.
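
A toy model of the fall-back, to make the point concrete. The server 
names and the "reachable" set are invented for the example, and a real 
resolver's server selection is more involved than a simple loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameServer:
    name: str      # hypothetical host name, for illustration only
    anycast: bool  # True if the address is announced from many pods


def first_answering(ns_set, reachable):
    """Return the name of the first server in the NS set that answers.

    `reachable` is the set of server names that work from this
    particular vantage point. Returns None if the zone is dead from here.
    """
    for ns in ns_set:
        if ns.name in reachable:
            return ns.name
    return None


anycast_only = [
    NameServer("anycast1.example", True),
    NameServer("anycast2.example", True),
]
mixed = anycast_only + [NameServer("unicast1.example", False)]

# Assume both anycast prefixes are routed to a broken pod from here,
# while the unicast server is reached by ordinary routing.
reachable_here = {"unicast1.example"}

print(first_answering(anycast_only, reachable_here))
print(first_answering(mixed, reachable_here))
```

With the anycast-only NS set the zone is unreachable from this vantage 
point; with one unicast server added, queries still get an answer.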

This is so trivial, I continue to be amazed that PIR hasn't done it.

> The problem then becomes the "Hey, .org is dead!" From where is it
> dead? What pod are you seeing it dead from? Is it routing TO the pod
> from you? FROM the pod to you? The pod itself? Stuck/stale routing
> information somewhere on the path(s)? This is very complex, or seems to
> be to me :(

With the fix above, the problem becomes "hey, *some* of the nameservers 
for ORG are dead! We should fix that, but since not *all* of them are 
dead, at least ORG still works."

> I think more failure modes will be investigated before that comes :)
> fortunately lots of people are already investigating these, eh?

I don't know about lots, but I know of a few. None of the people I know 
of are using an entire production TLD as their test-bed, however.

Joe