dns and software, was Re: Reliable Cloud host ?

Owen DeLong owen at delong.com
Wed Feb 29 18:01:12 UTC 2012

On Feb 29, 2012, at 6:18 AM, William Herrin wrote:

> On Wed, Feb 29, 2012 at 7:57 AM, Joe Greco <jgreco at ns.sol.net> wrote:
>>> In message <CAP-guGXK3WQGPLpmnVsnM0xnnU8==4zONK=UWTLkYWuduA6T9Q at mail.gmail.com>,
>>>  William Herrin writes:
>>>> On Tue, Feb 28, 2012 at 4:06 PM, Mark Andrews <marka at isc.org> wrote:
>>>>> DNS TTL works.  Applications that don't honour it aren't an indication that
>>>>> it doesn't work.
>>>> Mark,
>>>> If three people died and the building burned down then the sprinkler
>>>> system didn't work. It may have sprayed water, but it didn't *work*.
>>> Not enough evidence to say if it worked or not.  Sprinkler systems
>>> are designed to handle particular classes of fire, not every fire.
>> It is also worth noting that many fire systems are not intended to
>> put out the fire, but to provide warning and then provide an extended
>> window for people to exit the affected building through use of sprinklers
>> and other measures to slow the spread of the fire.
> Hi Joe,
> The sprinkler system is designed to delay the fire long enough for
> everyone to safely escape. As a secondary objective, it reduces the
> fire damage that occurs while waiting for firefighters to arrive and
> extinguish the fire. If "three people died" then the system failed.
> Perhaps the design was inadequate. Perhaps some age-related issue
> prevented the sprinkler heads from melting. Perhaps someone stacked
> boxes to the ceiling and it blocked the water. Perhaps the water was
> shut off and nobody knew it. Perhaps an initial explosion damaged the
> sprinkler system so it could no longer work effectively. Whatever the
> exact details, that sprinkler system failed.

Bill, you are blaming the sprinkler system for what could, in fact, be a
failure not of the sprinkler system, but of the three humans.

If they were too intoxicated or stoned to react, for example, the sprinkler
system is not to blame. If they were overcome by smoke before the
sprinklers went off, that may be a failure of the smoke detectors, but, it
is not a failure of the sprinklers. If they were killed or rendered unconscious
and/or unresponsive in the preceding explosion you mentioned and did
not die in the subsequent fire, then that is not a failure of the sprinkler
system.

> Whoever you want to blame, DNS TTL dysfunction at the application
> level is the same way. It's a failed system. With the TTL on an A
> record set to 60 seconds, you can't change the address attached to the
> A record and expect that 60 seconds later no one will continue to
> connect to the old address. Nor 600 seconds later nor 6000 seconds
> later. The "system" for renumbering a service of which the TTL setting
> is a part consistently fails to reliably function in that manner.

Yes, the assumption by developers that gni/gai (getnameinfo()/getaddrinfo())
is a fire-and-forget mechanism and that the data received is static is a
failure. It is not a failure of DNS TTL. It is a failure of the application
developers who code that way. Further analysis of the underlying causes of
that failure
to properly understand name resolution technology and the environment
in which it operates is left as an exercise for the reader.

The fact that people playing interesting games with DNS TTLs don't
necessarily understand the situation, or document it well enough to raise
awareness among application developers, could also be argued to be a
failure on the part of those people.

It is not, in either case, a failure of the technology.

One should always call gni/gai in close temporal (and ideally close
in the code as well) proximity to calling connect(). Obviously one
should call these resolver functions prior to calling connect().
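
As a rough illustration of that advice, here is a minimal sketch in Python,
where socket.getaddrinfo() wraps the C getaddrinfo() call. The function name
connect_fresh and the timeout value are assumptions made for the example,
not anything from this thread. The point is only that resolution happens
immediately before connect(), every time, rather than once at startup.

```python
import socket


def connect_fresh(host, port, timeout=5.0):
    """Resolve host right before connecting and try each result in order."""
    last_error = None
    # Resolve here, at connect time, so the current records are used.
    results = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in results:
        try:
            sock = socket.socket(family, socktype, proto)
        except OSError as exc:
            # Address family not usable on this machine; try the next result.
            last_error = exc
            continue
        try:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            last_error = exc
            sock.close()
    raise last_error or OSError("no addresses returned for %r" % (host,))
```

A caller that needs a new connection later simply calls connect_fresh()
again instead of keeping the old address around.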

Most example code is designed for short-lived non-recovering flows,
so, it's designed along the lines of resolve->(iterate through results
calling connect() for each result until connect() succeeds)->process->

Examples for persistent connections and/or connections that recover
or re-establish after a failure and/or browsers that stay running for a
long time and connect to the same system again significantly later
are few and far between. As a result, most code doing that ends up
being poorly written.
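
Since such examples are scarce, here is one possible sketch of a
long-lived client that re-resolves on every reconnect. The class name
ReconnectingClient, the retry count, and the delay are all hypothetical
choices for illustration. It relies on socket.create_connection(), which
calls getaddrinfo() and walks the results each time it is invoked, so the
object only ever remembers the name and never a stale address.

```python
import socket
import time


class ReconnectingClient:
    """Hypothetical helper: stores the host name, never a resolved address."""

    def __init__(self, host, port, retry_delay=1.0):
        self.host = host
        self.port = port
        self.retry_delay = retry_delay
        self.sock = None

    def _connect(self):
        # create_connection() performs a fresh getaddrinfo() lookup and
        # tries each returned address until one connects.
        self.sock = socket.create_connection((self.host, self.port), timeout=5.0)

    def send(self, data, attempts=3):
        for attempt in range(attempts):
            try:
                if self.sock is None:
                    self._connect()
                self.sock.sendall(data)
                return
            except OSError:
                # Drop the broken socket; the next attempt resolves again.
                self.close()
                if attempt + 1 < attempts:
                    time.sleep(self.retry_delay)
        raise ConnectionError("could not send to %s:%d" % (self.host, self.port))

    def close(self):
        if self.sock is not None:
            self.sock.close()
            self.sock = None
```

Note that the system resolver or a local caching daemon may still cache
answers; the sketch only ensures the application itself does not pin an
address for the life of the process.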

Further, DNS performance issues in the past have led developers of
such applications to "take matters into their own hands" to try and
improve the performance/behavior of their application in spite of
DNS. This is one of the things that led to many of the TTL-ignorant
application-level DNS caches that you are complaining about.

Again, not a failure of DNS technology, but, of the operators of that
technology and the developers that tried to compensate for those
failures. They introduced a cure that is often worse than the disease.
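
For contrast, an application-level cache does not have to ignore the TTL.
The sketch below is a hypothetical illustration only: getaddrinfo() does not
expose TTLs, so it assumes a lookup callable (an invented interface, not a
real library API) that returns both the addresses and the record's TTL in
seconds, as a stub-resolver library might. The class name TtlCache and the
injectable clock are likewise assumptions.

```python
import time


class TtlCache:
    """Hypothetical cache that never serves an answer past its TTL."""

    def __init__(self, lookup, clock=time.monotonic):
        # lookup is an assumed callable: name -> (addresses, ttl_seconds)
        self._lookup = lookup
        self._clock = clock
        self._entries = {}  # name -> (addresses, expires_at)

    def resolve(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            # Still within the TTL the zone operator asked for.
            return entry[0]
        # Missing or expired: ask again and remember the new expiry time.
        addresses, ttl = self._lookup(name)
        self._entries[name] = (addresses, now + ttl)
        return addresses
```

With a 60-second TTL, an entry stored at time 0 is served until just before
time 60 and is looked up again from time 60 onward.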


> Regards,
> Bill Herrin
> -- 
> William D. Herrin ................ herrin at dirtside.com  bill at herrin.us
> 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/>
> Falls Church, VA 22042-3004
