dns and software, was Re: Reliable Cloud host ?

Owen DeLong owen at delong.com
Fri Mar 2 20:59:46 UTC 2012


On Mar 2, 2012, at 10:12 AM, William Herrin wrote:

> On Fri, Mar 2, 2012 at 1:03 AM, Owen DeLong <owen at delong.com> wrote:
>> On Mar 1, 2012, at 9:34 PM, William Herrin wrote:
>>> You know, when I wrote 'socket=connect("www.google.com",80,TCP);' I
>>> stopped and thought to myself, "I wonder if I should change that to
>>> 'connectbyname' instead just to make it clear that I'm not replacing
>>> the existing connect() call?" But then I thought, "No, there's a
>>> thousand ways someone determined to misunderstand what I'm saying will
>>> find to misunderstand it. To someone who wants to understand my point,
>>> this is crystal clear."
> 
> "Hyperbole." If I had remembered the word, I could have skipped the
> long description.
> 
>> I'm all for additional library functionality
>> I just don't want connect() to stop working the way it does or for getaddrinfo() to stop
>> working the way it does.
> 
> Good. Let's move on.
> 
> 
> First question: who actually maintains the standard for the C sockets
> API these days? Is it a POSIX standard?
> 

Well, some of it seems to be documented in RFCs, but, I think what you're wanting doesn't require additions to the sockets library, per se. In fact, I think making it part of that library is a mistake. As I said, this should be a higher-level library.

For example, in Perl, you have Socket (and Socket6), but, you also have several other abstraction libraries such as Net::HTTP.

While there's no hierarchical naming scheme for the functions in libc, if you look at the source for any of the open source libc libraries out there, you'll find a definite hierarchy.

POSIX certainly controls one standard. The GNU libc maintainers control the standard for the libc that accompanies GCC to the best of my knowledge. I would suggest that is probably the best place to start since I think anything that gains acceptance there will probably filter to the others fairly quickly.

> Next, we have a set of APIs which, with sufficient caution and skill
> (which is rarely the case) it's possible to string together a
> reasonable process which starts with some kind of name in a text
> string and ends with established communication with a remote server
> for any sort of name and any sort of protocol. These APIs are complete
> but we repeatedly see certain kinds of error committed while using
> them.
> 

Right... Since these are user errors (at the developer level), I wouldn't try to fix them in the APIs. I would, instead, build more developer-proof add-on APIs on top of them.

> Is there a common set of activities an application programmer intends
> to perform 9 times out of 10 when using getaddrinfo+connect? I think
> there is, and it has the following functionality:
> 
> Create a [stream] to one of the hosts satisfying [name] + [service]
> within [timeout] and return a [socket].
> 

Seems reasonable, but it ignores UDP. If we're going to do this, I think we should target a more complete solution covering a broader range of possibilities than just the most common TCP connect scenario.
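A minimal sketch of what such a higher-level call might look like, covering only the common TCP case the thread discusses. The name connect_by_name() and its signature are hypothetical, not any existing API; the self-test at the bottom connects to a local listener so the example needs no network:

```c
/* Hypothetical sketch of the proposed call: resolve [name] + [service],
 * then attempt a TCP connection within [timeout_ms]. The function name
 * connect_by_name() and its signature are illustrative only. */
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int connect_by_name(const char *name, const char *service, int timeout_ms)
{
    struct addrinfo hints, *res, *ai;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* the common TCP case */

    if (getaddrinfo(name, service, &hints, &res) != 0)
        return -1;

    int fd = -1;
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        fcntl(fd, F_SETFL, O_NONBLOCK);    /* so connect() cannot block */
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                         /* connected immediately */
        if (errno == EINPROGRESS) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int err = 0;
            socklen_t elen = sizeof err;
            if (poll(&pfd, 1, timeout_ms) == 1 &&
                getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen) == 0 &&
                err == 0)
                break;                     /* connected within the timeout */
        }
        close(fd);
        fd = -1;                           /* try the next candidate */
    }
    freeaddrinfo(res);
    return fd;    /* connected (still non-blocking) socket, or -1 */
}

int main(void)
{
    /* Self-test against a local listener so the example is runnable anywhere. */
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(lfd, (struct sockaddr *)&sa, sizeof sa);   /* port 0: any free port */
    listen(lfd, 1);
    socklen_t salen = sizeof sa;
    getsockname(lfd, (struct sockaddr *)&sa, &salen);

    char port[16];
    snprintf(port, sizeof port, "%d", ntohs(sa.sin_port));
    int fd = connect_by_name("127.0.0.1", port, 1000);
    puts(fd >= 0 ? "ok" : "fail");
    if (fd >= 0)
        close(fd);
    close(lfd);
    return fd >= 0 ? 0 : 1;
}
```

On failure within the timeout, the loop simply moves on to the next candidate address from getaddrinfo(); that is the serialized version of the parallel scheme discussed further down the thread.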

> Does anybody disagree? Here's my reasoning:
> 
> Better than 9 times out of 10 a stream, and usually a TCP stream at
> that. Connect also designates a receiver for a connectionless protocol
> like UDP, but its use for that has always been a little peculiar since
> the protocol doesn't actually connect. And indeed, sendto() can
> designate a different receiver for each packet sent through the
> socket.
> 

Most applications using UDP that I have seen use sendto()/recvfrom() et al. Netflow data would suggest that it's less than 9 out of 10 times for TCP, but, yes, I would agree it is the most common scenario.

> Name + Service. If TCP, a hostname and a port.
> 
That would apply to UDP as well. Just the semantics of what you do once you have the filehandle are different. (and it's not really a stream, per se).

> Sometimes you want to start multiple connection attempts in parallel
> or have some not-quite-threaded process implement its own scheduler
> for dealing with multiple connections at once, but that's the
> exception. Usually the only reason for dealing with the connect() in
> non-blocking mode is that you want to implement sensible error recovery
> with timeouts.
> 

Agreed.

> And the timeout - the directive that control should be returned to the
> caller no later than X. If it would take more than X to complete, then
> fail instead.
> 

Actually, this is one thing I would like to see added to connect() and that could be done without breaking the existing API.

> 
> 
> Next item: how would this work under the hood?
> 
> Well, you have two tasks: find a list of candidate endpoints from the
> name, and establish a connection to one of them.
> 
> Find the candidates: ask all available name services in parallel
> (hosts, NIS, DNS, etc). Finished when:
> 
> 1. All services have responded negative (failure)
> 
> 2. You have a positive answer and all services which have not yet
> answered are at a lower priority (e.g. hosts answers, so you don't
> need to wait for NIS and DNS).
> 
> 3. You have a positive answer from at least one name service and 1/2
> of the requested time out has expired.
> 
> 4. The full time out has expired (failure).
> 

I think the existing getaddrinfo() does this pretty well already.

I will note that the services you listed only apply to resolving the host name. Don't forget that you might also need to resolve the service to a port number. (An application should be looking up HTTP, not assuming it is 80, for example).

Conveniently, getaddrinfo simultaneously handles both of these lookups.
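A small sketch of that double lookup, assuming the service name "http" is present in the local services database (it normally is):

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* getaddrinfo() resolves the service name to a port number (via
     * /etc/services or equivalent) in the same call that resolves the
     * host name, so the application never hard-codes "80". */
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo("localhost", "http", &hints, &res) != 0) {
        puts("lookup failed");
        return 1;
    }
    struct sockaddr_in *sin = (struct sockaddr_in *)res->ai_addr;
    printf("port %d\n", ntohs(sin->sin_port));
    freeaddrinfo(res);
    return 0;
}
```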

> Cache the knowledge somewhere along with TTLs (locally defined if the
> name service doesn't explicitly provide a TTL). This may well be the
> first of a series of connection requests for the same host. If cached
> and TTL valid knowledge was known for this name for a particular
> service, don't ask that service again.
> 

I recommend against doing this above the level of getaddrinfo(). Just call getaddrinfo() again each time you need something. If it has cached data, it will return quickly and cheaply. If it doesn't, it will still most likely be as fast as anything else you could build.

If getaddrinfo() on a particular system is not well behaved, we should seek to fix that implementation of getaddrinfo(), not write yet another replacement.

> Also need to let the app tell us to deprioritize a particular result
> later on. Why? Let's say I get an HTTP connection to a host but then
> that connection times out. If the app is managing the address list, it
> can try again to another address for the same name. We're now hiding
> that detail from the app, so we need a callback for the app to tell
> us, "when I try again, avoid giving me this answer because it didn't
> turn out to work."
> 

I would suggest that instead of making this opaque and then complicating
it with these hints when we return, we use a mechanism where we return
a pointer to a dynamically allocated result (similar to getaddrinfo) and,
if we get called again with a pointer to that structure, we know to delete the
previously connected host from the list we try next time.

When the application is done with the struct, it should free it by calling an
appropriate free function exported by this new API.

> 
> So, now we have a list of addresses with valid TTLs as of the start of
> our connection attempt. Next step: start the connection attempt.
> 
> Pick the "first" address (chosen by whatever the ordering rules are)
> and send the connection request packet and let the OS do its normal
> retry schedule. Wait one second (system or sysctl configurable) or
> until the previous connection request was either accepted or rejected,
> whichever is shorter. If not connected yet, background it, pick the
> next address and send a connection request. Repeat until one
> connection request has been issued to all possible destination
> addresses for the name.
> 
> Finished when:
> 
> 1. Any of the pending connection requests completes (others are aborted).
> 
> 2. The time out is reached (all pending requests aborted).
> 
> Once a connection is established, this should be cached alongside the
> address and its TTL so that next time around that address can be tried
> first.
> 

Seems mostly reasonable. I would consider possibly having some form of inverse exponential backoff on the initial connection attempts. Maybe wait 5 seconds for the first one before trying the second one and waiting 2 seconds, then 1 second if the third one hasn't connected, then bottoming out somewhere around 500ms for the remainder.

> 
> 
>> Since you were hell bent on calling the existing mechanisms broken rather than
>> conceding the point that the current process is not broken, but, could stand some
>> improvements in the library
> 
> I hold that if an architecture encourages a certain implementation
> mistake largely to the exclusion of correct implementations then that
> architecture is in some way broken. That error may be in a particular

I don't believe that the architecture encourages the implementation mistake.

Rather, I think the fault lies more with human behavior and our tendency not to seek a proper understanding of the theory of operation of the things we depend on before building on them. I suppose you can argue that the API should be built to avoid that, but we'll have to agree to disagree on that point. I think that low-level APIs (and this is a low-level API) have to be able to rely on the engineers who use them making the effort to understand the theory of operation. I believe the real fault here is the lack of a standardized higher-level API in some languages.

> component, but it could be that the components themselves are correct.
> There could be a missing component, or the components could be strung
> together in a way that doesn't work right. Regardless of the exact
> cause, there is an architecture level mistake which is the root cause
> of the consistently broken implementations.
> 

I suppose by your definition this constitutes a missing component. I don't see it that way. I see it as a complete and functional system for a low-level API. There are high-level APIs available. As you have noted, some better than others. A standardized well-written high-level API would, indeed, be useful. However, that does not make the low-level API broken just because it is common for poorly trained users to make improper use of it. It is common for people using hammers to hit their thumbs. This does not mean that hammers are architecturally broken or that they should be re-engineered to have elaborate thumb-protection mechanisms.

The fact that you can electrocute yourself by sticking a fork into a toaster while it is operating is likewise, not an indication that toasters are architecturally broken.

It is precisely this attitude that has significantly increased the overhead and unnecessary expense of many systems while making product liability lawyers quite wealthy.

Owen





More information about the NANOG mailing list