[150744] in North American Network Operators' Group


Re: dns and software, was Re: Reliable Cloud host ?

daemon@ATHENA.MIT.EDU (Owen DeLong)
Fri Mar 2 16:02:38 2012

From: Owen DeLong <owen@delong.com>
In-Reply-To: <CAP-guGVNu9kjY-8SRv=Ag01hJVPs4=i6gh4q2P1W9rczyktXbw@mail.gmail.com>
Date: Fri, 2 Mar 2012 12:59:46 -0800
To: William Herrin <bill@herrin.us>
Cc: Nanog <nanog@nanog.org>
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org


On Mar 2, 2012, at 10:12 AM, William Herrin wrote:

> On Fri, Mar 2, 2012 at 1:03 AM, Owen DeLong <owen@delong.com> wrote:
>> On Mar 1, 2012, at 9:34 PM, William Herrin wrote:
>>> You know, when I wrote 'socket=connect("www.google.com",80,TCP);' I
>>> stopped and thought to myself, "I wonder if I should change that to
>>> 'connectbyname' instead just to make it clear that I'm not replacing
>>> the existing connect() call?" But then I thought, "No, there's a
>>> thousand ways someone determined to misunderstand what I'm saying will
>>> find to misunderstand it. To someone who wants to understand my point,
>>> this is crystal clear."
>
> "Hyperbole." If I had remembered the word, I could have skipped the
> long description.
>
>> I'm all for additional library functionality
>> I just don't want connect() to stop working the way it does or for
>> getaddrinfo() to stop working the way it does.
>
> Good. Let's move on.
>
>
> First question: who actually maintains the standard for the C sockets
> API these days? Is it a POSIX standard?
>

Well, some of it seems to be documented in RFCs, but I think what you're
wanting doesn't require additions to the sockets library, per se. In
fact, I think wanting to make it part of that is a mistake. As I said,
this should be a higher-level library.

For example, in Perl, you have Socket (and Socket6), but you also have
several other abstraction libraries such as Net::HTTP.

While there's no hierarchical naming scheme for the functions in libc,
if you look at the source for any of the open source libc libraries out
there, you'll find a definite hierarchy.

POSIX certainly controls one standard. To the best of my knowledge, the
GNU libc maintainers control the standard for the libc that accompanies
GCC. I would suggest that is probably the best place to start, since I
think anything that gains acceptance there will probably filter to the
others fairly quickly.

> Next, we have a set of APIs with which, given sufficient caution and
> skill (which is rarely the case), it's possible to string together a
> reasonable process which starts with some kind of name in a text
> string and ends with established communication with a remote server
> for any sort of name and any sort of protocol. These APIs are complete
> but we repeatedly see certain kinds of error committed while using
> them.
>

Right... Since these are user errors (at the developer level), I
wouldn't try to fix them in the APIs. I would, instead, build more
developer-proof add-on APIs on top of them.

> Is there a common set of activities an application programmer intends
> to perform 9 times out of 10 when using getaddrinfo+connect? I think
> there is, and it has the following functionality:
>
> Create a [stream] to one of the hosts satisfying [name] + [service]
> within [timeout] and return a [socket].
>

Seems reasonable, but it ignores UDP. If we're going to do this, I think
we should target a more complete solution that covers a broader range of
possibilities than just the most common TCP connect scenario.
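
A minimal sketch of such an add-on call, layered on getaddrinfo() and
connect() rather than replacing either. The name connect_by_name and its
parameters are assumptions made up for illustration; nothing like it
exists in libc. Taking the socket type as a parameter lets the same call
cover UDP as well as TCP. The timeout and parallel attempts are left out
to keep it short.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of any standard library.
 * Resolves host + service and returns a socket connected to the first
 * candidate address that accepts, or -1 if none do. socktype is
 * SOCK_STREAM or SOCK_DGRAM, so UDP is covered as well as TCP. */
int connect_by_name(const char *host, const char *service, int socktype)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;     /* IPv4 or IPv6, whatever resolves */
    hints.ai_socktype = socktype;

    if (getaddrinfo(host, service, &hints, &res) != 0)
        return -1;

    /* Try each candidate in the order getaddrinfo() returned them. */
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                   /* connected, keep this socket */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}
```

A caller would write something like
connect_by_name("www.example.com", "http", SOCK_STREAM) and close() the
returned descriptor when done.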

> Does anybody disagree? Here's my reasoning:
>
> Better than 9 times out of 10 a stream and usually a TCP stream at
> that. Connect also designates a receiver for a connectionless protocol
> like UDP, but its use for that has always been a little peculiar since
> the protocol doesn't actually connect. And indeed, sendto() can
> designate a different receiver for each packet sent through the
> socket.
>

Most applications using UDP that I have seen use sendto()/recvfrom() et
al. Netflow data would suggest that TCP accounts for less than 9 times
out of 10, but yes, I would agree it is the most common scenario.

> Name + Service. If TCP, a hostname and a port.
>
That would apply to UDP as well. Just the semantics of what you do once
you have the filehandle are different (and it's not really a stream,
per se).
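
To illustrate the UDP case, here is a rough sketch that resolves the
same name + service pair but then uses sendto() instead of connect().
The helper name send_datagram is an assumption for illustration only.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: resolve host + service for UDP and send one
 * datagram with sendto(), never calling connect().
 * Returns the number of bytes sent, or -1 on failure. */
ssize_t send_datagram(const char *host, const char *service,
                      const void *buf, size_t len)
{
    struct addrinfo hints, *res, *ai;
    ssize_t sent = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_DGRAM;  /* same lookup, different socket type */

    if (getaddrinfo(host, service, &hints, &res) != 0)
        return -1;

    /* Stop at the first candidate address that accepts the datagram. */
    for (ai = res; ai != NULL && sent < 0; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        /* The receiver is named per packet; nothing is "connected". */
        sent = sendto(fd, buf, len, 0, ai->ai_addr, ai->ai_addrlen);
        close(fd);
    }
    freeaddrinfo(res);
    return sent;
}
```

A successful return only means the datagram was handed to the local
stack; whether anyone received it is up to the application protocol.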

> Sometimes you want to start multiple connection attempts in parallel
> or have some not-quite-threaded process implement its own scheduler
> for dealing with multiple connections at once, but that's the
> exception. Usually the only reason for dealing with the connect() in
> non-blocking mode is that you want to implement sensible error recovery
> with timeouts.
>

Agreed.

> And the timeout - the direction that control should be returned to the
> caller no later than X. If it would take more than X to complete, then
> fail instead.
>

Actually, this is one thing I would like to see added to connect(), and
that could be done without breaking the existing API.

>
>
> Next item: how would this work under the hood?
>
> Well, you have two tasks: find a list of candidate endpoints from the
> name, and establish a connection to one of them.
>
> Find the candidates: ask all available name services in parallel
> (hosts, NIS, DNS, etc). Finished when:
>
> 1. All services have responded negative (failure)
>
> 2. You have a positive answer and all services which have not yet
> answered are at a lower priority (e.g. hosts answers, so you don't
> need to wait for NIS and DNS).
>
> 3. You have a positive answer from at least one name service and 1/2
> of the requested time out has expired.
>
> 4. The full time out has expired (failure).
>

I think the existing getaddrinfo() does this pretty well already.

I will note that the services you listed only apply to resolving the
host name. Don't forget that you might also need to resolve the service
to a port number. (An application should be looking up HTTP, not
assuming it is 80, for example.)

Conveniently, getaddrinfo simultaneously handles both of these lookups.
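
As a small illustration, the sketch below passes both the host and the
service to a single getaddrinfo() call and reads the resolved port back
out of the first result. The helper name resolved_port is an assumption
for illustration.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: resolve host and service together and return the
 * TCP port of the first result in host byte order, or -1 on failure.
 * service may be a name such as "http" or a numeric string. */
int resolved_port(const char *host, const char *service)
{
    struct addrinfo hints, *res;
    int port = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    /* One call handles both the host lookup and the service lookup. */
    if (getaddrinfo(host, service, &hints, &res) != 0)
        return -1;

    if (res->ai_family == AF_INET)
        port = ntohs(((struct sockaddr_in *)res->ai_addr)->sin_port);
    else if (res->ai_family == AF_INET6)
        port = ntohs(((struct sockaddr_in6 *)res->ai_addr)->sin6_port);

    freeaddrinfo(res);
    return port;
}
```

Passing "http" as the service relies on the local services database
(commonly /etc/services) having an entry for it.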

> Cache the knowledge somewhere along with TTLs (locally defined if the
> name service doesn't explicitly provide a TTL). This may well be the
> first of a series of connection requests for the same host. If cached
> and TTL valid knowledge was known for this name for a particular
> service, don't ask that service again.
>

I recommend against doing this above the level of getaddrinfo(). Just
call getaddrinfo() again each time you need something. If it has cached
data, it will return quickly and cheaply. If it has nothing cached, it
will most likely still answer as quickly as anything else would.

If getaddrinfo() on a particular system is not well behaved, we should
seek to fix that implementation of getaddrinfo(), not write yet another
replacement.

> Also need to let the app tell us to deprioritize a particular result
> later on. Why? Let's say I get an HTTP connection to a host but then
> that connection times out. If the app is managing the address list, it
> can try again to another address for the same name. We're now hiding
> that detail from the app, so we need a callback for the app to tell
> us, "when I try again, avoid giving me this answer because it didn't
> turn out to work."
>

I would suggest that instead of making this opaque and then complicating
it with these hints, we use a mechanism where we return a pointer to a
dynamically allocated result (similar to getaddrinfo). If we get called
again with a pointer to that structure, we know to delete the previously
connected host from the list we try next time.

When the application is done with the struct, it should free it by
calling an appropriate free function exported by this new API.
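
One possible shape for that, as a sketch only: every name here (the
cbn_ prefix, the struct layout, the function signatures) is an
assumption for illustration, and the timeout handling is omitted. The
handle keeps the getaddrinfo() result plus a cursor, so calling again
with the same handle skips the address that was handed out last time.

```c
#include <netdb.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handle returned to the application. */
struct cbn_handle {
    struct addrinfo *list;  /* full result from getaddrinfo() */
    struct addrinfo *next;  /* first candidate not yet handed out */
};

/* Pass *hp == NULL on the first call. Returns a connected socket or -1
 * when the candidates are exhausted or resolution fails. */
int cbn_connect(struct cbn_handle **hp, const char *host,
                const char *service)
{
    struct cbn_handle *h = *hp;
    struct addrinfo hints;

    if (h == NULL) {
        /* First call: resolve and remember the candidate list. */
        h = calloc(1, sizeof *h);
        if (h == NULL)
            return -1;
        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, service, &hints, &h->list) != 0) {
            free(h);
            return -1;
        }
        h->next = h->list;
        *hp = h;
    }

    /* Later calls resume after the address used last time, so a host
     * that connected but then misbehaved is not offered again. */
    while (h->next != NULL) {
        struct addrinfo *ai = h->next;
        int fd;

        h->next = ai->ai_next;
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            return fd;
        close(fd);
    }
    return -1;
}

/* Free function exported alongside cbn_connect(). */
void cbn_free(struct cbn_handle *h)
{
    if (h != NULL) {
        if (h->list != NULL)
            freeaddrinfo(h->list);
        free(h);
    }
}
```

The application keeps the handle for as long as it might retry, and
calls cbn_free() once it is finished with that name.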

>
> So, now we have a list of addresses with valid TTLs as of the start of
> our connection attempt. Next step: start the connection attempt.
>
> Pick the "first" address (chosen by whatever the ordering rules are)
> and send the connection request packet and let the OS do its normal
> retry schedule. Wait one second (system or sysctl configurable) or
> until the previous connection request was either accepted or rejected,
> whichever is shorter. If not connected yet, background it, pick the
> next address and send a connection request. Repeat until one
> connection request has been issued to all possible destination
> addresses for the name.
>
> Finished when:
>
> 1. Any of the pending connection requests completes (others are aborted).
>
> 2. The time out is reached (all pending requests aborted).
>
> Once a connection is established, this should be cached alongside the
> address and its TTL so that next time around that address can be tried
> first.
>

Seems mostly reasonable. I would consider possibly having some form of
inverse exponential backoff on the initial connection attempts. Maybe
wait 5 seconds for the first one before trying the second one, wait
2 seconds before trying the third, then 1 second if the third one hasn't
connected, then bottom out somewhere around 500ms for the remainder.
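
That schedule can be written down as a small lookup. The function name
stagger_delay_ms and the exact values are only a sketch of the numbers
floated above, not a tuned recommendation.

```c
/* Hypothetical schedule: how long to wait after starting connection
 * attempt number `attempt` (0-based) before starting the next one,
 * in milliseconds. */
int stagger_delay_ms(int attempt)
{
    static const int delays_ms[] = { 5000, 2000, 1000 };
    const int count = (int)(sizeof delays_ms / sizeof delays_ms[0]);
    const int floor_ms = 500;   /* every later attempt waits this long */

    if (attempt < 0)
        attempt = 0;
    return attempt < count ? delays_ms[attempt] : floor_ms;
}
```

So the first address gets 5 seconds to itself, and by the fourth
address new attempts are being started every half second.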

>
>
>> Since you were hell bent on calling the existing mechanisms broken
>> rather than conceding the point that the current process is not
>> broken, but could stand some improvements in the library
>
> I hold that if an architecture encourages a certain implementation
> mistake largely to the exclusion of correct implementations then that
> architecture is in some way broken. That error may be in a particular

I don't believe that the architecture encourages the implementation
mistake.

Rather, I think human behavior, and our tendency not to seek a proper
understanding of the theory of operation of various things before
implementing things which depend on them, is more at fault. I suppose
you can argue that the API should be built to avoid that, but we'll
have to agree to disagree on that point. I think that low-level APIs
(and this is a low-level API) have to be able to rely on the engineers
who use them making the effort to understand the theory of operation. I
believe that the fault here is the lack of a standardized higher-level
API in some languages.

> component, but it could be that the components themselves are correct.
> There could be a missing component or the components could be strung
> together in a way that doesn't work right. Regardless of the exact
> cause, there is an architecture level mistake which is the root cause
> of the consistently broken implementations.
>

I suppose by your definition this constitutes a missing component. I
don't see it that way. I see it as a complete and functional system for
a low-level API. There are high-level APIs available. As you have noted,
some are better than others. A standardized, well-written high-level API
would, indeed, be useful. However, that does not make the low-level API
broken just because it is common for poorly trained users to make
improper use of it. It is common for people using hammers to hit their
thumbs. This does not mean that hammers are architecturally broken or
that they should be re-engineered to have elaborate thumb-protection
mechanisms.

The fact that you can electrocute yourself by sticking a fork into a
toaster while it is operating is, likewise, not an indication that
toasters are architecturally broken.

It is precisely this attitude that has significantly increased the
overhead and unnecessary expense of many systems while making product
liability lawyers quite wealthy.

Owen
