[13] in linux-net channel archive

1.1.73, UAR, and packet lossage - UDP done the dirty deed!

daemon@ATHENA.MIT.EDU (Dave Platt)
Thu Dec 22 23:08:51 1994

From: dplatt@3do.com (Dave Platt)
Date: Thu, 22 Dec 1994 19:21:59 PST
To: linux-atalk@netspace.students.brown.edu
Cc: linux-net@vger.rutgers.edu, Alan.Cox@linux.org, djh@cs.mu.oz.au

More debugging, and some reasonably good news.  I now understand why CAP
doesn't work terribly well on 1.1.73 if UAR is used as the interface.
I have a reasonable workaround for the problem.  I'm not sure whether
this workaround is the "right" way to fix the problem, or whether a
change to the network layer is appropriate.

The problem lies in the fact that UAR talks to its clients (local CAP
processes) via UDP datagrams sent to the loopback address.  Under some
circumstances, UAR may attempt to send a datagram to a UDP port
address that's invalid.  This can occur if an existing client
disconnects unexpectedly.  It occurs quite frequently when UAR receives a
broadcast DDP datagram for one of the special administrative DDP sockets
(RTMP, ZIP, etc.).

If UAR tries to send a datagram to a non-bound UDP port via the loopback
interface, an error condition is propagated back from the UDP datagram
receiver via ICMP.  The ICMP error (ICMP_PORT_UNREACHABLE) is translated
into an ECONNREFUSED Unix error code, and the error code is stuck into
the UAR datagram socket structure.

The error is not, however, returned as a result of the sendto() call
which sent the undeliverable datagram.  I'm not sure, from looking at
the code, whether the error can always be expected to have been
delivered by the time the sendto() call is about to return... I _think_
it is, for datagrams sent via the loopback device, but I'm not entirely
certain.

In any case... the error condition is delivered back to UAR on the NEXT
recvfrom() or sendto() call it makes on the datagram port.  If the next
such call is a recvfrom(), the call fails with an ECONNREFUSED
condition, the readcap() routine returns, the next select() call awakens
immediately, and the next call to readcap() returns whatever datagram is
available on the socket.  Annoying, but benign.

However, if UAR tries to send another datagram on the socket BEFORE its
next recvfrom(), bad things happen - THIS sendto() call returns with an
ECONNREFUSED status, and the datagram is dropped on the floor.  If
there's enough network activity, and enough dropped datagrams, the odds
of an AUFS session staying alive drop to near-zero.  This can also cause
the atis process to fail to receive NBP datagrams, which can cause CAP
processes to fail to show up on the Chooser on your Mac.
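
The sequence described above can be illustrated with a toy model.  This
is not kernel code: struct toy_sock, toy_sendto() and toy_recvfrom() are
hypothetical names, and the single pending-error slot is an assumption
based on the behavior described in this message.

```c
#include <errno.h>

/* Toy model only: one pending-error slot per datagram socket. */
struct toy_sock {
    int pending_err;   /* 0 = none, else an errno value posted via ICMP */
    int transmitted;   /* datagrams that actually left the socket */
};

/* dest_bound is nonzero if some process is bound to the destination port. */
static int toy_sendto(struct toy_sock *s, int dest_bound)
{
    if (s->pending_err) {           /* stale error from an EARLIER send */
        errno = s->pending_err;
        s->pending_err = 0;
        return -1;                  /* this datagram is never transmitted */
    }
    s->transmitted++;
    if (!dest_bound)
        s->pending_err = ECONNREFUSED;   /* reported on the NEXT call */
    return 0;
}

static int toy_recvfrom(struct toy_sock *s)
{
    if (s->pending_err) {           /* the error is consumed here instead */
        errno = s->pending_err;
        s->pending_err = 0;
        return -1;
    }
    return 0;                       /* would return a queued datagram */
}
```

In this model a send to an unbound port succeeds, and the very next
toy_sendto() fails with ECONNREFUSED even if its own destination is
valid, which is the dropped-datagram case.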

The workaround:  modify UAR to cope with the problem.  In cap.c, keep a
retry counter;  if sendto() returns with an error condition, simply
retry the operation a few times.  I've set the retry limit to 3, and
haven't seen any dropped packet require more than 1 retry.
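
A minimal sketch of such a retry loop follows.  It is not the actual
change to cap.c; the helper name sendto_with_retry() and its max_retries
parameter are assumptions made for illustration.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: call sendto() and, if it fails, retry up to
 * max_retries more times.  A failure caused by a stale pending error
 * clears that error, so the retry normally goes through.
 */
static ssize_t sendto_with_retry(int fd, const void *buf, size_t len,
                                 const struct sockaddr *to, socklen_t tolen,
                                 int max_retries)
{
    ssize_t n;
    int attempt = 0;

    do {
        n = sendto(fd, buf, len, 0, to, tolen);
    } while (n < 0 && attempt++ < max_retries);

    return n;   /* bytes sent, or -1 if every attempt failed */
}
```

With max_retries set to 3 this makes at most four sendto() calls per
datagram.  It retries on any error, as described above; a stricter
version might retry only when errno is ECONNREFUSED.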

The possible fix:  modify the UDP sendto() routine to check for the
occurrence of an error on the socket _after_ it has tried to transmit the
packet.  If it finds an error condition, return the error code as the
result of this sendto() and clear the error condition.  This should work
for loopback datagrams _if_ the error code is always posted into the
socket structure during the packet dispatch operation - I'll have to
check the code to see if this is true.  If it works, this is probably a
somewhat cleaner solution than hacking UAR.
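
The same idea can be approximated from user space, as a rough analogy
rather than the kernel change itself.  The sketch below assumes the
pending error is visible through getsockopt(SO_ERROR), which returns
and clears it; the helper name sendto_checked() is hypothetical.
Whether the error has already been posted by the time sendto() returns
is exactly the open question above.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: send a datagram, then immediately collect any
 * error posted on the socket so that it is reported against THIS send
 * rather than against the next call on the socket.
 */
static ssize_t sendto_checked(int fd, const void *buf, size_t len,
                              const struct sockaddr *to, socklen_t tolen)
{
    ssize_t n = sendto(fd, buf, len, 0, to, tolen);
    if (n < 0)
        return n;               /* errno already set by sendto() */

    int err = 0;
    socklen_t errlen = sizeof err;

    /* SO_ERROR reads the pending socket error and clears it. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
        return -1;
    if (err != 0) {
        errno = err;            /* e.g. ECONNREFUSED */
        return -1;
    }
    return n;
}
```

If the error is posted later than the getsockopt() call, this helper
misses it and the error still surfaces on the next socket call.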


-- 
Dave Platt    dplatt@3do.com
      USNAIL: The 3DO Company, Systems Software group
              600 Galveston Drive
              Redwood City, CA  94063
