[18892] in Athena Bugs

home help back first fref pref prev next nref lref last post

Peculiar ares hangs?

daemon@ATHENA.MIT.EDU (John Hawkinson)
Thu Apr 5 10:27:23 2001

Date: Thu, 5 Apr 2001 10:26:23 -0400 (EDT)
Message-Id: <200104051426.f35EQN706220@zorkmid.mit.edu>
To: bugs@mit.edu
From: John Hawkinson <jhawk@MIT.EDU>


Hi.

    On my NetBSD 1.5 laptop, which is running the source-sipb version
of zephyr and ares right now, I ocassionaly see weird hangs where zwgc
freezes solid. It took me a while to eventally get around to linking
it against a debugging version of ares.

    Anyhow, the hangs are all in recv(). The process gets stuck
tsleep()ing in the kernel with a "netio" WCHAN. I can get zwgc to move
along as normal by wakeup()ing the process and changing the return
value of the system call to, e.g. 1, and then letting the process
continue. Obviously this sort of trickery isn't really optimal.

    
    Anyhow, here's a stack trace and soforth:

(gdb) where
#0  0x481cc36f in recvfrom ()
#1  0x4818d5b9 in recv ()
#2  0x80743de in read_udp_packets (channel=0x808d000, read_fds=0xbfbfd5ec, 
    now=986478730) at ares_process.c:241
#3  0x8073e5b in ares_process (channel=0x808d000, read_fds=0xbfbfd5ec, 
    write_fds=0xbfbfd5cc) at ares_process.c:59
#4  0x8052cdb in mux_loop () at mux.c:186
#5  0x804f3dd in main (argc=1, argv=0xbfbfd678) at main.c:307
#6  0x804ab79 in ___start ()
(gdb) list
236           /* Make sure the server has a socket and is selected in read_fds. */
237           server = &channel->servers[i];
238           if (server->udp_socket == -1 || !FD_ISSET(server->udp_socket, read_fds))
239             continue;
240     
241           count = recv(server->udp_socket, buf, sizeof(buf), 0);
242           if (count <= 0)
243             handle_error(channel, i, now);
244     
245           process_answer(channel, buf, count, i, 0, now);

    If I do the above wakeup/return 1 business with a breakpoint on 242,
I get:

(gdb) print count
$1 = -1
(gdb) print server->udp_socket
$2 = 5
(gdb) print read_fds
$3 = (fd_set *) 0xbfbfd5ec
(gdb) print *read_fds
$4 = {fds_bits = {32, 0, 0, 0, 0, 0, 0, 0}}

So it would appear that read_fds is FD_ISSET for the socket,
yet the recv() still blocks.

zorkmid% fstat -vp 4818
USER     CMD          PID   FD MOUNT      INUM MODE         SZ|DV R/W
jhawk    zwgc        4818   wd /usr     976676 drwxr-xr-x    1024 r 
jhawk    zwgc        4818    0 /         10976 crw--w----   ttyp2 rw
jhawk    zwgc        4818    1 /         10976 crw--w----   ttyp2 rw
jhawk    zwgc        4818    2 /         10976 crw--w----   ttyp2 rw
jhawk    zwgc        4818    3* unix stream c0835cc0 <-> c093fc40
jhawk    zwgc        4818    4* internet dgram udp c0863120 *:62004
jhawk    zwgc        4818    5* internet dgram udp c0863f00 18.187.1.140:60041 <-> 18.71.0.151:53

zorkmid% netstat -Aa
Active Internet connections (including servers)
PCB      Proto Recv-Q Send-Q Local Address         Foreign Address       State
[...]
c0863f00 udp        0      0  zorkmid.60041      STRAWB.MIT.E.domai


I'm not 100% if netstat queue lengths are a reliable indicator of anything,
though.

Anyhow, I don't see any changes to ares_process.c on the mainline since
rev 1.3 that look pertinent. mux.c in zwgc (which actually handles the
select()-ing) hasn't changed at all.

I suppose one reasonable strategy is to change the
select() timeout from NULL to a few seconds, and try
to find out more about this case that way.

Are there any better suggestions?

I'm hard-pressed to say if this is a library bug (ares), a client bug (zwgc),
or an OS bug (NetBSD). It's not very easy to repeat, but it happens
every once in a while on my laptop, usually if I'm running
zephyr over an extremely spotty network connection, or
sometimes if I suspend and resume while zwgc is running.

--jhawk

home help back first fref pref prev next nref lref last post