[18892] in Athena Bugs
Peculiar ares hangs?
daemon@ATHENA.MIT.EDU (John Hawkinson)
Thu Apr 5 10:27:23 2001
Date: Thu, 5 Apr 2001 10:26:23 -0400 (EDT)
Message-Id: <200104051426.f35EQN706220@zorkmid.mit.edu>
To: bugs@mit.edu
From: John Hawkinson <jhawk@MIT.EDU>
Hi.
On my NetBSD 1.5 laptop, which is running the source-sipb version
of zephyr and ares right now, I ocassionaly see weird hangs where zwgc
freezes solid. It took me a while to eventally get around to linking
it against a debugging version of ares.
Anyhow, the hangs are all in recv(). The process gets stuck
tsleep()ing in the kernel with a "netio" WCHAN. I can get zwgc to move
along as normal by wakeup()ing the process and changing the return
value of the system call to, e.g. 1, and then letting the process
continue. Obviously this sort of trickery isn't really optimal.
Anyhow, here's a stack trace and soforth:
(gdb) where
#0 0x481cc36f in recvfrom ()
#1 0x4818d5b9 in recv ()
#2 0x80743de in read_udp_packets (channel=0x808d000, read_fds=0xbfbfd5ec,
now=986478730) at ares_process.c:241
#3 0x8073e5b in ares_process (channel=0x808d000, read_fds=0xbfbfd5ec,
write_fds=0xbfbfd5cc) at ares_process.c:59
#4 0x8052cdb in mux_loop () at mux.c:186
#5 0x804f3dd in main (argc=1, argv=0xbfbfd678) at main.c:307
#6 0x804ab79 in ___start ()
(gdb) list
236 /* Make sure the server has a socket and is selected in read_fds. */
237 server = &channel->servers[i];
238 if (server->udp_socket == -1 || !FD_ISSET(server->udp_socket, read_fds))
239 continue;
240
241 count = recv(server->udp_socket, buf, sizeof(buf), 0);
242 if (count <= 0)
243 handle_error(channel, i, now);
244
245 process_answer(channel, buf, count, i, 0, now);
If I do the above wakeup/return 1 business with a breakpoint on 242,
I get:
(gdb) print count
$1 = -1
(gdb) print server->udp_socket
$2 = 5
(gdb) print read_fds
$3 = (fd_set *) 0xbfbfd5ec
(gdb) print *read_fds
$4 = {fds_bits = {32, 0, 0, 0, 0, 0, 0, 0}}
So it would appear that read_fds is FD_ISSET for the socket,
yet the recv() still blocks.
zorkmid% fstat -vp 4818
USER CMD PID FD MOUNT INUM MODE SZ|DV R/W
jhawk zwgc 4818 wd /usr 976676 drwxr-xr-x 1024 r
jhawk zwgc 4818 0 / 10976 crw--w---- ttyp2 rw
jhawk zwgc 4818 1 / 10976 crw--w---- ttyp2 rw
jhawk zwgc 4818 2 / 10976 crw--w---- ttyp2 rw
jhawk zwgc 4818 3* unix stream c0835cc0 <-> c093fc40
jhawk zwgc 4818 4* internet dgram udp c0863120 *:62004
jhawk zwgc 4818 5* internet dgram udp c0863f00 18.187.1.140:60041 <-> 18.71.0.151:53
zorkmid% netstat -Aa
Active Internet connections (including servers)
PCB Proto Recv-Q Send-Q Local Address Foreign Address State
[...]
c0863f00 udp 0 0 zorkmid.60041 STRAWB.MIT.E.domai
I'm not 100% if netstat queue lengths are a reliable indicator of anything,
though.
Anyhow, I don't see any changes to ares_process.c on the mainline since
rev 1.3 that look pertinent. mux.c in zwgc (which actually handles the
select()-ing) hasn't changed at all.
I suppose one reasonable strategy is to change the
select() timeout from NULL to a few seconds, and try
to find out more about this case that way.
Are there any better suggestions?
I'm hard-pressed to say if this is a library bug (ares), a client bug (zwgc),
or an OS bug (NetBSD). It's not very easy to repeat, but it happens
every once in a while on my laptop, usually if I'm running
zephyr over an extremely spotty network connection, or
sometimes if I suspend and resume while zwgc is running.
--jhawk