[2776] in Kerberos-V5-bugs
pending/463: Kerberos rshd deadlock
daemon@ATHENA.MIT.EDU (raeburn@cygnus.com)
Mon Aug 25 09:33:26 1997
Resent-From: gnats@rt-11.MIT.EDU (GNATS Management)
Resent-To: gnats-admin@rt-11.MIT.EDU
Resent-Reply-To: krb5-bugs@MIT.EDU, raeburn@cygnus.com
Date: Mon, 25 Aug 1997 09:31:09 -0400 (EDT)
From: raeburn@cygnus.com
Reply-To: raeburn@cygnus.com
To: bugs@cygnus.com
Cc: krb5-bugs@MIT.EDU
>Number: 463
>Category: pending
>Synopsis: kshd hangs with write deadlock
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: gnats-admin
>State: open
>Class: sw-bug
>Submitter-Id: unknown
>Arrival-Date: Mon Aug 25 09:33:01 EDT 1997
>Last-Modified:
>Originator: Ken Raeburn
>Organization:
Cygnus Solutions
>Release: kerbnet-1.2
>Environment:
System: NetBSD kr-pc.cygnus.com 1.2B NetBSD 1.2B (RAEBURN) #1: Sun Nov 17 01:30:47 EST 1996 root@kr-pc.cygnus.com:/h/NetBSD-build/sys/arch/i386/compile/RAEBURN i386
>Description:
[This was with KerbNet 1.2, but I expect the MIT code suffers the same
lossage. Note I'm sending to both bug addresses. MIT folks, this
should be category "krb5-appl", but that's not on the KerbNet category
list.]
This command (rsh from osf1 to linux) hangs:
	(sleep 15 ; cat /vmunix) \
	  | rsh maneki-neko cat /vmlinuz \
	  | (sleep 30 ; cat > /dev/null)
The cat process (on the Linux box) stops in "pipe_wait" state. The
kshd process does too. According to netstat, a large amount of data
is sitting in the server's receive queue; nothing is in its send
queue, nor in the client's receive queue. Even when the client side
finishes the 30-second sleep and starts reading, the server side does
not recover.
My guess: kshd got data from the net, found that the pipe to the child
was writable, and tried writing to it. Since the child wasn't reading,
less pipe buffer space was available than kshd needed, so kshd blocked
waiting for the child to read. The child, in turn, blocked writing its
own output, because kshd (stuck in that write) wasn't reading it.
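
To illustrate why a write to the child's pipe can block at all, here
is a minimal sketch, in Python rather than the C that kshd is written
in. The function name pipe_capacity is made up for this example and
has nothing to do with the kshd source. It fills a pipe that nobody
reads and counts the bytes the kernel accepts; a blocking write of
anything beyond that would sleep, which is the state kshd appears to
be stuck in.

```python
import os


def pipe_capacity():
    """Count how many bytes a pipe accepts while nobody reads from it."""
    r, w = os.pipe()
    os.set_blocking(w, False)  # a full pipe now raises instead of hanging
    total = 0
    chunk = b"\0" * 4096
    try:
        while True:
            total += os.write(w, chunk)
    except BlockingIOError:
        pass  # pipe is full; a blocking write would sleep right here
    finally:
        os.close(r)
        os.close(w)
    return total


if __name__ == "__main__":
    print("pipe accepted", pipe_capacity(), "bytes before filling up")
```

The number printed depends on the kernel; the point is only that it
is finite, and far smaller than a kernel image.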
I first ran into this with "rsync", running in "push" mode; it keeps
locking up. The server side is probably starting to receive file
contents while still sending checksum data for the hierarchy. Come to
think of it, the perl script I used to use instead of rsync, which did
"rsh host tar -c -f - --files-from - < list-o-files" may have
triggered this bug a few times too. I should see if rdist hangs
too....
All other cases of writes in rsh and kshd ought to be checked to make
sure they can't block the flow of data in another direction.
>How-To-Repeat:
Run
	(sleep 15 ; cat /somebigfile) \
	  | rsh somehost cat /somebigfile \
	  | (sleep 30 ; cat > /dev/null)
between two machines on a fast network. (Fast enough that the various
kernel buffers can be filled in the specified delays.)
>Fix:
IMNSHO kshd (and rsh) should be using non-blocking I/O for any writes.
If the child isn't reading, let the pipe fill, stop reading from the
net, let that buffer fill, and let the kernel throttle the TCP
transmission. If a deadlock still results, *then* it's an application
bug.
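
A rough sketch of the loop I have in mind, again in Python for
brevity and not taken from any kshd source. Channel, relay and
BUF_LIMIT are names invented for this example. Each direction gets a
bounded buffer; a source is only read while its buffer has room, and
a destination is only written when select() says it is writable, so
no single write can stall the other direction.

```python
import os
import select

BUF_LIMIT = 65536  # hypothetical cap on bytes held per direction


class Channel:
    """One direction of the relay: copy bytes from fd src to fd dst."""

    def __init__(self, src, dst):
        self.src, self.dst = src, dst
        self.pending = b""   # data read from src, not yet accepted by dst
        self.eof = False     # src has reported end of file
        self.closed = False  # dst has been closed after the final flush
        os.set_blocking(dst, False)  # a full dst must not stall the loop


def relay(channels):
    """Service all channels until each has hit EOF and drained its buffer."""
    while not all(c.closed for c in channels):
        live = [c for c in channels if not c.closed]
        # Stop reading a source once its buffer is full, so the kernel
        # (and, for a socket, TCP flow control) throttles the sender.
        rlist = [c.src for c in live
                 if not c.eof and len(c.pending) < BUF_LIMIT]
        wlist = [c.dst for c in live if c.pending]
        readable, writable, _ = select.select(rlist, wlist, [])
        for c in live:
            if c.src in readable:
                data = os.read(c.src, 4096)
                if data:
                    c.pending += data
                else:
                    c.eof = True
            if c.dst in writable and c.pending:
                try:
                    sent = os.write(c.dst, c.pending)
                    c.pending = c.pending[sent:]
                except BlockingIOError:
                    pass  # dst filled up again; select will retry
            if c.eof and not c.pending:
                os.close(c.dst)  # pass EOF downstream (pipes only)
                c.closed = True
```

For kshd this would mean one Channel from the network to the child's
stdin and one from the child's stdout back to the network, serviced
by the same loop. The sketch closes the destination on EOF, which is
right for pipes; a socket shared by both directions would presumably
want shutdown() on the write side instead. Error handling (a reader
that goes away, for instance) is left out.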
>Audit-Trail:
>Unformatted: