[447] in Athena Bugs
Problem on POs
daemon@ATHENA.MIT.EDU (Ron M. Hoffmann)
Fri Jun 17 13:32:18 1988
Date: Fri, 17 Jun 88 13:31:45 EDT
From: Ron M. Hoffmann <hoffmann@BITSY.MIT.EDU>
To: bugs@ATHENA.MIT.EDU
Cc: hoffmann@BITSY.MIT.EDU
I have documented this at least once before, but will do so again,
since people are still being bitten by it and no one has suggested
a fix.
The problem starts when a number of defunct processes start to linger:
E40-PO# ps ax
  PID TT STAT  TIME COMMAND
    0 ?  D     0:00 swapper
    1 ?  S     0:00 init
    2 ?  D     0:00 pagedaemon
   92 ?  S     0:33 /etc/named
  103 ?  S     0:47 /etc/syslogd
  121 ?  S     0:09 /etc/update
  125 ?  I     0:08 /etc/cron
  131 ?  I     0:00 /etc/inetd
  154 ?  S     0:10 /usr/lib/sendmail -bd -q30m
  283 ?  S     0:28 /usr/etc/knetd
  285 co I     0:00 - std.9600 console (getty)
  443 ?  Z     0:00 <defunct>
  515 ?  Z     0:00 <defunct>
  605 p0 Z     0:00 <defunct>
  854 ?  Z     0:00 <defunct>
  948 ?  Z     0:00 <defunct>
 1117 ?  Z     0:00 <defunct>
 1174 p0 Z     0:00 <defunct>
 1638 p0 Z     0:00 <defunct>
 1715 p0 Z     0:00 <defunct>
 1717 p0 Z     0:00 <defunct>
 1861 p0 Z     0:00 <defunct>
 1873 p0 S     0:00 klogind
 1874 p0 S     0:02 -csh (csh)
 1878 p0 R     0:00 ps ax
E40-PO#
Who do they belong to, you ask?
E40-PO# ps axl
      F UID   PID  PPID CP PRI NI ADDR  SZ RSS WCHAN  STAT TT  TIME COMMAND
      3   0     0     0  0 -25  0  305   0   0 runout D    ?   0:00 swapper
1008001   0     1     0  0   5  0  90e  13   9 proc   I    ?   0:00 init
1000003   0     2     0  0 -24  0  900 608   0 proc   D    ?   0:00 pagedaemon
1008001   0    92     1  0   1  0  a48  52  39 selwai S    ?   0:33 /etc/named
1008001   0   103     1  0   1  0  b9e  31  22 selwai S    ?   0:48 /etc/syslo
1008201   0   121     1  0  15  0  a8e   4   2 u      S    ?   0:09 /etc/updat
1008201   0   125     1  0  15  0  a66  19  10 u      I    ?   0:08 /etc/cron
1008001   0   131     1  0   1  0  d32  29  21 selwai I    ?   0:00 /etc/inetd
1008001   0   154     1  0   1  0  e9c  67  57 mbutl  I    ?   0:10 /usr/lib/s
1008001   0   283     1  0   1  0  dcc  38  32 selwai S    ?   0:28 /usr/etc/k
1408001   0   285     1  0   3  0  ad0  25  18 cons   I    co  0:00 - std.9600
1408401   0   443   283  8  27  0  b9e   0   0        Z    ?   0:00 <defunct>
1408401   0   515   283 14  28  0  d32   0   0        Z    ?   0:00 <defunct>
1408401   0   605   283 11  27  0 163c   0   0        Z    p0  0:00 <defunct>
1408401   0   854   283 15  28  0  a48   0   0        Z    ?   0:00 <defunct>
1408401   0   948   283 24  31  0  a48   0   0        Z    ?   0:00 <defunct>
1408401   0  1117   283  5  26  0    0   0   0        Z    ?   0:00 <defunct>
1408401  50  1174   283 17  29  0 163c   0   0        Z    p0  0:00 <defunct>
1408401   0  1638   283  9  27  0 163c   0   0        Z    p0  0:00 <defunct>
1408401   0  1715   283 10  27  0 163c   0   0        Z    p0  0:00 <defunct>
1408401   0  1717   283  9  27  0 163c   0   0        Z    p0  0:00 <defunct>
1408401   0  1861   283 18  -5  0 1560   0   0        Z    p0  0:00 <defunct>
1008001   0  1873   283  0   1  0 1608  44  15 selwai S    p0  0:00 klogind
1408201   0  1874  1873  0  15  0 163c  33  21 u      S    p0  0:02 -csh (csh)
1008001   0  1883  1874 34  33  0 1560  83  60        R    p0  0:00 ps axl
E40-PO#
Aha! The dastardly accusing finger is pointing at Mr. /usr/etc/knetd!
Here's a look at the connection state of the machine around that time:
E40-PO# netstat -a
Active Internet connections (including servers)
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp        0      0  E40-PO.MIT.EDU.knetd   E40-343A-2.MIT.E.1268  ESTABLISHED
tcp        0      0  E40-PO.MIT.EDU.knetd   ARIADNE.MIT.EDU.1125   TIME_WAIT
tcp        0     78  E40-PO.MIT.EDU.knetd   PADDINGTON.MIT.E.1021  ESTABLISHED
tcp        0      0  *.knetd                *.*                    LISTEN
tcp        0      0  *.smtp                 *.*                    LISTEN
tcp        0      0  *.klogin               *.*                    LISTEN
tcp        0      0  *.write                *.*                    LISTEN
tcp        0      0  *.finger               *.*                    LISTEN
tcp        0      0  *.exec                 *.*                    LISTEN
tcp        0      0  *.login                *.*                    LISTEN
tcp        0      0  *.shell                *.*                    LISTEN
tcp        0      0  *.telnet               *.*                    LISTEN
tcp        0      0  *.ftp                  *.*                    LISTEN
tcp        0      0  *.nameserv             *.*                    LISTEN
udp        0      0  *.ntalk                *.*
udp        0      0  *.talk                 *.*
udp        0      0  *.biff                 *.*
udp        0      0  *.tftp                 *.*
udp        0      0  *.syslog               *.*
udp        0      0  *.nameserv             *.*
E40-PO#
Looks innocent enough, you say? Well, once again this morning around
8:00 am the number of defunct processes was large enough to fill up
the process table to the point where I couldn't even log in as root on
the console port, not to mention the fact that users couldn't retrieve
their mail.
The only solution at that point is to halt the machine and reboot,
carefully.
Could we track this down once and for all? This is really the only
persistent bug that results in denial of service to PO clients, apart
from the `locked mailbox' problem, where lingering TCP connections
don't die. My hunch is that the latter could be fixed if the server
enabled the TCP keepalive option, so that dead connections would time
out.
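For reference, here is a minimal sketch of what turning on keepalive
looks like, assuming the server has the connected socket descriptor in
hand. enable_keepalive is a hypothetical helper, not a function from
the PO server; SO_KEEPALIVE is the standard socket-level option. How
long an idle connection survives before the kernel gives up on it is a
system setting that this call does not control.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Ask the kernel to probe an otherwise idle connection, so that a peer
 * which has vanished is eventually detected and the connection dropped.
 * Returns 0 on success, -1 on error (errno set by setsockopt). */
static int enable_keepalive(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
}
```

A server would typically call this on each descriptor returned by
accept(), before handing the connection to its worker.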
-Ron