[447] in Athena Bugs


Problem on POs

daemon@ATHENA.MIT.EDU (Ron M. Hoffmann)
Fri Jun 17 13:32:18 1988

Date: Fri, 17 Jun 88 13:31:45 EDT
From: Ron M. Hoffmann <hoffmann@BITSY.MIT.EDU>
To: bugs@ATHENA.MIT.EDU
Cc: hoffmann@BITSY.MIT.EDU

I have documented this at least once before but will do it again
as people are still getting abused by this and there have been no
suggestions for a fix.

The problem starts when a number of defunct processes start to linger:

E40-PO# ps ax
  PID TT STAT  TIME COMMAND
    0 ?  D     0:00 swapper
    1 ?  S     0:00 init
    2 ?  D     0:00 pagedaemon
   92 ?  S     0:33 /etc/named
  103 ?  S     0:47 /etc/syslogd
  121 ?  S     0:09 /etc/update
  125 ?  I     0:08 /etc/cron
  131 ?  I     0:00 /etc/inetd
  154 ?  S     0:10 /usr/lib/sendmail -bd -q30m
  283 ?  S     0:28 /usr/etc/knetd
  285 co I     0:00 - std.9600 console (getty)
  443 ?  Z     0:00 <defunct>
  515 ?  Z     0:00 <defunct>
  605 p0 Z     0:00 <defunct>
  854 ?  Z     0:00 <defunct>
  948 ?  Z     0:00 <defunct>
 1117 ?  Z     0:00 <defunct>
 1174 p0 Z     0:00 <defunct>
 1638 p0 Z     0:00 <defunct>
 1715 p0 Z     0:00 <defunct>
 1717 p0 Z     0:00 <defunct>
 1861 p0 Z     0:00 <defunct>
 1873 p0 S     0:00 klogind
 1874 p0 S     0:02 -csh (csh)
 1878 p0 R     0:00 ps ax
E40-PO# 

Who do they belong to, you ask?

E40-PO# ps axl
      F UID   PID  PPID CP PRI NI ADDR  SZ  RSS WCHAN  STAT TT  TIME COMMAND
      3   0     0     0  0 -25  0  305   0    0 runout D    ?   0:00 swapper
1008001   0     1     0  0   5  0  90e  13    9 proc   I    ?   0:00 init
1000003   0     2     0  0 -24  0  900 608    0 proc   D    ?   0:00 pagedaemon
1008001   0    92     1  0   1  0  a48  52   39 selwai S    ?   0:33 /etc/named
1008001   0   103     1  0   1  0  b9e  31   22 selwai S    ?   0:48 /etc/syslo
1008201   0   121     1  0  15  0  a8e   4    2 u      S    ?   0:09 /etc/updat
1008201   0   125     1  0  15  0  a66  19   10 u      I    ?   0:08 /etc/cron
1008001   0   131     1  0   1  0  d32  29   21 selwai I    ?   0:00 /etc/inetd
1008001   0   154     1  0   1  0  e9c  67   57 mbutl  I    ?   0:10 /usr/lib/s
1008001   0   283     1  0   1  0  dcc  38   32 selwai S    ?   0:28 /usr/etc/k
1408001   0   285     1  0   3  0  ad0  25   18 cons   I    co  0:00 - std.9600
1408401   0   443   283  8  27  0  b9e   0    0        Z    ?   0:00 <defunct>
1408401   0   515   283 14  28  0  d32   0    0        Z    ?   0:00 <defunct>
1408401   0   605   283 11  27  0 163c   0    0        Z    p0  0:00 <defunct>
1408401   0   854   283 15  28  0  a48   0    0        Z    ?   0:00 <defunct>
1408401   0   948   283 24  31  0  a48   0    0        Z    ?   0:00 <defunct>
1408401   0  1117   283  5  26  0    0   0    0        Z    ?   0:00 <defunct>
1408401  50  1174   283 17  29  0 163c   0    0        Z    p0  0:00 <defunct>
1408401   0  1638   283  9  27  0 163c   0    0        Z    p0  0:00 <defunct>
1408401   0  1715   283 10  27  0 163c   0    0        Z    p0  0:00 <defunct>
1408401   0  1717   283  9  27  0 163c   0    0        Z    p0  0:00 <defunct>
1408401   0  1861   283 18  -5  0 1560   0    0        Z    p0  0:00 <defunct>
1008001   0  1873   283  0   1  0 1608  44   15 selwai S    p0  0:00 klogind
1408201   0  1874  1873  0  15  0 163c  33   21 u      S    p0  0:02 -csh (csh)
1008001   0  1883  1874 34  33  0 1560  83   60        R    p0  0:00 ps axl
E40-PO# 

Aha!  The dastardly accusing finger is pointing at Mr. /usr/etc/knetd!
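
For reference: every <defunct> entry above has PPID 283, and a defunct
process stays in the process table until its parent waits for it. The
following is only a sketch of the usual remedy, written in Python for
brevity. It assumes knetd forks one child per connection and never
collects the exit status; that is a guess, since the knetd source is not
shown here. The names reap_children and install_reaper are hypothetical.

```python
import os
import signal


def reap_children(signum=None, frame=None):
    """Collect every child that has already exited.

    Returns the list of PIDs reaped, so none of them lingers as <defunct>.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG makes the call return
            # immediately instead of blocking the server.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def install_reaper():
    """Run reap_children whenever a child exits (assumed server setup)."""
    signal.signal(signal.SIGCHLD, reap_children)
```

The loop matters because several child exits can be delivered as a single
SIGCHLD; waiting only once per signal would still leave stragglers behind.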

Here's a look at the connection state of the machine around that time:

E40-PO# netstat -a
Active Internet connections (including servers)
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp        0      0  E40-PO.MIT.EDU.knetd   E40-343A-2.MIT.E.1268  ESTABLISHED
tcp        0      0  E40-PO.MIT.EDU.knetd   ARIADNE.MIT.EDU.1125   TIME_WAIT
tcp        0     78  E40-PO.MIT.EDU.knetd   PADDINGTON.MIT.E.1021  ESTABLISHED
tcp        0      0  *.knetd                *.*                    LISTEN
tcp        0      0  *.smtp                 *.*                    LISTEN
tcp        0      0  *.klogin               *.*                    LISTEN
tcp        0      0  *.write                *.*                    LISTEN
tcp        0      0  *.finger               *.*                    LISTEN
tcp        0      0  *.exec                 *.*                    LISTEN
tcp        0      0  *.login                *.*                    LISTEN
tcp        0      0  *.shell                *.*                    LISTEN
tcp        0      0  *.telnet               *.*                    LISTEN
tcp        0      0  *.ftp                  *.*                    LISTEN
tcp        0      0  *.nameserv             *.*                    LISTEN
udp        0      0  *.ntalk                *.*                   
udp        0      0  *.talk                 *.*                   
udp        0      0  *.biff                 *.*                   
udp        0      0  *.tftp                 *.*                   
udp        0      0  *.syslog               *.*                   
udp        0      0  *.nameserv             *.*                   
E40-PO# 

Looks innocent enough, you say?  Well, once again this morning, around
8:00 am, the defunct processes had filled up the process table to the
point where I couldn't even log in as root on the console port, not to
mention the fact that users couldn't retrieve their mail.

The only solution at that point is to halt the machine and reboot,
carefully.

Could we track this down once and for all?  This is really the only
persistent bug which results in denial of service to PO clients, apart
from the `locked mailbox' problem, where lingering TCP connections never
die.  My hunch is that the latter could be fixed if the server properly
enabled the TCP keepalive option, so that dead connections would time
out.
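
As a rough illustration of the keepalive idea, the sketch below turns on
SO_KEEPALIVE for a connected socket, again in Python for brevity. Whether
the PO server sets this option today is not known from this report, and
the helper name enable_keepalive is hypothetical.

```python
import socket


def enable_keepalive(conn):
    """Ask the kernel to probe an idle TCP connection.

    Returns the option value read back from the socket (nonzero when set).
    """
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return conn.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
```

A server would call this on each socket returned by accept(). With the
option set, a peer that has vanished eventually causes the connection to
be reported as broken instead of lingering; how long that takes depends
on the system's keepalive timers.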

-Ron
