[982] in linux-net channel archive


Re: heavily-accessed site w/ random hangs on specific port... suggestions?

daemon@ATHENA.MIT.EDU (Nick Simicich)
Thu Aug 24 04:02:09 1995

Date: Wed, 23 Aug 1995 09:52:24 -29900
From: Nick Simicich <njs@scifi.maid.com>
To: "Matti E. Aarnio [OH1MQK]" <mea@mea.cc.utu.fi>
cc: Michael Brennen <mbrennen@puddytat.intecom.com>,
        linux-net@vger.rutgers.edu, emo@cica.cica.indiana.edu
In-Reply-To: <95Aug22.122353+0200eet_dst.62998-2+21@mea.utu.fi>

On Tue, 22 Aug 1995, Matti E. Aarnio [OH1MQK] wrote:

> > Subject: heavily-accessed site w/ random hangs on specific port... suggestions?
> > 
> > Greetings All,
> > 
> > I've read testimonials from folks who administer active Internet 
> > sites which are using Linux on a PC and I am hoping that a few
> > of you will read this note and be able to offer some suggestions.
> > Has anyone else experienced problems similar to the ones described 
> > below and what corrective actions were taken to resolve them?
> 
> 	I have similar experiences, but on an altogether
> 	different platform!
> 
> > Our site hosts one of the world's most active ftp sites: the CICA ftp software
> > archive.  It also runs httpd and gopher servers as well as handling
> > a normal load of email and system-related work, e.g. compiles, file edits,
> > etc.
> 
> 	Yes, the problem requires ultra-high server popularity to happen..
> 
> > Here is the current configuration:
> ...
> > Software
> > -----------------------------------------------------------------------------
> > Slackware 2.3
> > Linux 1.2.10
> > xinetd 2.1.4-linux.3 (master server)
> > wu-ftpd 2.4          (ftp server)
> > NCSA httpd 1.4.1     (http server)
> > gn 2.08              (gopher server)
> > -----------------------------------------------------------------------------
> ...
> > We have been experiencing strangeness whereby the system "burps" for a, 
> > sometimes extended, period of time, and connections to port 21 are not 
> > possible -- they just timeout.  However, we configured an ftp service
> > for port 1111 and have discovered that during a "burp" on port 21,
> > connections to wu-ftpd running on port 1111 are just fine.  Telnet (login)
> > to the machine also works speedily.  Random delays have occurred with as low 
> > as 90 and as high as 190 simultaneous connections to port 21; the more 
> > connections, the more likelihood of a delay...
> 
> 	Right, exactly same symptoms.  Now the big surprise:
> 
> 		ftp.funet.fi  running  DEC OSF/1 v 3.2A !
> 
> ...
> > The question arises: why do ftp connections to port 1111, telnet/login
> > to port 23, and smtp via port 25 all connect immediately without delay*?
> > Connections to these ports depend on allocation of sockets just like the ftp 
> > connection requests to port 21.  What are the causative differences here?
> > 
> > Log file data seems to indicate that wu-ftpd is not being forked by
> > xinetd on port 21 during these burps while wu-ftpd is being forked speedily
> > in response to port 1111 requests.  Thus, the problem does not appear
> > to be directly caused by wu-ftpd.  Nor does it appear to be caused by
> > xinetd.   If wu-ftpd can be initiated on port 1111 by xinetd, why 
> > can't the same executable be forked to communicate with port 21?
> 
> 	I made our own ftpd version (a bit older than wu-ftpd, btw.)
> 	to do its own "inetd" work. Essentially it just creates a listen
> 	socket, on which it sits  accept()ing, and when it gets a new
> 	connection, it fork()s off and returns to accept()..
> 
> 	When the jam happens, network monitors (persons with equipment
> 	on right spots on our ATM and FDDI nets) say that  ftp.funet.fi
> 	gets  TCP-SYNs (so, requests to create connection) into FTP
> 	control port, however it does not send TCP-ACK for it -> connection
> 	will not get established.   Simultaneous system call trace on
> 	that machine also shows that  accept() does not happen.
> 
> > During these random delays existing connections on port 21 do not seem to
> > be adversely affected.
> 
> 	Yes, the data transfers can be started with ease.
> 
> ...
> > Any assistance or suggestions folks may lend to resolution of this problem
> > will be greatly appreciated.  The net is a great resource; especially, the
> > thousands of folks working on developing kernel and application level
> > code for Linux.  
> 
> 	I am beginning to wonder, if Linux network code does something
> 	wrong in the same way as BSD one ?  (The OSF/1 networking code
> 	is clearly BSD code.)   Or is it something more sinister ...
> 	... the TCP/IP protocol is at fault ?
> 
> > Any words of wisdom out there?
> > 
> > Thanks,
> > 
> > eric
> 
> 	/Matti Aarnio <mea@utu.fi> <mea@nic.funet.fi>

This is my guess:  I think that you have received only the first part of
the TCP three part handshake, and that you have transmitted the second
part, but you never got the third part. 

Enough connections are stuck in this half-open state that all of the
listen() slots for the port are full.

Details:

Initial SYN packets are coming in to set up a connection.  You send a
SYN-ACK, but get no response, so the half-open connection keeps occupying
a slot in the listen() queue.  The missing response could have several
causes. 

1.  Random network packet loss. 

2.  You are sending the return syn-acks and are getting network/host
unreachable or RST's, but you are not properly using them to tear down the
partially created connection. 

3.  People are ftping to you, and then turning off their machines.

4.  Your site is actually under attack (denial of service or sequence
number) and someone has sent you a stream of SYNs from forged addresses
that do not reply with host unreachable or port unreachable or RST and
filled in all of your listen slots for this port. 

According to the listen()/accept() manpages, there are two possible legal 
behaviours when all listen slots are full.

a)  You can send RSTs, to tear down the other end's partially opened 
connection.

b) You can throw away the SYN packets, assuming that the other end will
retransmit, and hopefully you will have a listen() slot when the
retransmit arrives. 

You retransmit the SYN-ACK a few times and then eventually the partially 
set up connections time out.  At that point, the slots free and you can 
accept more connections.

This only affects connections made to a particular TCP port.  The
resource that is exhausted is that port's TCP listen() queue.
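
A minimal sketch of the mechanism described above, assuming behaviour (b)
(SYNs are silently dropped when the queue is full).  The function name,
the backlog of 5 and the 75 second timeout are illustrative assumptions,
not values taken from any particular kernel:

```python
from collections import deque

def simulate_listen_queue(arrivals, backlog, timeout):
    """Count SYNs dropped by a bounded listen queue.

    arrivals: list of (time, completes) pairs sorted by time.  completes
    is True when the client returns the final ACK (the slot is treated as
    freed at once) and False when the handshake is never finished.
    """
    half_open = deque()   # expiry times of half-open connections
    dropped = 0
    for t, completes in arrivals:
        # Free slots whose SYN-ACK retransmissions have timed out.
        while half_open and half_open[0] <= t:
            half_open.popleft()
        if len(half_open) >= backlog:
            dropped += 1  # queue full: the SYN is thrown away
            continue
        if not completes:
            half_open.append(t + timeout)
    return dropped

# Ten SYNs, one per second, none of which ever completes.
print(simulate_listen_queue([(i, False) for i in range(10)], 5, 75))
```

Once the first few never-completed handshakes fill the queue, every later
SYN to that port is dropped until the timeout frees a slot, while other
ports (each with its own queue) are unaffected.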

To confirm this, I'd trace packets with SYN by itself coming in to port 21
on this machine, the SYN-ACK going out, and the ACK coming back.  I'd look
for imbalances between the first and third parts of the handshake.  I'd
run tcpdump on the affected machine.  You need to monitor before the jam,
not after, to see the problem building up.
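
One way to tally that imbalance, sketched under the assumption that the
trace has already been parsed into (src, sport, dst, dport, flags)
records.  That record layout and the function name are hypothetical, and
parsing the actual tcpdump output is left out:

```python
def handshake_balance(packets, server_port=21):
    """Return (syns, completed, still_half_open) for one server port."""
    pending = set()       # (client_ip, client_port) awaiting the final ACK
    syns = completed = 0
    for src, sport, dst, dport, flags in packets:
        if dport != server_port:
            continue      # only client-to-server packets are counted
        flags = set(flags)
        key = (src, sport)
        if flags == {"SYN"}:
            if key not in pending:    # ignore retransmitted SYNs
                pending.add(key)
                syns += 1
        elif "ACK" in flags and not flags & {"SYN", "RST"}:
            if key in pending:        # third part of the handshake
                pending.discard(key)
                completed += 1
    return syns, completed, len(pending)
```

A steadily growing third number in the minutes before a jam would be
consistent with the listen queue filling up with half-open connections.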


Reduce the chances of net problems causing this by increasing the size of 
the listen queue.  This won't help if you are actually attacked.
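
For illustration only, this is where the queue size is requested when a
server opens its own listening socket.  The helper name and the backlog
value are assumptions; the operating system may cap whatever value is
requested, and when a master server such as xinetd owns the socket, it is
that program's listen() call which would need the larger value:

```python
import socket

def make_listener(host, port, backlog):
    """Open a TCP listening socket with the requested listen queue size."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))
    s.listen(backlog)     # the kernel may silently reduce this value
    return s

if __name__ == "__main__":
    # Port 0 lets the system pick a free port for this demonstration.
    listener = make_listener("127.0.0.1", 0, 128)
    print("listening on port", listener.getsockname()[1])
    listener.close()
```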


Nick Simicich - njs@scifi.emi.net - (last choice) njs@bcrvm1.vnet.ibm.com
http://scifi.emi.net/njs.html -- Stop by and Light Up The World!

