[7443] in www-talk@info.cern.ch


Re: mystery NCSA httpd problems on gnn.com

daemon@ATHENA.MIT.EDU (Rob McCool)
Mon Jan 30 19:22:57 1995

Date: Tue, 31 Jan 1995 00:49:25 +0100
Errors-To: listmaster@www0.cern.ch
Reply-To: robm@neon.mcom.com
From: Rob McCool <robm@neon.mcom.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>

/*
 * "Re: mystery NCSA httpd problems on gnn.com" by Robert S. Thau
 *    written Mon, 30 Jan 95 14:02:38 EST
 * 
 * I believe I've seen the same sort of thing (incoming connections
 * timing out, no new connections being logged, CPU and disks dead,
 * main server process shows up as blocked in accept() if I gcore(1)
 * it and do a backtrace).  Note that this doesn't seem to be entirely
 * consistent with the "accept queue backup" story --- if the accept()
 * queue on the socket is full to bursting, why doesn't the server
 * accept new connections?

Because the same queue holds both connections that are ready to be
accepted and half-negotiated connections (ones still in the TCP
handshake). The latter can fill the queue, which keeps any new
connections from being negotiated.
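
For anyone who hasn't looked at where that queue comes from: its length
is whatever the server asks for in listen(). A minimal Python sketch,
not taken from NCSA httpd; the name make_listener and the backlog of 5
are assumptions for illustration, and how half-open connections count
against the backlog differs from kernel to kernel.

```python
import socket

def make_listener(host="127.0.0.1", port=0, backlog=5):
    """Return a TCP socket listening on (host, port).

    backlog is the queue length requested from the kernel (hypothetical
    default of 5); port 0 asks the kernel for any free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(backlog)  # connections wait in this queue until accept()
    return sock
```

A connection the kernel has finished negotiating sits in that queue
until the server calls accept() on the returned socket.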

 * One other piece of puzzling evidence --- intense bursts of
 * connections don't always provoke the bug.  I try to keep track of
 * peak load here by logging a histogram of transactions/sec
 * vs. number-of-seconds.  We routinely log bursts of >10
 * transactions/sec a few times a day even on weekends, when this sort
 * of "freeze-up" behavior doesn't seem to have been a problem.

We've always been able to track it down to a line being down. When the
watchdogs report a server not responding (both of them invariably do
it at the same time BTW even though they're on different outbound
lines), my first step is to look for a down route. I keep a list of
10-20 hosts and ping each one; usually by the second or third I hit a
failure. Traceroute can then generally find the down route.
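
The check described above could be scripted along these lines. This is
only a sketch under assumptions: it swaps a TCP connect to port 80 in
for ICMP ping (so it needs no special privileges), and the names
tcp_reachable and first_unreachable are made up for illustration.

```python
import socket

def tcp_reachable(host, port=80, timeout=3.0):
    """Return True if (host, port) answers at all within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except ConnectionRefusedError:
        # The host answered with a reset, so the route to it is up.
        return True
    except OSError:
        # Timeouts, unreachable networks and failed name lookups land here.
        return False

def first_unreachable(hosts, probe=tcp_reachable):
    """Return the first host for which probe(host) is false, else None."""
    for host in hosts:
        if not probe(host):
            return host
    return None
```

Calling first_unreachable() with the list of hosts returns the first
one that fails to answer, which is the one to hand to traceroute.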

 * Incidentally, killing off the server process and restarting it
 * always gets things moving again (at least it does here), so that
 * action seems to clear whatever inside the kernel is causing the
 * bottleneck.

Yes, because the socket listening on port 80 is closed and then
re-opened with a fresh queue.
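
In code terms, a restart amounts to something like the following. A
rough Python sketch, not how NCSA httpd does it; reopen_listener and
the backlog of 5 are assumptions for illustration.

```python
import socket

def reopen_listener(old_sock, backlog=5):
    """Close old_sock and return a new listener bound to the same address."""
    addr = old_sock.getsockname()
    old_sock.close()  # the kernel drops the old queue along with the socket
    new_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEADDR lets bind() succeed while old connections sit in TIME_WAIT.
    new_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    new_sock.bind(addr)
    new_sock.listen(backlog)  # fresh, empty queue
    return new_sock
```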

 * That hack seems to have helped matters, but I'm not sure that it's
 * gotten rid of the freeze-ups entirely --- I spotted something which
 * looked an awful lot like the same old freeze on Friday, although
 * this time the process was waiting in select().  If the bug keeps on
 * showing up at an annoying rate, the next thing I'll try is closing
 * and reopening the socket if no connection requests have come in for
 * ten seconds or so, but that seems a little drastic.
 */

You have to be careful about race conditions there. Between the close
and the re-open nothing is listening on the port, so people could get
connection refused if they hit your server at just the right time.
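
The window is easy to see on a loopback socket. A small Python sketch
(refused_while_closed is a made-up name): it opens a listener on a
kernel-chosen port, closes it, and then plays the part of a client
arriving before anything is listening again.

```python
import socket

def refused_while_closed(host="127.0.0.1"):
    """Open a listener, close it, and report whether a connect is then refused."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))   # port 0: let the kernel pick a free port
    srv.listen(5)
    addr = srv.getsockname()
    srv.close()           # start of the window: nothing is listening
    try:
        socket.create_connection(addr, timeout=2).close()
    except ConnectionRefusedError:
        return True       # a client arriving now gets "connection refused"
    return False
```

The shorter the gap between close() and the next listen(), the fewer
clients land in it, but it never shrinks to zero.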

--Rob
