[12077] in Athena Bugs
["John T. Kohl": [mcba@newt.phys.unsw.edu.au: ULTRIX 4.4: Does it still have the ancient "select collision" bug?]]
daemon@ATHENA.MIT.EDU (Marc Horowitz)
Mon May 23 10:17:53 1994
To: bugs@MIT.EDU, postmaster@MIT.EDU
Date: Mon, 23 May 1994 10:18:24 -0400
From: Marc Horowitz <marc@cam.ov.com>
This could help explain the po-server meltdown and mailhub hosage
problems, too.
Could someone compile a new fifo_gnodeops.o and put it somewhere? I'd
volunteer, but I'm not licensed.
Marc
------- Forwarded Message
Forwarded: Mon, 23 May 1994 10:15:57 -0400
Forwarded: "lore@mit.edu "
Received: from MIT.EDU by pad-thai.aktis.com (8.6.8/) with SMTP
id <IAA24493@pad-thai.aktis.com>; Mon, 23 May 1994 08:28:57 -0400
Received: from gw.atria.com by MIT.EDU with SMTP
id AA07235; Mon, 23 May 94 08:27:11 EDT
Received: from banana (banana.atria.com) by gw.atria.com id <AA21495@gw.atria.com> Mon, 23 May 94 08:27:04 EDT
Received: by banana; id AA01559; Mon, 23 May 1994 08:27:03 -0400
Date: Mon, 23 May 1994 08:27:03 -0400
From: "John T. Kohl" <jtk@atria.com>
Message-Id: <9405231227.AA01559@banana>
To: sipb-staff@MIT.EDU, usenet@MIT.EDU
Subject: [mcba@newt.phys.unsw.edu.au: ULTRIX 4.4: Does it still have the ancient "select collision" bug?]
X-Us-Snail: Atria Software Inc., 24 Prime Park Way, Natick, MA 01760
Could this explain any of the various lossage we've seen on the SIPB
service machines?
Newsgroups: comp.unix.ultrix
From: mcba@newt.phys.unsw.edu.au (Michael C. B. Ashley)
Subject: ULTRIX 4.4: Does it still have the ancient "select collision" bug?
Keywords: dinosaur dodo extinct ULTRIX
Nntp-Posting-Host: newt.phys.unsw.edu.au
Organization: University of New South Wales
Date: Fri, 20 May 1994 03:55:29 GMT
Question: has DEC fixed in ULTRIX 4.4 the "select collision" bug that
has affected ULTRIX for years? If not, can we please have
a kernel patch of some sort?
History: the bug was pointed out by Corey Satten in 1993 in this
newsgroup. I have been troubled by it since 1991.
Effect of bug: it makes ULTRIX almost useless for running a busy
system. This is because every time a select collision
occurs, all processes waiting for a select are swapped into
memory. On a busy system (e.g., mine with 50+ users, 500
processes, 168 Mbytes of memory) this will cause almost
continual swapping, even though the total active virtual
memory of processes that should be running is half the
physical memory.
What programs cause select collisions: logging out of "dxsession"
will do it, "ghostview" does it, as do numerous other
programs. The fault is in ULTRIX.
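For readers unfamiliar with the term: in BSD-derived kernels a select collision is recorded when a second process selects on an object that already has a selector noted against it, and the kernel then wakes every process sleeping in select(). The sketch below sets up that situation on a pipe by having several children select() on the same read end. It is only an illustration under stated assumptions: `count_woken_selectors` is a hypothetical name, the code uses POSIX headers rather than ULTRIX ones, and nothing here has been checked against the `nselcoll` counter on a real ULTRIX system.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork nchildren processes that all select() for reading on the same
   pipe, then write one byte.  Returns how many children reported the
   pipe readable, or -1 if the pipe could not be created. */
int count_woken_selectors(int nchildren)
{
    int fds[2];
    int i, started = 0, woken = 0;

    if (pipe(fds) < 0)
        return -1;

    for (i = 0; i < nchildren; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                  /* stop forking; reap what we started */
        if (pid == 0) {
            fd_set rset;
            struct timeval tv;
            int n;

            close(fds[1]);          /* child only reads */
            FD_ZERO(&rset);
            FD_SET(fds[0], &rset);
            tv.tv_sec = 5;          /* give up after 5 seconds */
            tv.tv_usec = 0;
            n = select(fds[0] + 1, &rset, NULL, NULL, &tv);
            _exit(n == 1 ? 0 : 1);  /* exit 0 means "pipe was readable" */
        }
        started++;
    }

    sleep(1);                       /* let the children block in select() */
    if (write(fds[1], "x", 1) != 1)
        perror("write");

    for (i = 0; i < started; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            woken++;
    }
    close(fds[0]);
    close(fds[1]);
    return woken;
}
```

Calling `count_woken_selectors(3)` from a `main` should return 3 on a POSIX system, since no child consumes the byte. Running it alongside the `nselcoll` monitor given later in this message might show whether the counter moves.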
How to fix the problem: a two-line patch to the file fifo_gnodeops.c
will fix the problem. The patch is included below (it was
posted to comp.unix.ultrix about a year ago).
How to determine if ULTRIX 4.4 has the bug: compile and run the
following program. I would be interested to hear of your
results (4.4 hasn't reached this University yet). Another
way of examining the effect of the bug is to run "vmstat -v
5"; if you see spikes in the 'sl' and 'avm' columns,
you are being affected. For example:
procs            memory
r dw pw  sl w   vm  avm   rm  arm  fre
3  0  0  35 0 332k 113k 119k  54k  10k
3  0  0  31 0 333k 105k 120k  52k  10k
4  0  0  28 0 333k 107k 120k  53k 9700
3  0  0  27 0 333k 106k 120k  52k 9404
4  0  2 195 0 334k 263k 120k 100k 9560  <- spike, causes
5  0  0 197 0 334k 263k 120k 100k 9416     swapping
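Watching for such spikes by eye gets tedious, so here is a rough sketch of how the check could be automated. It assumes 'sl' is the fourth whitespace-separated field, as in the header line above, and it treats a sample as a spike when it exceeds the previous one by a caller-chosen factor. Both `parse_sl` and `is_spike` are hypothetical helper names, and the factor is an arbitrary choice rather than anything the original post specifies. Lines where adjacent fields run together would need fixed-width parsing instead.

```c
#include <stdio.h>

/* Return the 'sl' value (4th numeric field) from one vmstat data line,
   or -1 if the line does not start with four integers (e.g. a header). */
int parse_sl(const char *line)
{
    int r, dw, pw, sl;

    if (sscanf(line, "%d %d %d %d", &r, &dw, &pw, &sl) != 4)
        return -1;
    return sl;
}

/* Return 1 if cur_sl is more than `factor` times prev_sl, else 0.
   A previous value of 0 never counts as a baseline. */
int is_spike(int prev_sl, int cur_sl, int factor)
{
    return prev_sl > 0 && cur_sl > prev_sl * factor;
}
```

With a factor of 3, a jump in 'sl' from 27 to 195 is flagged, while the ordinary drift from 35 to 31 is not.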
Some other questions: do the sites that beta-test ULTRIX know what
they are doing? Do they read comp.unix.ultrix? Does DEC
read comp.unix.ultrix? Will they take any more notice of
bug reports from users regarding OSF/1?
- -------------------------------------------------------------------
/* A short program to print out the value of the kernel symbol
"nselcoll", the number of select collisions since boot time. It
runs continuously until killed, and tests "nselcoll" every second.
If it increments, the new value is printed, along with the time,
and a dummy program is executed. Note, this program doesn't check
return values from calls; it requires read access to /dev/kmem.
Compile with
cc -o nsell nsell.c (or whatever you want to call it)
also, to help find out which process is causing the problem,
this program executes a dummy program called ./NSELCOLL_______ which
results from compiling "main () {}". Then by looking at "lastcomm"
you can readily see the location in time of the select collisions
in relation to when other processes exited.
One way to trigger a select collision is simply to log out
using 'dxsession'. 'ghostview' also causes lots of select
collisions.
Tested on ULTRIX 4.3A on a DECstation 5000/260.
Incidentally, I get about one select collision every 20 seconds;
this is on a busy system (50+ users, 400-500 processes).
Michael Ashley / Uni of NSW / mcba@newt.phys.unsw.edu.au / 19 May 1994
*/
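#include <stdio.h>
#include <nlist.h>
#include <time.h>

/* Kernel symbol to look up; nlist() fills in n_value with its address. */
struct nlist nlst[] = {
    { "_nselcoll" },
    { 0 },
};

long lseek();

static int kmem = -1;   /* descriptor for /dev/kmem */

/* Read "size" bytes from kernel address "offset" into "ptr".
   "refstr" names the symbol being read and is unused here. */
void
getkval(unsigned long offset, int *ptr, int size, char *refstr) {
    lseek (kmem, (long)offset, 0);
    read (kmem, (char *)ptr, size);
}

int
main() {
    int i, last;
    time_t clock;
    char *c;

    kmem = open("/dev/kmem", 0);    /* mode 0 is read-only */
    nlist ("/vmunix", nlst);
    last = 0;
    while (1) {
        getkval (nlst[0].n_value, (int *)(&i), sizeof(i), nlst[0].n_name);
        if (i != last) {
            /* Counter changed: run the dummy program so it shows up in
               "lastcomm", then report the new value and the time. */
            clock = time ((time_t *)0);
            system ("./NSELCOLL_______");
            c = ctime (&clock);
            printf ("nselcoll=%i; %s", i, c);
            last = i;
        }
        sleep (1);
    }
}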
- ----------------------------------------------------------------------------
Here is the patch to fix the problem - if you have source code. In my
opinion DEC should provide source code free to all ULTRIX sites. There
are so many bugs in ULTRIX that to get a workable system you *have*
to be able to inspect/change the source.
*** ../orig/fifo_gnodeops.c Mon Apr 29 13:16:00 1991
--- ../patched/fifo_gnodeops.c Mon Mar 8 10:53:10 1993
***************
*** 36,41 ****
--- 36,44 ----
  /************************************************************************
   * Modification History
   *
+  * 13 Jan 93 -- burr@cs via fmf
+  * Select collision on pipes fix from Nancy Johnson Burr.
+  *
   * 27 Feb 91 -- chet
   * Fix filesystem timestamping
   *
***************
*** 624,630 ****
      if ((fp->fn_size != 0) || (fp->fn_wcnt == 0))
          ret_val = 1;
      else {
!         if (fp->fn_rselp)
              fp->fn_flag |= IF_RCOLL;
          else
              fp->fn_rselp = u.u_procp;
--- 627,633 ----
      if ((fp->fn_size != 0) || (fp->fn_wcnt == 0))
          ret_val = 1;
      else {
!         if (fp->fn_rselp && fp->fn_rselp->p_wchan == (caddr_t)& selwait)
              fp->fn_flag |= IF_RCOLL;
          else
              fp->fn_rselp = u.u_procp;
***************
*** 635,641 ****
      if ((fp->fn_size < PIPE_BUF) || (fp->fn_rcnt == 0))
          ret_val=1;
      else {
!         if (fp->fn_wselp)
              fp->fn_flag |= IF_WCOLL;
          else
              fp->fn_wselp = u.u_procp;
--- 638,644 ----
      if ((fp->fn_size < PIPE_BUF) || (fp->fn_rcnt == 0))
          ret_val=1;
      else {
!         if (fp->fn_wselp && fp->fn_wselp->p_wchan == (caddr_t)& selwait)
              fp->fn_flag |= IF_WCOLL;
          else
              fp->fn_wselp = u.u_procp;
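To see why the two changed lines matter, here is a user-space model of the read-side test. It is a sketch, not kernel code: the structures are cut down to the fields the patch touches, the `IF_RCOLL` value is made up, and `record_selector_orig`, `record_selector_patched`, and `selector_collides` are hypothetical names. The reading assumed here is that `fn_rselp` can still point at a process that selected earlier but is no longer asleep on `selwait`, and that the unpatched test counts that stale pointer as a collision.

```c
#include <stddef.h>

#define IF_RCOLL 0x1            /* illustrative value, not the kernel's */

struct proc { void *p_wchan; };                 /* what the process sleeps on */
struct fifo { struct proc *fn_rselp; int fn_flag; };

static int selwait;             /* stand-in for the kernel's select wait channel */

/* Unpatched test: any recorded selector counts as a collision. */
void record_selector_orig(struct fifo *fp, struct proc *self)
{
    if (fp->fn_rselp)
        fp->fn_flag |= IF_RCOLL;
    else
        fp->fn_rselp = self;
}

/* Patched test: collide only if the recorded selector is still
   asleep on selwait; otherwise take over the slot. */
void record_selector_patched(struct fifo *fp, struct proc *self)
{
    if (fp->fn_rselp && fp->fn_rselp->p_wchan == (void *)&selwait)
        fp->fn_flag |= IF_RCOLL;
    else
        fp->fn_rselp = self;
}

/* Process A is already recorded on the fifo; process B now selects.
   Returns 1 if a collision is flagged.  first_still_waiting says
   whether A is really still asleep in select(). */
int selector_collides(int patched, int first_still_waiting)
{
    struct proc a, b;
    struct fifo f;

    a.p_wchan = first_still_waiting ? (void *)&selwait : NULL;
    b.p_wchan = NULL;
    f.fn_rselp = &a;
    f.fn_flag = 0;

    if (patched)
        record_selector_patched(&f, &b);
    else
        record_selector_orig(&f, &b);
    return (f.fn_flag & IF_RCOLL) != 0;
}
```

In this model the unpatched test flags a collision whether or not A is still waiting, while the patched test flags one only when A is genuinely asleep on `selwait`. The write-side change follows the same pattern with `fn_wselp` and `IF_WCOLL`.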
- ----------
regards,
Michael
Michael Ashley; Dept. Astrophysics, Uni of NSW; mcba@newt.phys.unsw.edu.au
------- End of Forwarded Message