[12077] in Athena Bugs

["John T. Kohl": [mcba@newt.phys.unsw.edu.au: ULTRIX 4.4: Does it still have the ancient "select collision" bug?]]

daemon@ATHENA.MIT.EDU (Marc Horowitz)
Mon May 23 10:17:53 1994

To: bugs@MIT.EDU, postmaster@MIT.EDU
Date: Mon, 23 May 1994 10:18:24 -0400
From: Marc Horowitz <marc@cam.ov.com>


This could help explain the po-server meltdown and mailhub hosage
problems, too.

Could someone compile a new fifo_gnodeops.o and put it somewhere?  I'd
volunteer, but I'm not licensed.

		Marc

------- Forwarded Message

Forwarded: Mon, 23 May 1994 10:15:57 -0400
Forwarded: "lore@mit.edu "
Received: from MIT.EDU by pad-thai.aktis.com (8.6.8/) with SMTP
	id <IAA24493@pad-thai.aktis.com>; Mon, 23 May 1994 08:28:57 -0400
Received: from gw.atria.com by MIT.EDU with SMTP
	id AA07235; Mon, 23 May 94 08:27:11 EDT
Received: from banana (banana.atria.com) by gw.atria.com id <AA21495@gw.atria.com> Mon, 23 May 94 08:27:04 EDT    
Received: by banana; id AA01559; Mon, 23 May 1994 08:27:03 -0400
Date: Mon, 23 May 1994 08:27:03 -0400
From: "John T. Kohl" <jtk@atria.com>
Message-Id: <9405231227.AA01559@banana>
To: sipb-staff@MIT.EDU, usenet@MIT.EDU
Subject: [mcba@newt.phys.unsw.edu.au: ULTRIX 4.4: Does it still have the ancient "select collision" bug?]
X-Us-Snail: Atria Software Inc., 24 Prime Park Way, Natick, MA  01760

Could this explain any of the various lossage we've seen on the SIPB
service machines?

Newsgroups: comp.unix.ultrix
From: mcba@newt.phys.unsw.edu.au (Michael C. B. Ashley)
Subject: ULTRIX 4.4: Does it still have the ancient "select collision" bug?
Keywords: dinosaur dodo extinct ULTRIX
Nntp-Posting-Host: newt.phys.unsw.edu.au
Organization: University of New South Wales
Date: Fri, 20 May 1994 03:55:29 GMT

Question: has DEC fixed in ULTRIX 4.4 the "select collision" bug that
          has affected ULTRIX for years?  If not, can we please have
          a kernel patch of some sort?

History: the bug was pointed out by Corey Satten in this newsgroup
          in 1993.  I have been troubled by it since 1991.

Effect of bug: it makes ULTRIX almost useless for running a busy
          system.  This is because every time a select collision
          occurs, all processes waiting for a select are swapped into
          memory.  On a busy system (e.g., mine with 50+ users, 500
          processes, 168 Mbytes of memory) this will cause almost
          continual swapping, even though the total active virtual
          memory of processes that should be running is half the
          physical memory.

What programs cause select collisions: logging out of "dxsession"
          will do it, "ghostview" does it, as do numerous other
          programs. The fault is in ULTRIX.

How to fix the problem: a two line patch to the file fifo_gnodeops.c
          will fix the problem.  The patch is included below (it was
          posted to comp.unix.ultrix about a year ago)

How to determine if ULTRIX 4.4 has the bug?: compile and run the
          following program.  I would be interested to hear of your
          results (4.4 hasn't reached this University yet).  Another
          way of examining the effect of the bug is to run "vmstat -v
          5", and if you see spikes in the 'sl' and 'avm' columns,
          you are being affected.  For example:

               procs            memory
          r dw pw  sl  w   vm   avm  rm  arm  fre
          3  0  0  35  0  332k 113k 119k  54k  10k
          3  0  0  31  0  333k 105k 120k  52k  10k
          4  0  0  28  0  333k 107k 120k  53k 9700
          3  0  0  27  0  333k 106k 120k  52k 9404
          4  0  2 195  0  334k 263k 120k 100k 9560 <- spike, causes
          5  0  0 197  0  334k 263k 120k 100k 9416      swapping

Some other questions: do the sites that beta-test ULTRIX know what
          they are doing?  Do they read comp.unix.ultrix?  Does DEC
          read comp.unix.ultrix?  Will they take any more notice of
          bug reports from users regarding OSF/1?
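
The vmstat check described above can be automated roughly as follows.
This is only a sketch: the function name first_sl_spike and the rule
"a sample more than `factor` times the previous one counts as a spike"
are assumptions made for illustration, not anything vmstat or ULTRIX
defines.

```c
#include <stddef.h>

/*
 * Return the index of the first sample in sl[] that exceeds `factor`
 * times the sample before it, or -1 if there is none.
 * The threshold rule is an assumption made for illustration.
 */
static int first_sl_spike(const int *sl, size_t n, int factor)
{
    size_t i;

    for (i = 1; i < n; i++) {
        if (sl[i] > factor * sl[i - 1])
            return (int)i;
    }
    return -1;
}
```

Fed the 'sl' values from the sample output (35, 31, 28, 27, 195, 197)
with a factor of 3, it returns 4, the row marked as the spike.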

-------------------------------------------------------------------

/* A short program to print out the value of the kernel symbol
  "nselcoll", the number of select collisions since boot time.  It
  runs continuously until killed, and tests "nselcoll" every second.
  If it increments, the new value is printed, along with the time,
  and a dummy program is executed. Note, this program doesn't check
  return values from calls; it requires read access to /dev/kmem.

  Compile with

  cc -o nsell nsell.c   (or whatever you want to call it)

  Also, to help find out which process is causing the problem,
  this program executes a dummy program called ./NSELCOLL_______ which
  results from compiling "main () {}". Then by looking at "lastcomm"
  you can readily see the location in time of the select collisions
  in relation to when other processes exited.

  One way to trigger a select collision is simply to logout
  using 'dxsession'. 'ghostview' also causes lots of select
  collisions.

  Tested on ULTRIX 4.3A on a DECstation 5000/260.

  Incidentally, I get about one select collision every 20 seconds,
  this is on a busy system (50+ users, 400-500 processes).

  Michael Ashley / Uni of NSW / mcba@newt.phys.unsw.edu.au / 19 May 1994
*/
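#include <stdio.h>
#include <stdlib.h>     /* system */
#include <fcntl.h>      /* open, O_RDONLY */
#include <unistd.h>     /* lseek, read, sleep */
#include <nlist.h>
#include <time.h>

/* Kernel symbol to look up; nlist() fills in n_value with its address. */
struct nlist nlst[] = {
    { "_nselcoll" },
    { 0 },
};

static int kmem = -1;

/* Read `size` bytes of kernel memory at `offset` into `ptr`. */
static void getkval(unsigned long offset, int *ptr, int size) {
  lseek (kmem, (off_t)offset, SEEK_SET);
  read (kmem, (char *)ptr, size);
}

int main() {
  int i, last;
  time_t now;
  char *c;

  kmem = open("/dev/kmem", O_RDONLY);
  nlist ("/vmunix", nlst);
  last = 0;
  while (1) {
    /* Poll the collision counter once a second. */
    getkval (nlst[0].n_value, &i, sizeof(i));
    if (i != last) {
      now = time ((time_t *)0);
      /* Leave a marker in the accounting records for "lastcomm". */
      system ("./NSELCOLL_______");
      c = ctime (&now);
      printf ("nselcoll=%d; %s", i, c);
      last = i;
    }
    sleep (1);
  }
}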

----------------------------------------------------------------------------
Here is the patch to fix the problem - if you have source code. In my
opinion DEC should provide source code free to all ULTRIX sites. There
are so many bugs in ULTRIX that to get a workable system you *have*
to be able to inspect/change the source.

*** ../orig/fifo_gnodeops.c  Mon Apr 29 13:16:00 1991
--- ../patched/fifo_gnodeops.c  Mon Mar  8 10:53:10 1993
***************
*** 36,41 ****
--- 36,44 ----

/************************************************************************
   *      Modification History
   *
+  * 13 Jan 93 -- burr@cs via fmf
+  *  Select collision on pipes fix from Nancy Johnson Burr.
+  *
   * 27 Feb 91 -- chet
   *  Fix filesystem timestamping
   *
***************
*** 624,630 ****
  if ((fp->fn_size != 0) || (fp->fn_wcnt == 0))
    ret_val = 1;
  else  {
!   if (fp->fn_rselp)
      fp->fn_flag |= IF_RCOLL;
    else
      fp->fn_rselp = u.u_procp;
--- 627,633 ----
  if ((fp->fn_size != 0) || (fp->fn_wcnt == 0))
    ret_val = 1;
  else  {
!   if (fp->fn_rselp && fp->fn_rselp->p_wchan == (caddr_t)& selwait)
      fp->fn_flag |= IF_RCOLL;
    else
      fp->fn_rselp = u.u_procp;
***************
*** 635,641 ****
  if ((fp->fn_size < PIPE_BUF) || (fp->fn_rcnt == 0))
    ret_val=1;
  else {
!   if (fp->fn_wselp)
      fp->fn_flag |= IF_WCOLL;
    else
      fp->fn_wselp = u.u_procp;
--- 638,644 ----
  if ((fp->fn_size < PIPE_BUF) || (fp->fn_rcnt == 0))
    ret_val=1;
  else {
!   if (fp->fn_wselp && fp->fn_wselp->p_wchan == (caddr_t)& selwait)
      fp->fn_flag |= IF_WCOLL;
    else
      fp->fn_wselp = u.u_procp;
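
For readers without kernel source, the read-side change in the patch
above can be modelled in user space. This is a rough sketch, not
ULTRIX code: struct proc_model, struct fifo_model, note_read_selector
and the value given to IF_RCOLL are made-up stand-ins, and only the
field names fn_rselp, fn_flag and p_wchan are taken from the patch.
The point it tries to show is that the unpatched test flags a collision
whenever some process was recorded earlier, while the patched test also
requires that process to still be asleep on selwait.

```c
#include <stddef.h>

#define IF_RCOLL 0x1            /* illustrative value, not the kernel's */

/* Hypothetical stand-ins for the kernel's process and FIFO structures. */
struct proc_model { const void *p_wchan; };   /* what the process sleeps on */
struct fifo_model { struct proc_model *fn_rselp; int fn_flag; };

static int selwait;             /* only its address is used, as a wait channel */

/*
 * Model of the read-side select bookkeeping.  With patched == 0 any
 * previously recorded selector counts as a collision; with patched != 0
 * it counts only if that process is still asleep on selwait.
 * Returns 1 if a collision was flagged, 0 otherwise.
 */
static int note_read_selector(struct fifo_model *fp,
                              struct proc_model *self, int patched)
{
    int holder_in_select = fp->fn_rselp != NULL &&
        (!patched || fp->fn_rselp->p_wchan == (const void *)&selwait);

    if (holder_in_select) {
        fp->fn_flag |= IF_RCOLL;
        return 1;
    }
    fp->fn_rselp = self;        /* take over as the recorded selector */
    return 0;
}
```

In this model, a stale fn_rselp (a process whose p_wchan is no longer
&selwait) makes the unpatched variant return 1, while the patched
variant returns 0 and records the caller instead. The write side
(fn_wselp, IF_WCOLL) follows the same pattern.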

----------
regards,
Michael

Michael Ashley; Dept. Astrophysics, Uni of NSW; mcba@newt.phys.unsw.edu.au


------- End of Forwarded Message

