[2991] in Athena Bugs
can't run big stuff
daemon@ATHENA.MIT.EDU (don@ATHENA.MIT.EDU)
Thu Aug 24 21:13:37 1989
From: <don@ATHENA.MIT.EDU>
Date: Thu, 24 Aug 89 21:13:25 -0400
To: bugs@ATHENA.MIT.EDU
i've been working on the swap fragmentation/leak bug ("pstat -s") for a week.
most of my effort has been spent on replicating the bug with a simple
program. the test program now can fragment 6 half-meg chunks of swap
into lots of little pieces, in about an hour. that is, even when the
processes are killed and their swap-space is freed, the swap manager doesn't
successfully coalesce all of the virtual memory the processes used.
the test program works by spawning a spaced series of children,
each of whom allocates lots of memory in weird-sized chunks, and then dies.
i tuned the code for obnoxiousness, of course; for example, i got a factor of
10-20 in fragmentation-rate, by making the processes specialize in their
chunk-preference, and by making the big-chunk hogs eat slower than the
little-chunk hogs. this made sense at the time, but i'm damned if i can
explain why. the code is in ~don/vleak2.c .
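a minimal sketch of the kind of test described above, for readers who want to try something similar. this is not the contents of ~don/vleak2.c, which is not reproduced here. the function names (chunk_size, hog, spawn_hogs, pause_ms), the chunk sizes, and the pause lengths are all assumptions made for illustration; only the overall shape (spaced children, odd-sized chunks, specialized big-chunk and little-chunk hogs, big hogs eating slower) comes from the description.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() under strict -std modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* sleep for roughly `ms` milliseconds */
void pause_ms(unsigned ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (long)(ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* hypothetical "weird-sized" chunk picker: never a power of two.
 * little-chunk hogs work in units of 1 KB, big-chunk hogs in units of 64 KB. */
size_t chunk_size(int big, unsigned step)
{
    size_t base = big ? 64 * 1024 : 1024;
    return base * (1 + step % 4) + 37 * (step % 7) + 1;
}

/* one hog: allocate and touch chunks until at least `budget` bytes are in
 * use (or malloc fails). nothing is freed; the memory goes away when the
 * process exits. returns the number of bytes actually touched. */
size_t hog(int big, size_t budget, unsigned pause)
{
    size_t used = 0;
    unsigned step = 0;

    while (used < budget) {
        size_t n = chunk_size(big, step++);
        char *p = malloc(n);
        if (p == NULL)
            break;
        memset(p, 0xA5, n);          /* touch every page so it is really backed */
        used += n;
        pause_ms(pause);
    }
    return used;
}

/* spawn `nchild` hogs, `spacing` ms apart. odd-numbered children specialize
 * in big chunks and eat slower (assumed pauses: 20 ms vs 2 ms per chunk).
 * returns how many children exited cleanly. */
int spawn_hogs(int nchild, size_t budget, unsigned spacing)
{
    int i, ok = 0;

    assert(nchild >= 0);
    for (i = 0; i < nchild; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                   /* out of processes: stop spawning */
        if (pid == 0) {
            int big = i % 2;
            hog(big, budget, big ? 20 : 2);
            _exit(0);                /* die, handing all the memory back */
        }
        pause_ms(spacing);
    }
    while (i-- > 0) {                /* reap exactly the children we forked */
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

calling spawn_hogs(12, 512 * 1024, 500) from a main() would start twelve children half a second apart, each touching about half a meg before dying; comparing "pstat -s" before and after is the obvious check. whether this particular mix fragments swap at anything like the rate reported above is untested.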
not surprisingly, the bug probably depends on process-contention;
adjusting the spawn-rate to maximize the number of surviving children
pretty reliably increased the frag-rate.
interestingly, the bug seems not to be a leak, since running the code
overnight has sometimes led to a lost half-meg chunk being successfully
recovered by coalescence.
the fragmentation also seems to top out at 3 meg on my setup; running the code
overnight doesn't aggravate the situation. also, mark rosenstein has found
that increasing his disk's swap-area size made the problem go away. that is,
it didn't merely delay the crunch, nor did it just make the problem bearable.
so, the good news is that the public clusters probably won't lose all of
their swap, but the bad news is that "reboot" is still the only cheap fix.
i'm treating this as a priority, on the assumption that somebody's gonna
be screaming loud about this by midterm; i hope to know what's going on,
before that time.
please, in further bug-reports on this, note how many of each biggie-process
(xlogin, mwm, emacs, saber, scribe, latex, xterm, etc.) you were running
at the time, at least approximately. also, please note the pstat -s values,
both just before and just after you reboot. i'm interested in how variable
the behavior is.
taking them out of context, i'll quote the berkeley folks' book (p. 12):
"4.3bsd is not perfect. In particular, the virtual-memory system needs to be
completely replaced..."
-love and kisses to all, don