[19503] in Athena Bugs
sun4 9.0.13: KNFS and syslogd
daemon@ATHENA.MIT.EDU (0000-Admin(0000))
Tue Jul 31 18:18:03 2001
Message-Id: <200107312218.SAA08484@weepecket.mit.edu>
To: bugs@MIT.EDU
Cc: cavin@mit.deu
Date: Tue, 31 Jul 2001 18:18:00 -0400
From: 0000-Admin(0000) <root@weepecket.mit.edu>
System name: weepecket.mit.edu
Type and version: Ultra-5_10 9.0.13 (with mkserv)
Display type: ffb
Shell: /bin/athena/tcsh
Window manager: unknown
What were you trying to do?
Use a private Athena workstation as a KNFS server, and access
files from a second private Athena workstation.
What's wrong:
A "normal" user job on Cuttyhunk (Athena 8.4.25; it is not
taking the upgrade for some reason I have not yet checked) is
doing I/O on Weepecket. The job does a lot of I/O (many MB,
possibly up to a GB, in files of about 1 MB each) and seems to
be running normally, but the file server, Weepecket, is having
problems.
Running "dmesg" on Weepecket shows about 200 of these messages,
which seem to have filled up a buffer or pipe. Syslogd is also
taking a lot of CPU cycles.
Jul 31 17:33:05 weepecket.mit.edu nfssrv: [ID 785379 kern.notice] NFS server getting AKred for uid 30559
Jul 31 17:33:05 weepecket.mit.edu nfssrv: [ID 719455 kern.notice] NFS server got AKred for uid 30559
Jul 31 17:33:05 weepecket.mit.edu nfssrv: [ID 785379 kern.notice] NFS server getting AKred for uid 30559
Jul 31 17:33:05 weepecket.mit.edu nfssrv: [ID 719455 kern.notice] NFS server got AKred for uid 30559
Jul 31 17:33:05 weepecket.mit.edu nfssrv: [ID 785379 kern.notice] NFS server getting AKred for uid 30559
Jul 31 17:33:05 weepecket.mit.edu nfssrv: [ID 719455 kern.notice] NFS server got AKred for uid 30559
Jul 31 17:33:05 weepecket.mit.edu nfssrv: [ID 785379 kern.notice] NFS server getting AKred for uid 30559
Jul 31 17:33:05 weepecket.mit.edu nfssrv: [ID 719455 kern.notice] NFS server got AKred for uid 30559
Jul 31 17:33:05 weepecket.mit.edu nfssrv: [ID 785379 kern.notice] NFS server getting AKred for uid 30559
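To quantify the flood rather than eyeball it, the repeated lines can be tallied per uid. The following is only a minimal sketch: the pattern is modeled on the lines quoted above, and the function name tally_akred is hypothetical, not part of any Athena or Solaris tool.

```python
import re
import sys
from collections import Counter

# Pattern modeled on the kern.notice lines quoted above (an assumption:
# other nfssrv messages may use a different layout and are simply skipped).
LINE_RE = re.compile(
    r"nfssrv: \[ID (?P<msgid>\d+) kern\.notice\] "
    r"NFS server (?P<state>getting|got) AKred for uid (?P<uid>\d+)"
)


def tally_akred(lines):
    """Count 'getting'/'got' AKred messages per (uid, state) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("uid"), match.group("state"))] += 1
    return counts


if __name__ == "__main__":
    # Reads log text from standard input, e.g. piped from dmesg.
    for (uid, state), n in sorted(tally_akred(sys.stdin).items()):
        print(f"uid {uid}: {n} x '{state} AKred'")
```

Assuming the script is saved under a name such as tally_akred.py (hypothetical), it could be fed the dmesg output on standard input. A "getting" count that runs well ahead of the "got" count for a uid might suggest requests that never complete, though that interpretation is a guess.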
Other problems include users being unable to log out (the
system hung, and the only way to clear the sessions was to kill
all their processes, then send a HUP to the dm process), and
general unresponsiveness.
At some point earlier in the day, messages were being written
to the console, interfering with a user session there. The
console was spewing a message about something being full and
asking whether syslogd was still running (it was).
Restarting syslogd cleared the messages from the console, but
also left "xterm -C" unable to connect to /dev/console.
What should have happened:
As far as I know, there shouldn't be any errors or warnings
from these operations.
Please describe any relevant documentation references:
Some of this may be the result of too many users on the system
at one time. One user in particular started 3 jobs at the same
time (an scp and 2 bzip2, all three on GB-sized files), all on
the same disk. (This lab is working on getting
more machines, but right now they have 10 people, one Ultra 10
as a file server, and very large jobs.)
The real question is whether this is normal behavior for this
configuration under extreme load, or if this is some more
general problem with the Athena 9.0 release or the interaction
between 9.0 and 8.4.
I'd like to get the lab to reboot the system, but I'd also like
to have some idea whether it is likely to make a difference.
Thanks,
--Tom