[1694] in Hotline Meeting
m66-070-server
daemon@ATHENA.MIT.EDU (David Krikorian)
Thu Sep 13 00:37:47 1990
Date: Thu, 13 Sep 90 00:37:15 -0400
From: David Krikorian <dkk@ATHENA.MIT.EDU>
To: op@ATHENA.MIT.EDU, hotline@ATHENA.MIT.EDU
Cc: oliver@ATHENA.MIT.EDU, kyrlidis@ATHENA.MIT.EDU, jfc@ATHENA.MIT.EDU
In-Reply-To: [0884] in op
Reply-To: dkk@mit.edu
With the invaluable help of Jim Oliver (oliver@athena) and jfc, I was
able to solve the m66-070-server mystery tonight.
The /site partition on the server was filling up periodically:
------------
tisiphone# rsh m66-070-server 'grep "file system full" /usr/adm/messages'
Aug 5 01:26:12 m66-070-server vmunix: /site: file system full
Sep 1 01:36:34 m66-070-server vmunix: /site: file system full
Sep 2 01:36:29 m66-070-server vmunix: /site: file system full
Sep 3 01:36:42 m66-070-server vmunix: /site: file system full
Sep 4 01:38:08 m66-070-server vmunix: /site: file system full
Sep 5 01:37:51 m66-070-server vmunix: /site: file system full
Sep 6 01:41:15 m66-070-server vmunix: /site: file system full
Sep 7 01:41:02 m66-070-server vmunix: /site: file system full
Sep 8 01:39:55 m66-070-server vmunix: /site: file system full
Sep 9 01:40:47 m66-070-server vmunix: /site: file system full
Sep 10 01:40:07 m66-070-server vmunix: /site: file system full
Sep 11 01:44:37 m66-070-server vmunix: /site: file system full
Sep 12 01:43:47 m66-070-server vmunix: /site: file system full
------------
Do those times look familiar? They should. That's during the Moira
update. To be more specific, /site filled up exactly once during every
Moira update since the beginning of the month. Comparing /usr/etc/cred*
on m66-070-server and Cezanne (another RT NFS server) showed that
credentials.pag (the large binary database file used for NFS mappings
from a client) was only about 90% of its expected size on
m66-070-server, so the update was evidently running out of space
before it finished writing the file.
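For anyone who wants to check the timing pattern without eyeballing it, here is a rough sketch in Python. It parses the "file system full" lines quoted above and reports the earliest and latest time of day, plus how many times the message appears per date. The function names and the choice of 1990 as the year are my own assumptions for illustration (syslog lines carry no year); nothing here is an existing Athena tool.

```python
import re
from collections import Counter
from datetime import datetime

# The grep output quoted above, copied verbatim.
SAMPLE_LOG = """\
Aug 5 01:26:12 m66-070-server vmunix: /site: file system full
Sep 1 01:36:34 m66-070-server vmunix: /site: file system full
Sep 2 01:36:29 m66-070-server vmunix: /site: file system full
Sep 3 01:36:42 m66-070-server vmunix: /site: file system full
Sep 4 01:38:08 m66-070-server vmunix: /site: file system full
Sep 5 01:37:51 m66-070-server vmunix: /site: file system full
Sep 6 01:41:15 m66-070-server vmunix: /site: file system full
Sep 7 01:41:02 m66-070-server vmunix: /site: file system full
Sep 8 01:39:55 m66-070-server vmunix: /site: file system full
Sep 9 01:40:47 m66-070-server vmunix: /site: file system full
Sep 10 01:40:07 m66-070-server vmunix: /site: file system full
Sep 11 01:44:37 m66-070-server vmunix: /site: file system full
Sep 12 01:43:47 m66-070-server vmunix: /site: file system full
"""

LINE_RE = re.compile(
    r"^(?P<mon>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+vmunix:\s+(?P<fs>\S+):\s+file system full$"
)


def parse_full_events(text, year=1990):
    """Return (timestamp, host, filesystem) for each 'file system full' line.

    The year is an assumption supplied by the caller; the log omits it.
    """
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a 'file system full' line
        stamp = datetime.strptime(
            f"{year} {match['mon']} {match['day']} {match['time']}",
            "%Y %b %d %H:%M:%S",
        )
        events.append((stamp, match["host"], match["fs"]))
    return events


def time_of_day_range(events):
    """Return the earliest and latest clock times seen across all events."""
    clock_times = [stamp.time() for stamp, _host, _fs in events]
    return min(clock_times), max(clock_times)


def events_per_day(events):
    """Count how many events fall on each calendar date."""
    return Counter(stamp.date() for stamp, _host, _fs in events)


if __name__ == "__main__":
    events = parse_full_events(SAMPLE_LOG)
    earliest, latest = time_of_day_range(events)
    print(f"{len(events)} events between {earliest} and {latest}")
    print("most events on any one day:", max(events_per_day(events).values()))
```

A narrow time-of-day range with at most one event per date is what points at a nightly job rather than at user activity.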
I deleted a couple of old, useless copies of credentials.pag, rebuilt
the database with the mkcred command, and restarted rpc.mountd.
Service is back to its normal flawless state.
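The size comparison that exposed the short credentials.pag could also be repeated as a sanity check after a rebuild. The following Python is only a sketch of that idea: the helper names and the 5% tolerance are assumptions of mine, and it takes for granted that a healthy reference copy (such as the one on another server of the same type) is reachable as a local path.

```python
import os


def size_ratio(actual_bytes, expected_bytes):
    """Return actual size as a fraction of the expected size."""
    if expected_bytes <= 0:
        raise ValueError("expected size must be positive")
    return actual_bytes / expected_bytes


def looks_truncated(actual_bytes, expected_bytes, tolerance=0.05):
    """True when the file is smaller than expected by more than `tolerance`.

    The default tolerance of 5% is an arbitrary assumption, not a known
    property of the credentials database.
    """
    return size_ratio(actual_bytes, expected_bytes) < 1.0 - tolerance


def file_looks_truncated(path, reference_path, tolerance=0.05):
    """Compare a file on disk against a reference copy assumed to be healthy."""
    return looks_truncated(
        os.path.getsize(path), os.path.getsize(reference_path), tolerance
    )
```

With the figures from this incident, a file at about 90% of its expected size is flagged: looks_truncated(90, 100) is true, while looks_truncated(100, 100) is not.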
Lessons:
- It's nearly essential for a user to demonstrate to us interactively
what is failing and how it's failing. We already knew that, but I
thought I'd point it out again. If I hadn't had Jim to work with, I
wouldn't have known where to start looking for problems, because
there are SO MANY possibilities.
*** OPS READ THIS ***
- An RT NFS fileserver MUST have 5meg free on its /site partition for
the Moira credentials update to succeed. A VAX NFS fileserver only
needs about 2meg. Since we need to leave at least 10meg or 15meg
for the rest of the /site partition, that means the bare minimum
size for an RT NFS fileserver /site partition is about 15meg, and
the preferred size should be at least 25meg. (We shouldn't deploy
anything with less.) The server in question had a /site partition
of 28meg.
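The arithmetic above, and a free-space check that could be run before the update, can be sketched in Python as follows. The 5meg and 2meg figures come from the numbers above; the function names, the "rt"/"vax" keys, and treating one meg as 2^20 bytes are assumptions made for illustration.

```python
import shutil

MEG = 1024 * 1024  # assumption: "meg" means 2**20 bytes

# Free space the Moira credentials update needs on /site, per server type.
UPDATE_HEADROOM_MEG = {"rt": 5, "vax": 2}


def minimum_site_meg(arch, rest_meg=10):
    """Update headroom plus whatever the rest of /site needs (10meg at least)."""
    return UPDATE_HEADROOM_MEG[arch] + rest_meg


def has_update_headroom(free_bytes, arch):
    """True when enough space is free for the credentials update to succeed."""
    return free_bytes >= UPDATE_HEADROOM_MEG[arch] * MEG


def site_free_bytes(path="/site"):
    """Free bytes on the filesystem holding `path`, as reported by the OS."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    print("RT bare minimum /site size:", minimum_site_meg("rt"), "meg")
    # On a real server one might run:
    #   has_update_headroom(site_free_bytes("/site"), "rt")
```

For an RT server, minimum_site_meg("rt") gives the 15meg bare minimum quoted above. Note that a large enough partition is not sufficient by itself: the 28meg server still failed, so the free-space check is the one that matters day to day.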