[785] in linux-net channel archive
Re: Strange behaviour with NFS
daemon@ATHENA.MIT.EDU (Craig Metz)
Sat Jul 29 12:43:03 1995
To: Donald Becker <becker@cesdis1.gsfc.nasa.gov>
cc: linux-net@vger.rutgers.edu
In-reply-to: Your message of "Wed, 26 Jul 1995 00:15:09 EDT."
<9507260415.AA10825@cesdis.gsfc.nasa.gov>
Date: Wed, 26 Jul 1995 17:46:48 -0500
From: Craig Metz <cmetz@sundance.itd.nrl.navy.mil>
In message <9507260415.AA10825@cesdis.gsfc.nasa.gov>, Donald Becker writes:
>>I found some strange behaviour according to the nfs filesystem in the
>>kernel.
>>
>>Whenever the nfs server isn't reachable the process on the client
>>machine just hangs around, partially in 'D' status which means
>>non-interruptable.
>I also encountered this problem today. This bug hung a few nodes on our
>Linux cluster. Luckily one still had a few process slots left, enough to
>run a 'ps'. Here are a few notes:
> 1. The processes were in the 'D' disk-wait state. Most were
> swapped out.
> 2. The processes counted toward the load average, but didn't consume
> CPU time. The load average on the still-working machine was >35.
> 3. Doing 'kill -1' and 'kill -9' had no effect. The processes
> didn't even turn into zombies.
>
>The problem apparently started when a few NFS serving nodes were
>unavailable. I assume NFS clients might have timed out. Even when the
>servers were returned to service the processes were still hung.
Let me suggest the following mount options. They've worked well
for me in avoiding this kind of problem:
soft,bg,intr,timeo=2,retrans=2,retry=2,rsize=4096,wsize=4096
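To get a feel for what that option string asks for, here is a rough Python sketch that splits it into fields and estimates how long a soft mount might wait on one request before giving up. The function names are made up for illustration. The timing model is an assumption, not something checked against the kernel source: timeo is taken to be in tenths of a second, the timeout is taken to double after each retransmission, and the request is taken to fail after retrans retransmissions.

```python
def parse_mount_options(opts):
    """Split a comma-separated mount option string into a dict.

    Bare flags (e.g. "soft") map to True; numeric values become ints.
    """
    parsed = {}
    for item in opts.split(","):
        key, sep, value = item.partition("=")
        if not sep:
            parsed[key] = True
        elif value.isdigit():
            parsed[key] = int(value)
        else:
            parsed[key] = value
    return parsed


def worst_case_wait_seconds(timeo, retrans):
    """Estimate the total wait before a soft mount reports an error.

    ASSUMPTIONS (not verified against any kernel): timeo is in tenths
    of a second, the timeout doubles after every retransmission, and
    the request fails once `retrans` retransmissions have timed out.
    """
    total_tenths = 0
    interval = timeo
    for _ in range(retrans + 1):  # first try plus each retransmission
        total_tenths += interval
        interval *= 2
    return total_tenths / 10


opts = parse_mount_options(
    "soft,bg,intr,timeo=2,retrans=2,retry=2,rsize=4096,wsize=4096"
)
estimate = worst_case_wait_seconds(opts["timeo"], opts["retrans"])
```

Under those assumptions the estimate is on the order of a second or two per request, which is why a process on a soft mount sees an error quickly instead of sitting in 'D'.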
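The key here is to bound the retries. As far as I can tell, retrans
limits how many times each request is retransmitted and retry limits how
long the mount attempt itself is retried (I never keep those straight).
Left unbounded, I believe the code will retry ad infinitum.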
This is a "feature" derived from the Sun-written reference code's behavior.
A hard mount retries ad infinitum, as well. Even though Sun recommends you
use hard mounts for anything you can write to, I always use soft mounts.
I've never had a failed write to a dead server cost me work that I
couldn't save somewhere else, but my servers go down just often enough
that having my processes hang on disk I/O is not cool.
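The "save somewhere else" part can be sketched in Python as follows. This is only an illustration: save_with_fallback and its parameters are hypothetical names, and the assumption is that a write that times out on a soft mount surfaces to the program as an OSError with errno EIO (or ETIMEDOUT on some systems).

```python
import errno
import os


def save_with_fallback(data, primary_path, fallback_path,
                       retry_errnos=(errno.EIO, errno.ETIMEDOUT)):
    """Write bytes to primary_path; on an I/O error use fallback_path.

    Returns the path that was actually written. Errors whose errno is
    not listed in retry_errnos are re-raised unchanged.
    ASSUMPTION: a soft-mount timeout shows up as EIO or ETIMEDOUT.
    """
    try:
        with open(primary_path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data out so errors surface here
        return primary_path
    except OSError as exc:
        if exc.errno not in retry_errnos:
            raise
    # The primary write failed with an expected I/O error: save locally.
    with open(fallback_path, "wb") as f:
        f.write(data)
    return fallback_path
```

The fsync call is there because a plain write may only fill a local buffer; without it the error from a dead server might not appear until much later, if at all.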
The processes should recover within some reasonable amount of
time after the server comes back up if you are using soft mounts. With
hard mounts, if the vnode number changes, I'm not sure that they are
supposed to fix themselves. If you're using soft mounts and the processes
aren't recovering, then there is a real problem.
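One way to check whether anything is still stuck after the server returns is to look for processes in the 'D' state. The Python sketch below assumes a Linux /proc in which each /proc/<pid>/stat line starts with the pid, the command name in parentheses, and a one-letter state; the function names are invented for this example.

```python
import os


def parse_stat_state(stat_line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so find the last ')' instead of splitting naively.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren].strip())
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in the 'D' state."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unparsable
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)
```

Running uninterruptible_processes() a few times, some seconds apart, and seeing the same pids each time would suggest the processes really are wedged and not just briefly waiting on I/O.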
-Craig