[1180] in Release_7.7_team


Re: Problems with NFS timeouts

daemon@ATHENA.MIT.EDU (Thomas H. Grayson)
Tue Dec 23 19:59:11 1997

To: Jonathon Weiss <jweiss@MIT.EDU>
Cc: athena-rcc@MIT.EDU, ezra@MIT.EDU, f_l@MIT.EDU, haldane@MIT.EDU, jf@MIT.EDU,
        karen@MIT.EDU, mbwall@MIT.EDU, mshiffer@MIT.EDU, network@MIT.EDU,
        nschmidt@MIT.EDU, ops@MIT.EDU, phils@MIT.EDU, release-team@MIT.EDU,
        takehiko@MIT.EDU, tfitz@MIT.EDU, tom@MIT.EDU
In-Reply-To: Your message of "Tue, 16 Dec 1997 20:49:05 EST."
             <199712170149.UAA10328@the-other-woman.MIT.EDU> 
Date: Tue, 23 Dec 1997 19:59:05 EST
From: "Thomas H. Grayson" <thg@MIT.EDU>

Today Karen Walrath and I went over to the electronic classroom in 1-115 to 
test Jonathon Weiss's suggestion to increase the 'timeo' mount parameter from 
Athena's default of 8 tenths of a second to the Solaris TCP default of 100.  With 
timeo=8, we had numerous timeouts--by now a routine occurrence.  When we tried 
timeo=100, we found that all our NFS timeout problems disappeared.
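
For a sense of scale, here is a minimal sketch that converts the timeo values 
mentioned in this thread into seconds.  The function name is made up for 
illustration; the values 8, 11, and 100 come from attach's default and the 
Solaris mount_nfs manpage quoted below.

```python
def timeo_to_seconds(timeo: int) -> float:
    """Convert an NFS 'timeo' mount value (tenths of a second) to seconds."""
    return timeo / 10


# Values taken from this thread; the labels are only descriptive.
for label, timeo in [
    ("attach default", 8),
    ("Solaris default, connectionless", 11),
    ("Solaris default, connection-oriented", 100),
]:
    print(f"{label}: timeo={timeo} -> {timeo_to_seconds(timeo):.1f} s")
```

So the change from timeo=8 to timeo=100 moves the timeout from under a second 
to 10 seconds.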

On a previous occasion I had tried hard mounts on the suggestion of Tom 
Coppeto; this actually only made things worse, as the machines would simply 
hang instead of timing out.

You can set the timeo parameter when adding or attaching with the following 
syntax:

    attach -o timeo=100 locker
    add -a -o timeo=100 locker
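
If you build the -o argument in a script, a small helper along these lines may 
be handy.  This is only a sketch: the function name and the sample option 
string are hypothetical, and it assumes options are a plain comma-separated 
list of name or name=value entries.  It replaces an existing entry rather than 
appending a second one.

```python
def with_mount_option(options: str, name: str, value: str) -> str:
    """Return a comma-separated option string with name=value set.

    An existing 'name=...' entry is replaced in place; otherwise the
    new entry is appended at the end.
    """
    entry = f"{name}={value}"
    result = []
    replaced = False
    for part in options.split(","):
        if not part:
            continue  # skip empty pieces, e.g. from an empty input string
        if part.split("=", 1)[0] == name:
            if not replaced:
                result.append(entry)
                replaced = True
        else:
            result.append(part)
    if not replaced:
        result.append(entry)
    return ",".join(result)


# Hypothetical starting options, for illustration only.
print(with_mount_option("soft,retrans=7,timeo=8", "timeo", "100"))
```

The resulting string is what you would pass after -o in the attach or add 
commands above.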

Karen observed that our timeout problems went from being infrequent to 
continual after we installed Kerberized NFS on our servers (they are Sun Sparc 
5's running Athena 8.1).  One plausible theory is that Kerberized NFS is just 
enough slower than standard Sun NFS that it started pushing lots of requests 
over the timeout threshold.

In any case, I am much relieved to find that this simple tweak makes the 
timeout problem seemingly go away.  (I'll reserve final judgement until we run 
some heavy-duty tests on a loaded network.)  For us, anyway, it looks like we 
are now back in business for the spring term in 1-115.  Hopefully this 
workaround will work for the rest of you too.

Tom Grayson

> 
> *tweeeet* Time-out.  This conversation has been going on in too many
> different places.  This is an attempt to bring everyone up to the same
> level of information.
> 
> 
> 
> The symptoms: Several people have noticed NFS RPC timeouts from various
> clients (in some or all of buildings 1, 4, and 10) connecting to
> various servers (also in various buildings).  Generally, a short
> operation like "ls" will succeed if retried immediately after it
> fails.  To my knowledge these problems exist only between Solaris
> Athena clients and Solaris servers.  If anyone has experienced this
> sort of failure between two machines that aren't both running Solaris,
> please let me know.
> 
> 
> 
> The problem: The problem appears to be due to an interaction between
> the Athena program "attach" and Solaris TCP-based NFS.  Recent
> versions of Solaris will use TCP if possible for NFS transactions, and
> fall back to UDP if necessary.  From the Solaris mount_nfs manpage:
> 
>           timeo=n        Set the NFS timeout to  n  tenths  of  a
>                          second.   The default value is 11 tenths
>                          of a  second  for  connectionless  tran-
>                          sports,  and  100 tenths of a second for
>                          connection-oriented transports.
> 
> However, attach specifies timeo=8 for NFS connections, by default.
> (Thanks to danw for tracking a lot of this down.)
> 
> Someone (but I forget who) wondered why the RPC timeout seemed almost
> instantaneous given that in addition to the .8 s timeout there were
> supposed to be 7 retransmits (according to the defaults attach
> provides).  However, the Solaris mount_nfs manpage notes "For
> connection-oriented transports, this option has no effect because it
> is assumed that the transport will perform retransmissions on behalf
> of NFS."
> 
> 
> 
> 
> Work-arounds: There are several possible work-arounds each with
> various trade-offs.
> 
> 1) Specify -o timeo=100 when attaching a filesystem served by a
> Solaris machine on a Solaris machine.  (tfitz has tried this and says
> that it helps a lot.)
> 
> 2) Modify attach.conf, appending to the "options {nfs}" line:
> ,timeo=100
> on any Solaris clients that you use.  (Obviously, this isn't really
> appropriate for cluster workstations, only departmental ones.)  This
> choice also has the disadvantage that it could cause UDP NFS
> connections to take much longer to time out if there is an
> actual problem.
> 
> 3) Using an approach similar to either 1 or 2 above, make the mount a
> hard mount rather than a soft mount.  Personally, I would recommend
> against this approach, since I haven't heard of it being tested
> (although I have no reason to believe it won't solve the problem), and
> it may cause things to hang if there is an actual outage.
> 
> 
> 
> Addressing the problem: I'm going to look at attach (or get someone
> else to) to try to figure out if there's a good reason not to just
> remove the defaults it is adding in this area.
> 
> 
> 	Jonathon


