[1170] in Release_7.7_team



Problems with NFS timeouts

daemon@ATHENA.MIT.EDU (Jonathon Weiss)
Tue Dec 16 20:49:11 1997

From: Jonathon Weiss <jweiss@MIT.EDU>
To: athena-rcc@MIT.EDU, ezra@MIT.EDU, f_l@MIT.EDU, haldane@MIT.EDU, jf@MIT.EDU,
        karen@MIT.EDU, mbwall@MIT.EDU, mshiffer@MIT.EDU, network@MIT.EDU,
        nschmidt@MIT.EDU, ops@MIT.EDU, phils@MIT.EDU, release-team@MIT.EDU,
        takehiko@MIT.EDU, tfitz@MIT.EDU, thg@MIT.EDU, tom@MIT.EDU
Date: Tue, 16 Dec 1997 20:49:05 EST

*tweeeet* Time-out.  This conversation has been going on in too many
different places.  This is an attempt to bring everyone up to the same
level of information.

The symptoms: Several people have noticed NFS RPC timeouts from various
clients (in some or all of buildings 1, 4, and 10) connecting to
various servers (also in various buildings).  Generally, a short
operation like "ls" will succeed if retried immediately after it
fails.  To my knowledge these problems exist only between Solaris
Athena clients and Solaris servers.  If anyone has experienced this
sort of failure between two machines that aren't both running Solaris,
please let me know.

The problem: This appears to be due to an interaction between the
Athena program "attach" and Solaris TCP-based NFS.  Recent versions of
Solaris will use TCP if possible for NFS transactions, and fall back
to UDP if necessary.  From the Solaris mount_nfs manpage:

          timeo=n        Set the NFS timeout to  n  tenths  of  a
                         second.   The default value is 11 tenths
                         of a  second  for  connectionless  tran-
                         sports,  and  100 tenths of a second for
                         connection-oriented transports.

However, attach specifies timeo=8 (0.8 seconds) for NFS connections by
default, well below the 10-second default for TCP.
(Thanks to danw for tracking a lot of this down.)
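
To make the units concrete, here is a small Python sketch (not part of
attach or Solaris; the names are made up for illustration) that
converts the timeo values mentioned above from tenths of a second to
seconds:

```python
# Defaults quoted from the Solaris mount_nfs manpage, in tenths of a second.
MANPAGE_DEFAULT_TENTHS = {"udp": 11, "tcp": 100}

# Value that attach passes by default, according to this message.
ATTACH_DEFAULT_TENTHS = 8


def timeo_to_seconds(tenths):
    """Convert an NFS timeo value (tenths of a second) to seconds."""
    return tenths / 10.0


if __name__ == "__main__":
    attach_s = timeo_to_seconds(ATTACH_DEFAULT_TENTHS)
    for transport in sorted(MANPAGE_DEFAULT_TENTHS):
        default_s = timeo_to_seconds(MANPAGE_DEFAULT_TENTHS[transport])
        print("%s: manpage default %.1f s, attach default %.1f s"
              % (transport, default_s, attach_s))
```

For UDP the two values are close (1.1 s versus 0.8 s), but for TCP
attach's value is a small fraction of the manpage default.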

Someone (but I forget who) wondered why the RPC timeout seemed almost
instantaneous given that in addition to the .8 s timeout there were
supposed to be 7 retransmits (according to the defaults attach
provides).  However, the Solaris mount_nfs manpage notes of the
retrans option: "For connection-oriented transports, this option has
no effect because it is assumed that the transport will perform
retransmissions on behalf of NFS."

Work-arounds: There are several possible work-arounds, each with
various trade-offs.

1) On a Solaris client, specify -o timeo=100 when attaching a
filesystem served by a Solaris machine.  (tfitz has tried this and says
that it helps a lot.)

2) Modify attach.conf, appending to the "options {nfs}" line:
,timeo=100
on any Solaris clients that you use.  (Obviously, this isn't really
appropriate for cluster workstations, only departmental ones.)  This
choice also has the disadvantage that it could cause UDP NFS
connections to take much longer to time out if there is an
actual problem.

3) Using an approach similar to either 1 or 2 above, make the mount a
hard mount rather than a soft mount.  Personally, I would recommend
against this approach, since I haven't heard of it being tested
(although I have no reason to believe it won't solve the problem), and
since it may cause things to hang if there is an actual outage.
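
Work-around 2 amounts to a one-line edit.  A minimal Python sketch of
that edit follows, assuming only what is stated above (that
attach.conf has a line beginning with "options {nfs}").  The function
name and the sample line are hypothetical, and the real file's
contents may differ:

```python
def add_nfs_timeo(conf_text, timeo=100):
    """Append ",timeo=N" to each "options {nfs}" line in conf_text.

    Lines that already mention timeo= are left alone, so running the
    function twice does not append the option twice.
    """
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("options {nfs}") and "timeo=" not in stripped:
            line = line.rstrip() + ",timeo=%d" % timeo
        out.append(line)
    result = "\n".join(out)
    # Preserve a trailing newline if the input had one.
    if conf_text.endswith("\n"):
        result += "\n"
    return result


if __name__ == "__main__":
    # Hypothetical sample line; not copied from a real attach.conf.
    sample = "options {nfs} nosuid\n"
    print(add_nfs_timeo(sample), end="")
```

As noted in item 2, a change like this applies to every NFS attach on
the machine, including those that end up using UDP.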

Addressing the problem: I'm going to look at attach (or get someone
else to) to try to figure out if there's a good reason not to just
remove the defaults it is adding in this area.

	Jonathon
