[1400] in Release_7.7_team


Re: 8.2.8 slowdown

daemon@ATHENA.MIT.EDU (Ted McCabe)
Wed Jul 29 14:34:14 1998

To: "Ask, and it will be given you" <mbarker@MIT.EDU>
Cc: release-team@MIT.EDU, kcr@MIT.EDU, jhawk@MIT.EDU, ops@MIT.EDU,
        network@MIT.EDU
In-Reply-To: Your message of "Wed, 29 Jul 1998 12:23:32 EDT."
             <9807291628.AA08927@MIT.MIT.EDU> 
Date: Wed, 29 Jul 1998 14:34:08 EDT
From: Ted McCabe <ted@MIT.EDU>

[I started composing this before Greg's message, which summarizes
the problem nicely.  This message thus just details the effects, for
the record.]

> however, do we really understand the problem?  And what can we do to
> avoid repeating it? 

More information: I looked at talos during the morning, at which
time it was the only machine still giving out SDI alerts.  It
appeared to be extremely burdened.  The fileserver process was
running at 30-45% of CPU.  iostat showed that disk throughput was
running below capacity, but that doesn't mean much.
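(A minimal sketch of that sort of spot-check, in Python - not the
tool actually used at the time.  The ps flags are an assumption,
the modern "-o pcpu= -p PID" syntax, and the pid is whatever the
fileserver process has on the machine.)

    import subprocess
    import time

    def sample_cpu(pid, samples=5, interval=10):
        # Poll %CPU for the given pid every `interval` seconds.
        # Assumes ps(1) accepts "-o pcpu= -p PID" (modern syntax;
        # the 1998-era tools differed).
        for _ in range(samples):
            pcpu = subprocess.run(
                ["ps", "-o", "pcpu=", "-p", str(pid)],
                capture_output=True, text=True, check=True,
            ).stdout.strip()
            print(f"fileserver pid {pid}: {pcpu}% CPU")
            time.sleep(interval)

    # sample_cpu(1234)   # 1234 = hypothetical fileserver pid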

According to our stats of AFS connections, the heavy update load
ended by 10:30, except in W20 where it continued for about another
hour.  Note: Daphne AND Talos are both in W20, and both had heavy
load until nearly noon.

Backup cloning, which utilizes CPU and disk on the servers, was
affected as follows:

Server	Start time	Duration	Usual duration
ixion	5:39		48min		30min
talos	6:29		96min		24min
daphne	8:15		63min		32min
typhon	10:39		27min		26min
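For a sense of scale, the slowdown factors the table implies (a
quick Python check, with the minutes copied from above):

    # Durations in minutes: (this morning, usual).
    clones = {
        "ixion":  (48, 30),
        "talos":  (96, 24),
        "daphne": (63, 32),
        "typhon": (27, 26),
    }
    for server, (actual, usual) in clones.items():
        print(f"{server}: {actual / usual:.1f}x usual duration")
    # ixion 1.6x, talos 4.0x, daphne 2.0x, typhon 1.0x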

Typhon's time is normal, due no doubt to its starting so late - a
delay that talos's long clone helped cause.  From jhawk's graph
(rel-eng discuss meeting [3825]), it appears that talos tried to
clone just when demand for the os volumes was highest.

One last bit of very impressive data.  vos examine <volumename>
will show, among other things, the number of vnode accesses for a
volume - basically a count of the number of times clients asked the
server for information about the filesystem in that volume
(information like file contents, size, modification time, ACL).  On
a normal day, the access count for all the volumes on one of our
heavily loaded servers is in the range of 500,000-800,000.
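(A sketch of pulling that count out mechanically.  The "accesses in
the past day" line is an assumption about what vos examine prints,
and the volume name below is made up.)

    import re
    import subprocess

    def volume_accesses(volname):
        # Run "vos examine" and scrape the daily access count from
        # its output; returns None if the line isn't found.
        out = subprocess.run(
            ["vos", "examine", volname],
            capture_output=True, text=True, check=True,
        ).stdout
        m = re.search(r"(\d+)\s+accesses in the past day", out)
        return int(m.group(1)) if m else None

    # volume_accesses("sun4.os")   # hypothetical volume name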

The accesses for the sun4 os volumes alone as of ~noon were:
Talos	4601549
Daphne	4268396
Typhon	3450763
Ixion	3670343

This indicates that the four servers were under an average load
approximately 8-18 times the normal, high-load average (this figure
is smaller than what I told some people verbally - the result of not
yet having all the information together).
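The arithmetic behind the 8-18 figure, reconstructed: the noon
counts presumably cover only about half a day, so double them before
comparing against the normal full-day range of 500,000-800,000.

    noon_counts = {
        "talos":  4601549,
        "daphne": 4268396,
        "typhon": 3450763,
        "ixion":  3670343,
    }
    for server, count in noon_counts.items():
        full_day = 2 * count        # scale half a day to a full day
        lo = full_day / 800000      # vs. a heavy normal day
        hi = full_day / 500000      # vs. a light normal day
        print(f"{server}: {lo:.0f}-{hi:.0f}x normal")
    # talos comes out around 12-18x and typhon around 9-14x - hence
    # "approximately 8-18 times" across the four servers.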

This data, plus jhawk's, indicates that the four servers with the
sun4 OS volumes were operating near capacity.  I'd have to see some
more network stats to be sure, but I'd bet that from moment to
moment the "bottleneck" shifted among server CPU, server disk, and
network bandwidth - it looks like we have the three well balanced
right now.

   --Ted
