[1400] in Release_7.7_team
Re: 8.2.8 slowdown
daemon@ATHENA.MIT.EDU (Ted McCabe)
Wed Jul 29 14:34:14 1998
To: "Ask, and it will be given you" <mbarker@MIT.EDU>
Cc: release-team@MIT.EDU, kcr@MIT.EDU, jhawk@MIT.EDU, ops@MIT.EDU,
network@MIT.EDU
In-Reply-To: Your message of "Wed, 29 Jul 1998 12:23:32 EDT."
<9807291628.AA08927@MIT.MIT.EDU>
Date: Wed, 29 Jul 1998 14:34:08 EDT
From: Ted McCabe <ted@MIT.EDU>
[I started composing this before Greg's message, which summarizes
the problem nicely. This message thus just details FTR the effects.]
> however, do we really understand the problem? And what can we do to
> avoid repeating it?
More information: I looked at talos during the morning, at which time
it was the only machine still giving out SDI alerts. It appeared to
be extremely burdened: the fileserver process was running at 30-45%
of CPU. iostat showed that the disk throughput was running below
capacity, but that doesn't mean much.
According to our stats of AFS connections, the heavy update load ended
by 10:30 except in W20 where it continued for about another hour.
Note: Daphne AND Talos are in W20, both had heavy load until nearly
noon.
Backup cloning, which utilizes CPU and disk on the servers, was
affected as follows:
Server    Start time    Duration    Usual duration
ixion        5:39        48 min        30 min
talos        6:29        96 min        24 min
daphne       8:15        63 min        32 min
typhon      10:39        27 min        26 min
Typhon's time is normal, due no doubt to starting so late - a delay
that talos's long clone helped cause. From jhawk's graph (rel-eng
discuss meeting [3825]), it appears that talos tried to clone when
demand for the OS volumes was highest.
One last bit of very impressive data. vos examine <volumename> will
show, among other things, the number of vnode accesses for a volume -
which basically means a count of the number of times clients asked
the server for information about the filesystem in that volume
(information like file contents, size, modification time, ACLs).
On a normal day, the access count for all the volumes on one of our
heavily loaded servers is in the range of 500,000-800,000.
The accesses for the sun4 os volumes alone as of ~noon were:
Talos 4601549
Daphne 4268396
Typhon 3450763
Ixion 3670343
This indicates that the four servers were under an average load
approx. 8-18 times the normal high-load average (this figure is
smaller than what I told some people verbally - the result of still
not having all the info together).
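The 8-18x figure can be reconstructed from the numbers above; a
sketch, assuming the ~noon counts cover roughly half a day and a
normal full day sees 500,000-800,000 accesses (the exact
extrapolation used is an assumption):

```python
# Sketch of the load estimate. Assumptions: the ~noon access counts
# represent about half a day of traffic, and a normal heavily loaded
# server sees 500,000-800,000 accesses across all volumes per day.
noon_accesses = {"talos": 4601549, "daphne": 4268396,
                 "typhon": 3450763, "ixion": 3670343}

normal_low, normal_high = 500_000, 800_000

# Extrapolate each half-day count to a full day, then compare it
# against both ends of the normal daily range.
ratios = {}
for server, count in noon_accesses.items():
    daily = count * 2
    ratios[server] = (daily / normal_high, daily / normal_low)

low = min(r[0] for r in ratios.values())   # ~8.6x (typhon vs 800k)
high = max(r[1] for r in ratios.values())  # ~18.4x (talos vs 500k)
print(f"load was roughly {low:.1f}-{high:.1f} times normal")
```

Rounding those bounds gives the "approx. 8-18 times" range quoted
above.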
This data, plus jhawk's, all indicate that the four servers with
the sun4 OS volumes were operating near capacity. I'd have to see
some more network stats to be sure, but I'd bet that from moment to
moment the "bottleneck" was changing between server CPU, server disk,
and network bandwidth - it looks like we have the three well balanced
right now.
--Ted