[27093] in Athena Bugs


Slow W20 network [from openoffice propagation]

daemon@ATHENA.MIT.EDU (John Hawkinson)
Fri Jun 22 13:49:22 2007

Date: Fri, 22 Jun 2007 13:48:38 -0400 (EDT)
Message-Id: <200706221748.l5MHmc48016681@portnoy.mit.edu>
To: bugs@mit.edu
From: John Hawkinson <jhawk@mit.edu>
X-Spam-Flag: NO
X-Spam-Score: 0.00
Errors-To: bugs-bounces@mit.edu

After chatting with jweiss, who wandered by as I was
wondering, "Where shall I send this?", I've concluded it's
reasonable to send this here, because nothing is likely to
be done in real-time and, if anything, it's a longer-term problem.

Network performance right now on W20 cluster machines is unacceptably
slow, and has been for at least the past hour or so, with ~5-20%
packet loss and ~100ms+ round-trip times to the router (18.187.0.1).

Here's an example pinging an affected workstation from an unaffected one:

[portnoy!jhawk] /afs/net/admin/hosts> ping -s w20-575-87
PING w20-575-87: 56 data bytes
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=0. time=151. ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=1. time=246. ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=2. time=160. ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=3. time=305. ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=4. time=96.3 ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=5. time=132. ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=6. time=91.7 ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=7. time=90.9 ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=8. time=202. ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=10. time=309. ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=11. time=229. ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=12. time=86.5 ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=13. time=90.9 ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=14. time=145. ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=15. time=91.1 ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=16. time=82.5 ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=17. time=60.4 ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=18. time=70.0 ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=19. time=116. ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=20. time=149. ms
64 bytes from W20-575-87.MIT.EDU (18.187.1.248): icmp_seq=21. time=310. ms
^C
----w20-575-87 PING Statistics----
22 packets transmitted, 21 packets received, 4% packet loss
round-trip (ms)  min/avg/max/stddev = 60.4/153.0/310./81.9
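
As a rough cross-check of output like the above, here is a minimal Python sketch that infers loss and round-trip figures from the reply lines alone. The function name `summarize` and the regular expression are illustrative assumptions, not part of any Athena tooling. It assumes Solaris-style `ping -s` output with sequence numbers starting at 0, and it cannot see packets lost after the last reply received.

```python
import re
import statistics

# Matches Solaris-style `ping -s` reply lines, e.g.
# "64 bytes from host (addr): icmp_seq=4. time=96.3 ms"
REPLY = re.compile(r"icmp_seq=(\d+)\.?\s+time=([\d.]+)\s*ms")

def summarize(lines):
    """Return loss and RTT figures inferred from ping reply lines.

    Hypothetical helper: returns None when no reply lines are found.
    """
    rtts = {}
    for line in lines:
        m = REPLY.search(line)
        if m:
            rtts[int(m.group(1))] = float(m.group(2))
    if not rtts:
        return None
    # Assumption: sequence numbers start at 0, so the highest one seen
    # plus one is a lower bound on the number of packets sent.
    sent = max(rtts) + 1
    missing = [s for s in range(sent) if s not in rtts]
    values = list(rtts.values())
    return {
        "sent": sent,
        "received": len(rtts),
        "missing": missing,
        "loss_pct": 100.0 * len(missing) / sent,
        "min_ms": min(values),
        "avg_ms": statistics.mean(values),
        "max_ms": max(values),
    }
```

Fed the reply lines from the transcript above, the `missing` list would contain the one gap visible there, icmp_seq 9.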

The cause of this is presumably that the openoffice update is
overloading some remaining 10 Mbps sections of the cluster
network. Looking at the MRTG graphs, we see the traffic spike
began at ~3:30am and it's now at 17 Mbps (outbound).


I'm not sure what should be done about this, and I find myself
wondering what the status of W20 network upgrades (ostensibly
scheduled for this summer?) is.

It does seem, though, that the desync method of spreading out
updates isn't really working well enough. Maybe it should
be spread out even further, or maybe there should be something
requiring a minimal level of throughput to keep trying?
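
To make those two ideas concrete, here is a minimal Python sketch under stated assumptions. It is not how the existing desync mechanism works. The names `start_delay` and `should_continue`, the hash-based spreading, and the throughput floor are all hypothetical, and the floor is only one reading of "a minimal level of throughput": pause a transfer that has fallen below it and retry later, rather than keep loading a congested segment.

```python
import hashlib

def start_delay(hostname, window_seconds):
    """Map a hostname to a stable delay in [0, window_seconds).

    Hypothetical: hashing the hostname gives each machine a fixed slot,
    so widening the window spreads the same machines further apart.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds

def should_continue(bytes_moved, elapsed_seconds, min_bytes_per_second):
    """Return False once measured throughput drops below the floor."""
    if elapsed_seconds <= 0:
        return True  # nothing measured yet, so keep going
    return bytes_moved / elapsed_seconds >= min_bytes_per_second
```

A caller would sleep for `start_delay(hostname, window)` before starting the update, then check `should_continue` periodically during the transfer and back off when it returns False.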

Somewhat as an aside, it's extremely frustrating to not be able to
log in as root on cluster workstations when the network is flaking out
-- it is very difficult to debug network issues when forced to log in
as a user that takes minutes to log in, and it would be very nice to
log in as a user with no network dependencies and that causes minimal
network activity on login.

--jhawk

