[1399] in Release_7.7_team
Re: 8.2.8 slowdown
daemon@ATHENA.MIT.EDU (Greg Hudson)
Wed Jul 29 13:32:14 1998
Date: Wed, 29 Jul 1998 13:32:07 -0400 (EDT)
From: Greg Hudson <ghudson@MIT.EDU>
To: "Ask, and it will be given you" <mbarker@MIT.EDU>
Cc: release-team@MIT.EDU, kcr@MIT.EDU, jhawk@MIT.EDU, ops@MIT.EDU,
network@MIT.EDU
In-Reply-To: "[1398] in Release_7.7_team"
> however, do we really understand the problem?
People have been complaining about slow logins, so I think we can
pretty much safely assume that the reason machines haven't been
updating is server or network overload. The Hesiod resolves are more
bothersome; they could be due to dropped packets (but it would have to
be a lot of dropped packets) or they could be more serious.
> And what can we do to avoid repeating it?
There are a couple of problems for OS updates:
* We have a lot of bits to move to a lot of machines, and it
takes a long time. This problem only gets better with
faster servers and faster networks (and more efficient
protocols--we could ditch AFS for updates, but I wouldn't
really favor that).
* If too many machines try to update at once, the servers and
server networks become overloaded. This might decrease the
overall efficiency of the process, and it certainly means
that machines spend more time inside the update process, as
opposed to being functional while other machines are
updating. Plus it means that non-updating machines become
slow.
When we first implemented desynchronization, Tom Copetto argued
instead for limiting the number of simultaneously updating machines
(which doesn't require us to guess at how much updates need to be
staggered).
Craig and I rejected that approach at the time because it's more
difficult and requires ASO to maintain a server, but I think it's
doable at this point given one structural change to the update. The
general idea is:
* update_ws becomes responsible for attaching the new packs.
Then we can eliminate the UPDATE_TIMESTAMP hair in
getcluster, and do more complicated things in update_ws for
desynchronizing the updates. Also, if update_ws decides to
punt the update for whatever reason, the machine doesn't
wind up with new packs attached.
* We have a program called /etc/athena/updatesync or
something. It grabs a TCP connection to a server somewhere
and reads a byte. If it gets a 1 byte, it forks off a
background process to hold the connection open and exits
successfully; otherwise, it exits with an error status.
* update_ws (when run as auto_update) runs updatesync and
punts if it fails. It will try again later.
* After interim reboots, we run updatesync again to grab a TCP
connection, but we don't worry if it fails.
* Before the final update, we run updatesync with a special
flag which tells it to (somehow) inform the server that it's
done.
* The server (call it updatesyncd) keeps a count of machines
taking the update. Losing a TCP connection means the client
machine rebooted, but not that it's done. We can time out
such machines after M minutes if they don't grab another TCP
connection by then.
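To make the handshake concrete, here's a minimal sketch in Python of
what updatesync and updatesyncd might look like. The MAX_UPDATING cap,
the _peer_alive helper, and all ports are made up for illustration,
and the M-minute timeout for rebooted machines is omitted:

```python
import socket
import threading

MAX_UPDATING = 50   # hypothetical cap on simultaneously updating machines


def updatesync(host, port):
    """Client side: ask updatesyncd for an update slot.

    Returns the open socket if the server granted a slot (sent a "1"
    byte), or None if it refused or the connection failed.  A real
    /etc/athena/updatesync would fork a background process to hold the
    socket open for the duration of the update and then exit, so that
    update_ws can continue.
    """
    try:
        sock = socket.create_connection((host, port), timeout=30)
        granted = sock.recv(1) == b"1"
    except OSError:
        return None
    if granted:
        return sock   # hold this open: the server counts open slots
    sock.close()
    return None


def _peer_alive(conn):
    """True if the client on the far end still holds its connection."""
    conn.setblocking(False)
    try:
        data = conn.recv(1)
    except BlockingIOError:
        return True    # nothing to read, connection still up
    except OSError:
        return False
    finally:
        conn.setblocking(True)
    return data != b""  # a recv of b"" means the peer closed


def updatesyncd(srv, limit=MAX_UPDATING):
    """Server side: count updating machines by their open connections.

    Grants a slot ("1") while fewer than `limit` connections are live,
    refuses ("0") otherwise.  The M-minute grace period for machines
    that reboot mid-update and never reconnect is left out here.
    """
    active = []
    while True:
        conn, _ = srv.accept()
        # Drop connections whose client went away (e.g. rebooted).
        active = [c for c in active if _peer_alive(c)]
        if len(active) < limit:
            conn.sendall(b"1")
            active.append(conn)
        else:
            conn.sendall(b"0")
            conn.close()
```

The point of counting open connections rather than handing out timed
leases is that the server never has to guess how long an update takes;
a machine that finishes (or reboots) simply drops its connection and
frees the slot.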
There's a lot of fleshing out to be done (the above description was
more detailed in places than it should have been), but I don't see any
insurmountable obstacles.