[1399] in Release_7.7_team


Re: 8.2.8 slowdown

daemon@ATHENA.MIT.EDU (Greg Hudson)
Wed Jul 29 13:32:14 1998

Date: Wed, 29 Jul 1998 13:32:07 -0400 (EDT)
From: Greg Hudson <ghudson@MIT.EDU>
To: "Ask, and it will be given you" <mbarker@MIT.EDU>
Cc: release-team@MIT.EDU, kcr@MIT.EDU, jhawk@MIT.EDU, ops@MIT.EDU,
        network@MIT.EDU
In-Reply-To: "[1398] in Release_7.7_team"

> however, do we really understand the problem?

People have been complaining about slow logins, so I think it's fairly
safe to assume that machines haven't been updating because of server
or network overload.  The Hesiod resolves are more bothersome; they
could be due to dropped packets (but it would have to be a lot of
dropped packets) or they could be more serious.

> And what can we do to avoid repeating it?

There are a couple of problems for OS updates:

	* We have a lot of bits to move to a lot of machines, and it
	  takes a long time.  This problem only gets better with
	  faster servers and faster networks (and more efficient
	  protocols--we could ditch AFS for updates, but I wouldn't
	  really favor that).

	* If too many machines try to update at once, the servers and
	  server networks become overloaded.  This might decrease the
	  overall efficiency of the process, and it certainly means
	  that machines spend more time inside the update process, as
	  opposed to being functional while other machines are
	  updating.  Plus it means that non-updating machines become
	  slow.

When we first implemented desynchronization, Tom Copetto argued for
instead limiting the number of simultaneously updating machines (which
doesn't require us to guess at how much updates need to be staggered).
Craig and I rejected that approach at the time because it's more
difficult and requires ASO to maintain a server, but I think it's
doable at this point given one structural change to the update.  The
general idea is:

	* update_ws becomes responsible for attaching the new packs.
	  Then we can eliminate the UPDATE_TIMESTAMP hair in
	  getcluster, and do more complicated things in update_ws for
	  desynchronizing the updates.  Also, if update_ws decides to
	  punt the update for whatever reason, the machine doesn't
	  wind up with new packs attached.

	* We have a program called /etc/athena/updatesync or
	  something.  It grabs a TCP connection to a server somewhere
	  and reads a byte.  If it gets a 1 byte, it forks off a
	  background process to hold the connection open and exits
	  successfully; otherwise, it exits with an error status.

	* update_ws (when run as auto_update) runs updatesync and
	  punts if it fails.  It will try again later.

	* After interim reboots, we run update_ws again to grab a TCP
	  connection, but we don't worry if it fails.

	* Before the final update, we run updatesync with a special
	  flag which tells it to (somehow) inform the server that it's
	  done.

	* The server (call it updatesyncd) keeps a count of machines
	  taking the update.  Losing a TCP connection means the client
	  machine rebooted, but not that it's done.  We can time out
	  such machines after M minutes if they don't grab another TCP
	  connection by then.
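
To make the server-side bookkeeping in the list above a bit more
concrete, here is a rough Python sketch of the counting logic only (no
sockets).  The class name, the method names, the use of 0 as the
"refused" byte, and the rule that a machine already in the update gets
its slot back when it reconnects are all assumptions for illustration;
none of this is a worked-out design.

```python
class UpdateSlots:
    """Hypothetical slot counter for updatesyncd (illustration only)."""

    def __init__(self, max_slots, timeout_minutes):
        self.max_slots = max_slots            # N machines may update at once
        self.timeout = timeout_minutes * 60   # the "M minutes", in seconds
        # host -> time its connection was lost, or None while connected
        self.hosts = {}

    def expire(self, now):
        # Drop machines that lost their connection and did not come
        # back within the timeout.
        for host, lost_at in list(self.hosts.items()):
            if lost_at is not None and now - lost_at >= self.timeout:
                del self.hosts[host]

    def connect(self, host, now):
        # Returns the single byte the client would read: 1 means
        # "go ahead", anything else (0 here, by assumption) means "punt".
        self.expire(now)
        if host in self.hosts:
            # Reconnect after an interim reboot: keep the existing slot.
            self.hosts[host] = None
            return b"\x01"
        if len(self.hosts) < self.max_slots:
            self.hosts[host] = None
            return b"\x01"
        return b"\x00"

    def disconnect(self, host, now):
        # A lost connection means the machine rebooted, not that it is
        # done, so the slot stays held until the timeout runs out.
        if host in self.hosts:
            self.hosts[host] = now

    def done(self, host):
        # The client ran updatesync with the "done" flag.
        self.hosts.pop(host, None)
```

A real updatesyncd would call connect() when a client connects,
disconnect() when the connection drops, and done() when the client
reports completion; how that last notification travels is left open
here, as it is above.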

There's a lot of fleshing out to be done (the above description was
more detailed in places than it should have been), but I don't see any
insurmountable obstacles.
