[3271] in Release_7.7_team


Draft of Disconnected Operation White Paper.

daemon@ATHENA.MIT.EDU (Bill Cattey)
Wed May 8 18:30:53 2002

From: Bill Cattey <wdc@MIT.EDU>
To: release-team@mit.edu
Cc: warlord@mit.edu
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Date: 08 May 2002 18:30:50 -0400
Message-Id: <1020897050.18031.34.camel@tokata.mit.edu>
Mime-Version: 1.0

I've finished the first draft of the Disconnected Operation White Paper.

Interestingly it ended up in a different place than I expected.

Please read it and send me your comments.

--wdc

----- operation.txt ----
	Issues in Disconnected Operation
	For UNIX Athena
	by Bill Cattey
	Last updated: $Date: 2002/05/08 22:22:58 $

Introduction:

UNIX Athena has traditionally been deployed on desktop systems that
were assumed always to be connected to the network.  In that
environment, loss of connectivity was considered an error condition,
with the expectation that no work would get done until connectivity
was restored.  The networking infrastructure evolved to a level of
robustness where that error was infrequent enough that no real effort
was expended in coping with disconnected operation.

Nowadays, nomadic computing, using laptop systems, is growing in
importance.  It may well be that such computing comes to dominate.
With laptop computing, the expectation is that network connectivity
will be present only some of the time, and that the hardware is often
put into sleep or hibernate mode to conserve battery power while
avoiding a full system shutdown.

This paper explores the issues in migrating UNIX Athena from the
connected desktop environment to the nomadic, sometimes disconnected
laptop environment.  The exploration first sets out the service levels
a user might expect from Athena in a laptop environment, then reviews
the services that are affected by the switch-over to disconnected
operation.  For each affected service there is a description of the
present service level, and of the work needed to increase the level of
convenience.

The present UNIX Athena system has already been deployed on laptops.
Doing so required an enlightened user who could avoid or recover from
pathological situations.  Ordinary users would not be expected to do
that.  This paper recommends a set of changes to Athena so that it
would be deployable on a laptop owned by a non-wizard, and provide the
right set of services, and service levels.

Service Levels:

Alan Kay, in describing the ideal scenarios for use of an ideal
nomadic computing platform, envisioned a computer that could recover
from being dropped off a cliff by noticing the sudden change of
altitude, and radioing to have a replacement system parachuted over to
the user.  A similarly fanciful service level would be a nomadic
computer that would notice as it moves across internet domains of
control, and automatically change its IP address and have all services
gracefully cope.  Neither of these is what is envisioned for the
current disconnected UNIX Athena service level.  The former requires
too much infrastructure.  The latter requires too deep a re-coding of
network protocols and applications.

The overarching principle guiding the development of Athena UNIX
Disconnected operation is:

    Attempts to do ordinary things will not hang the system.

The principle gives rise to the following use cases that need to
be dealt with for every service.  The goal is that neither the system
as a whole nor the service hangs:

	When the system is powered up in connected or disconnected mode.
	When the network is turned on or off.
	When the system is suspended or resumed, with or without the
	     network present.
	When the system is placed into or taken out of hibernate mode,
	     with or without the network present.

Handling these cases may involve:

    creation of automated housekeeping scripts hooked into the
	laptop startup, shutdown, suspend, resume, hibernate,
	re-animate, network start, and network stop routines.
    delivery of tools that give users the ability to take specific
	actions.
    changes to the operation of certain Athena services, with explicit
	expectation setting about how they behave differently in the
	disconnected environment.

The proposed workplan attempts a pragmatic balance between user
expectation setting, reasonable operation, and resources available to
do the work.
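
As an illustration of the first item, here is a minimal sketch of a
hook registry for the housekeeping scripts.  The event names and the
HookRegistry class are hypothetical, chosen only to mirror the
routines listed above; they are not part of Athena.  The point it
shows is that one failing hook should be recorded and skipped rather
than allowed to stall the remaining ones.

```python
from collections import defaultdict

# Hypothetical event names, mirroring the routines listed in the text.
EVENTS = ("startup", "shutdown", "suspend", "resume",
          "hibernate", "reanimate", "network-start", "network-stop")


class HookRegistry:
    """Keeps an ordered list of housekeeping callables per event."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, event, func):
        if event not in EVENTS:
            raise ValueError("unknown event: %s" % event)
        self._hooks[event].append(func)

    def fire(self, event):
        """Run every hook for the event, in registration order.

        A hook that raises is recorded as failed and the remaining
        hooks still run.
        """
        results = []
        for func in self._hooks[event]:
            try:
                func()
                results.append((func.__name__, "ok"))
            except Exception as exc:
                results.append((func.__name__, "failed: %s" % exc))
        return results
```

A network stop routine would then call fire("network-stop") once and
log the returned list.  Note that this sketch does not bound how long
an individual hook may run; a real implementation would also need a
per-hook time limit to honor the "will not hang" principle.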

Services:

The following services are affected by the shift to disconnected
operation:

    Email delivery
    Network printing
    Kerberized login
    Time synchronization
    Auto update
    Software delivery
    Zephyr instant messaging
    AFS File service
    Backup

Each of these services is discussed in its own subsection.

One overarching recommendation to guide all services:

Recommendation 1:

Shift explicitly from the traditional Athena paradigm of using
network-based services by default, or in preference to local services,
to a nomadic paradigm of using local system capabilities first,
and not going to the network unless specifically asked by the user to
do so.

Email delivery:

Email delivery is actually already a solved case.  The sendmail
program already detects when mail cannot be immediately delivered.
In that case outgoing mail is enqueued, and periodic attempts are
automatically made to deliver what is in the queue.

Note that this departs from the overarching recommendation, but is
already coded and working, and clashes with the model only in a subtle
and non-harmful way.
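
The queue-and-retry behavior can be sketched as follows.  This is not
sendmail's implementation, only a small model of the idea; the
OutgoingQueue class and the send callable are assumptions made for the
example, and a failed delivery is assumed to surface as an OSError.

```python
from collections import deque


class OutgoingQueue:
    """Try to deliver at once; hold the message for later if that fails."""

    def __init__(self, send):
        # send is a caller-supplied callable(message) that raises
        # OSError when the message cannot be delivered right now.
        self._send = send
        self._pending = deque()

    def submit(self, message):
        """Attempt immediate delivery; queue the message on failure."""
        try:
            self._send(message)
            return "sent"
        except OSError:
            self._pending.append(message)
            return "queued"

    def retry(self):
        """Deliver queued messages in order, stopping at the first failure.

        Returns the number of messages delivered on this pass.
        """
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                break
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)
```

A periodic job, or a network start hook, would call retry().  The user
never waits on the network: submit() returns as soon as the single
attempt succeeds or fails.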

Network printing:

Athena enhanced printing to utilize a network-based service, Hesiod,
to supplement the hard-coded file of printers and their capabilities.
The lprng subsystem that implements printing on Athena will look first
in the local file before attempting a Hesiod query, but if the user
misspells the printer name, the recovery is less than perfectly
graceful: Seeing no entry in the local printcap file, lprng will then
make a Hesiod query.  The DNS service, upon which Hesiod is based,
will take 30 seconds or so to time out.

This timeout would be perceived by the user as an application hang.
It could be remedied if the underlying DNS implementation were
augmented to listen to the proper ICMP error reports.

Recommendation 2: Live with the 30 second timeout for a while.  If it
seems significant, get someone onto the augmentation of the DNS
service.
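
Short of changing the DNS implementation, a caller can bound how long
it is willing to wait for the network query.  The sketch below is a
caller-side workaround, not the ICMP-based fix described above, and it
is not how lprng works.  The function names, the dictionary standing
in for the local printcap file, and the remote_lookup callable
standing in for the Hesiod query are all assumptions for the example.

```python
import threading


def call_with_deadline(func, seconds):
    """Run func() in a daemon thread and wait at most `seconds`.

    Returns (True, result) on success, or (False, None) if the call
    raised or had not finished by the deadline.
    """
    box = {}

    def worker():
        try:
            box["value"] = func()
        except Exception as exc:
            box["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(seconds)
    if thread.is_alive() or "error" in box:
        return (False, None)
    return (True, box["value"])


def find_printer(name, local_printcap, remote_lookup, deadline=2.0):
    """Look in the local table first, then try the network briefly."""
    if name in local_printcap:
        return local_printcap[name]
    ok, entry = call_with_deadline(lambda: remote_lookup(name), deadline)
    return entry if ok else None
```

With this arrangement a misspelled printer name costs the user the
deadline (two seconds by default here) rather than the full resolver
timeout.  The abandoned lookup thread keeps running in the background
until the resolver itself gives up, which is the price of not fixing
the resolver.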

In theory, it would be possible to make printing act in analogous
fashion to Email delivery, holding onto the job until the network comes
up and sending it out then.  This would create problems when
communicating with authenticated print servers because the user's
tickets may have expired by the time the network is back up, and
communicating that fact back to a user might be tricky.

Recommendation 3: Implement a print model of, "Fail the job if we
cannot print it now, and if we cannot make contact with the remote
spooler now."
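
A minimal sketch of that fail-fast model follows.  The function names
and the send callable are hypothetical; port 515 is the standard LPD
port, and whether a given Athena print server listens there is an
assumption.  The idea shown is only that the job is refused up front
when a short TCP connection attempt to the spooler does not succeed.

```python
import socket


def spooler_reachable(host, port=515, timeout=3.0):
    """Return True if a TCP connection to the spooler opens in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def submit_job(host, job, send, port=515):
    """Hand the job to send(host, job) only if the spooler answers now."""
    if not spooler_reachable(host, port):
        raise RuntimeError(
            "cannot reach print spooler %s; job not queued" % host)
    return send(host, job)
```

The timeout bounds the connection attempt only.  If the host name must
first be resolved over a dead network, the resolver delay discussed
above still applies before the connection attempt begins.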

Kerberized login:

In analogous fashion to printing, Athena enhanced login to do user
account administration centrally.  Additional administrative work
beyond the UNIX norm became required to enable local accounts to
supersede the centrally administered one.

In a disconnected mode, there is a failure mode quite similar to that
of network printing: login will look first in the local passwd file,
and if there is no such person, it will use the network to
authenticate the user.  If there is no network, there will be a pause.

Athena users normally expect that when they log on, Kerberos
tickets are acquired for them automatically, and that no further
action need be taken to access secure services.  Trying to do this by
default on a sometimes disconnected system is probably the wrong
design.  If the network were disconnected, it would result in long
pauses that appear to the user as a login hang.

Recommendation 4: Set explicit expectations that the default login
mode is to NOT fetch Kerberos tickets at login time.

Issue 1: Consider NOT installing the Athena login, thereby eliminating
timeouts, and explicitly eliminating the expectation that Kerberos
authentication, filesystem attachment, any network service-based
initialization, or auto-update is done by default.

Time synchronization:

Having a background task keep the clock in synch with an external
source is a real convenience feature.  Kerberos will not authenticate
if the client host is more than five minutes out of time synch with
the Kerberos host.  But if there is no network, then there is no way to
communicate with the time synchronization service.

Recommendation 5: Enable time synchronization, but remember to
properly bring it down and up when the network goes down and up.
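
A laptop that has been suspended or off the network for a while may
have drifted past the five minute limit.  As a small sketch, a tool
could compare the local clock against a reference time before trying
to get tickets.  The function names are hypothetical, and how the
reference time is obtained (for example, from the time service once
the network is back) is left out.

```python
from datetime import datetime, timedelta

# The five minute limit stated in the text above.
MAX_SKEW = timedelta(minutes=5)


def clock_skew(local_time, reference_time):
    """Absolute difference between the local and reference clocks."""
    return abs(local_time - reference_time)


def within_kerberos_skew(local_time, reference_time, max_skew=MAX_SKEW):
    """True if the clocks differ by no more than max_skew."""
    return clock_skew(local_time, reference_time) <= max_skew
```

A resume or network start hook could run this check and resynchronize
the clock first when it fails, rather than letting the user discover
the problem as an unexplained authentication failure.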

Auto update:

One of the great values of Athena is that updates take place
automatically, while nobody is trying to get work done.  A hook in
UNIX Athena login takes care of this.

With nomadic computing, it would be a real nuisance if, moments before
someone wanted to disconnect and change locations, a time consuming
update were begun by the system.  So here too, the principle of
disabling automatic use of network services applies.

Recommendation 6: Disable auto update by default on laptops.

There are already explicit commands that a user can use to perform a
manual update.

Issue 2: Consider providing ways to notify nomadic users of the
availability of updates that go beyond the current Athena
implementations.

Software delivery:

Early UNIX Athena systems had very small local disks. Major portions
of the operating system were kept on file servers.  For such systems
it was impossible to do any work when disconnected from the network.
With the advent of big local storage, Athena is shifting the "system
packs" back to the local disk.  Under Linux Athena, there are no
"Athena System Packs".

Recommendation 7: Disconnected operation is only for Athena platforms
that put the "system packs" on the local disk.

To make it easy to maintain software, and to enable the largest number
of customers to get a working configuration, Athena chose to deliver
application software in "lockers" -- an explicitly named collection of
files, that made no assumption about what the underlying technology
was.  A user would explicitly 'attach' a locker to bring specific
functionality online and explicitly 'detach' it to take that
functionality offline.  This arrangement proved to be a real boon when
different versions of application software were needed by users at
different times.  Instead of being stuck with the "version everyone
gets" the user could ask for the specific version required.

Lockers were implemented by network filesystems, so if there were no
network, the contents of the locker would be unavailable.

Recommendation 8: Make sure there are locally installable versions of
important application software that are currently available in
lockers.

Zephyr instant messaging:

On the face of it, it seems pretty silly to expect instant messaging
to work when one is offline.  But perhaps the user has an expectation
that instant messages will behave like Email, and get enqueued.

Internally, Zephyr, the Athena instant messaging system, was designed
with strong assumptions that the network was always on.  Unless
explicit communication to the zephyr server takes place, sudden loss
of connectivity is treated as a network outage or a system crash, with
the expectation that the user's subscriptions and messages should be
held onto for a while in case the user comes back.

Recommendation 8a: Code appropriate tools and scripts to make an
explicit sign on and sign off of zephyr easy for the user.  Add an
explicit sign off to the network shutdown scripts.

Issue 3: Consider a paradigm of NOT signing a user onto Zephyr by
default.

AFS File service:

AFS is the network file service that provides the lion's share of
Athena data delivery and exchange.  Its ability to handle large
numbers of users, and the ease with which logically grouped files can
be replicated on multiple servers, and migrated from one server to
another have proven crucial to Athena's success.  In fact, when an
easy, secure method for publishing web pages was needed, the easiest
solution was to export AFS to the web and tell Athena users, "just
copy it into your www subdirectory".

The use of AFS in Athena has been based on the always connected
assumption.  More so than Zephyr, the AFS internals assume the network
is always connected.  For a while, AFS did not gracefully handle
shutdowns as part of the usual system shutdown process, but that
problem has finally been fixed.

If a user had his or her home directory in AFS, and unplugged the
network, the user would experience some pretty prolonged hanging.

Recommendation 9: Craft appropriate AFS shutdown and startup scripts
and hook them into the network shutdown and startup.  Test and make
sure that hibernate and suspend modes do the right thing here.

Recommendation 10: Set the explicit expectation that the user's home
directory is on the local disk on a laptop system.

Recommendation 11: Lobby for continued improvement in AFS handling
network outages gracefully.

Issue 4: Consider having AFS be explicitly turned on and off by the
user, rather than having it turned on by default when the network is
turned on.

Recommendation 12: Provide a good tool for enabling a user to
synchronize AFS file subtrees, or perhaps entire volumes, to the local
disk, with a good user interface to display and enable user action on
conflicts.
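
To make "conflicts" concrete, here is a minimal sketch of how such a
tool might decide what to do with each file.  It assumes the tool
keeps a record of file digests from the last successful
synchronization (the "base") and compares it with the current local
and AFS copies.  The function names and the four action labels are
made up for the example; this is one common three-way comparison, not
a description of any existing Athena tool.

```python
import hashlib
import os


def snapshot(root):
    """Map each file's path (relative to root) to a SHA-256 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(full, root)] = digest
    return digests


def classify(base, local, remote):
    """Decide an action per path from three path -> digest mappings.

    A path missing from a mapping means the file does not exist there.
    """
    actions = {}
    for path in sorted(set(base) | set(local) | set(remote)):
        b, l, r = base.get(path), local.get(path), remote.get(path)
        if l == r:
            actions[path] = "in-sync"
        elif l == b:
            actions[path] = "take-remote"   # only the AFS side changed
        elif r == b:
            actions[path] = "take-local"    # only the laptop side changed
        else:
            actions[path] = "conflict"      # both changed, differently
    return actions
```

Only the paths marked "conflict" need to be shown to the user; the
rest can be copied (or deleted) without asking.  Reading whole files
to compute digests is simple but slow for large volumes, so a real
tool would likely consult modification times first.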

Backup:

Because of its focus on use of network services by default, Athena was
very slow to offer backup of files on the local disk of an Athena
workstation.  With the advent of large local disks, and now with the
expected need to restore large datasets to repaired or replaced
laptops, backup is a CRUCIAL service.

Issue 5: Think very carefully about user needs and expectations, and
how they balance against infrastructure capabilities.  Consider
offering a collection of pre-defined scripts for backup and restore of
system software, application software, and user datasets.  Consider
using a combination of AFS volume synchronization, and remote backup
solutions such as TSM.
