[3289] in Release_7.7_team

Athena Disconnected Operation White Paper Draft 2.

daemon@ATHENA.MIT.EDU (Bill Cattey)
Fri May 24 18:06:21 2002

From: Bill Cattey <wdc@MIT.EDU>
To: source-developers@mit.edu
Cc: release-team@mit.edu
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Date: 24 May 2002 18:06:20 -0400
Message-Id: <1022277980.1310.80.camel@tokata.mit.edu>
Mime-Version: 1.0

I sent the first draft of this document to the Athena Release Team and
got lots of good feedback which I believe I've integrated into the
document. Perhaps a more correct audience is the Source Developers list.

I'm interested in hearing feedback on the recommendations and an
estimate of what it would take to implement them.  In fact, as a
follow-on, Derek Atkins has a draft of an implementation proposal.

So, let the discussion begin!

-wdc

---- enclosure operation.txt ----
	Issues in Disconnected Operation
	For UNIX Athena
	by Bill Cattey
	Last updated: $Date: 2002/05/24 20:43:29 $

Introduction:

UNIX Athena has traditionally been deployed on desktop systems that
were assumed always to be connected to the network.  In that
environment, loss of connectivity was considered an error condition,
with the expectation that no work would get done until connectivity
was restored.  The networking infrastructure evolved to a level of
robustness where that error was infrequent enough that no real effort
was expended in coping with disconnected operation.

Nowadays, portable computing on laptop systems is growing in
importance, and may well come to dominate.  With laptop computing, the
expectation is that network connectivity will be present only some of
the time, and that the hardware will often be put into sleep or
hibernate mode to conserve battery power without a full system
shutdown.

This paper explores the issues in migrating UNIX Athena from the
connected desktop environment to the portable, sometimes disconnected
laptop environment.  It first describes the service levels a user
might expect for Athena in a laptop environment, then reviews the
services that are affected by the switch-over to disconnected
operation.  For each affected service there is a description of the
present service level, and of the work needed to increase the level of
convenience.

The present UNIX Athena system has already been deployed on laptops.
Doing so required an enlightened user who could avoid or recover from
pathological situations.  Ordinary users would not be expected to do
that.  This paper recommends a set of changes to Athena so that it
would be deployable on a laptop owned by a non-wizard, and provide the
right set of services, and service levels.

Service Levels:

Alan Kay, in describing the ideal scenario for use of an ideal
nomadic computing platform, envisioned a computer that could recover
from being dropped off a cliff by noticing the sudden change of
altitude, and radioing to have a replacement system parachuted over to
the user.  A similarly fanciful service level would be a portable
computer that would notice as it moves across internet domains of
control, and automatically change its IP address and have all services
gracefully cope.  Neither of these is what is envisioned for the
current disconnected UNIX Athena service level.  The former requires
too much infrastructure.  The latter requires too deep a re-coding of
network protocols and applications.

The baseline milestone for Athena UNIX Disconnected operation is:

    Attempts to do ordinary things will not hang the system.

Beyond the baseline are enhancements to provide more functionality
with less user intervention.

The following are the use cases that define disconnected operation:

	System power up with or without network present.
	Network turned on or off.
	System suspend and resume, with or without network.
	System entry or exit from hibernate mode with or without
	    network present.

Handling these cases may involve:

    creation of automated housekeeping scripts hooked into the
	laptop startup, shutdown, suspend, resume, hibernate,
	re-animate, network start, and network stop routines.
    delivery of tools that give users the ability to take specific
	actions. 
    changes to the operation of certain Athena services with explicit
	expectation setting for the difference in the different
	environment.
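
As an illustration of the first item, the housekeeping hooks could be
organized around a small registry keyed by event name.  This is only a
sketch under stated assumptions: the event names are taken from the
list above, while the registry, `register`, and `dispatch` are
hypothetical names and not part of any existing Athena tool.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Event names follow the list above; the registry itself is hypothetical.
EVENTS = ("startup", "shutdown", "suspend", "resume",
          "hibernate", "re-animate", "network-start", "network-stop")

_hooks: Dict[str, List[Callable[[], None]]] = defaultdict(list)


def register(event: str, hook: Callable[[], None]) -> None:
    """Attach a housekeeping hook to one of the known events."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    _hooks[event].append(hook)


def dispatch(event: str) -> List[str]:
    """Run every hook registered for `event`.

    A failing hook must not stop the others from running, so failures
    are collected and returned by name instead of being raised.
    """
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    failed = []
    for hook in list(_hooks[event]):
        try:
            hook()
        except Exception:
            failed.append(getattr(hook, "__name__", repr(hook)))
    return failed
```

The platform's own suspend, resume, and network scripts would call
`dispatch` with the matching event name.  Collecting failures rather
than raising keeps one broken hook from blocking the rest, in line
with the baseline that ordinary actions must not hang the system.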

The proposed work plan attempts a pragmatic balance between user
expectation setting, reasonable operation, and resources available to
do the work.

Services:

The following services are affected by the shift to disconnected
operation:

    Email delivery
    Network printing
    Kerberized login
    Time synchronization
    Auto update
    Software delivery
    Zephyr instant messaging
    AFS File service
    Backup

Each of these services is discussed in its own subsection.

One over-arching recommendation to guide all services:

Recommendation 0: Prefer local to remote.

Shift explicitly from the traditional Athena paradigm of using
network-based services by default, in preference to local services,
to a portable paradigm of using local system capabilities first and
not going to the network unless the user specifically asks for it.

Email delivery:

Email delivery enqueues outgoing mail if the network is not present.
However email delivery exhibits two problematic cases:  

1. Some email clients attempt to validate the outgoing host name, and
   the DNS service hangs for a while timing out in that validation
   request.

2. The sendmail demon runs periodically rather than explicitly
   noticing the re-activation of the network connection.  This means
   that either the user experiences a delay, or the user must
   explicitly run the demon to empty the queue.

Recommendation 1:  Investigate DNS improvement.

It may be that users will just have to live with 30 second timeouts
for a while, but some thought should be given to making the DNS
service behave more gracefully.
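
Until the resolver itself improves, a caller can at least bound how
long it waits.  The sketch below does not fix DNS; it only shows one
way for an application to stop waiting after a deadline of its own
choosing.  The function name, the two second default, and the
injectable `resolver` parameter are assumptions made for illustration.

```python
import socket
import threading
from typing import Callable, Optional


def resolve_with_deadline(
    host: str,
    deadline_s: float = 2.0,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Optional[str]:
    """Return the address for `host`, or None if no answer arrives in time.

    The lookup runs in a daemon thread so that a resolver stuck in its
    own long timeout cannot hold up the caller or block process exit.
    """
    result = []

    def worker() -> None:
        try:
            result.append(resolver(host))
        except OSError:
            # Lookup failures (socket.gaierror is an OSError) count as
            # "no answer".
            pass

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(deadline_s)
    return result[0] if result else None
```

The abandoned lookup still runs to completion in the background, so
this trades a visible hang for a little wasted work.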

Recommendation 2: Implement automatic email queue emptying.

It should be easy to detect network state changes and to have that
automatically empty the email queue.
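
A minimal sketch of that behavior, assuming some other component
reports the previous and current network state: the queue is flushed
only on a down-to-up transition.  The function names and the sendmail
path are assumptions; `sendmail -q` is the usual way to ask sendmail
to process its queue once.

```python
import subprocess
from typing import Callable


def flush_mail_queue() -> None:
    # Ask sendmail to process its queue once.  The path is an
    # assumption and may differ from system to system.
    subprocess.run(["/usr/sbin/sendmail", "-q"], check=False)


def on_network_change(
    was_up: bool,
    is_up: bool,
    flush: Callable[[], None] = flush_mail_queue,
) -> bool:
    """Flush the mail queue only when the network has just come up.

    Returns True if a flush was triggered.
    """
    if is_up and not was_up:
        flush()
        return True
    return False
```

Passing `flush` as a parameter lets the transition logic be exercised
without a running sendmail.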

Network printing:

Athena enhanced printing to utilize a network-based service, Hesiod,
to supplement the hard-coded file of printers and their capabilities.
The lprng subsystem that implements printing on Athena will look first
in the local file before attempting a Hesiod query, but if the user
misspells the printer name, the recovery is less than perfectly
graceful: Seeing no entry in the local printcap file, lprng will then
make a Hesiod query.  The DNS service, upon which Hesiod is based,
will take 30 seconds or so to time out.

This timeout would be perceived by the user as an application hang.
It could be remedied if the underlying DNS implementation were
augmented to listen to the proper ICMP error reports. Implementing
recommendation 1 would ease this situation.

There are two approaches to dealing with network printing in the
absence of the network:  fail the job, or enqueue it.

Failing the job is easy to implement, and conceptually simple.  But it
may be that users experience significant inconvenience if it's
significant work to produce the print job.  Usability testing should
answer the question:

Question 1: Do users expect to produce a file to print and get
instant feedback on the success of the job, or would they prefer a
persistent queue like the one used for outgoing email?

It has been reported that under Windows, users expect print jobs to be
queued and have a graphical interface to see pending job status.  It
has been reported that under UNIX, it is commonplace for users to
craft somewhat elaborate command lines to print rather than crafting a
file which is later printed.

Recommendation 3:  Queue the job in version 2.

To minimize the development and time to market, make a few changes to
make sure that the print job fails simply and clearly.  Measure user
response to this implementation, and plan as a follow-on a queuing
system that also presents a user with a GUI status display.

Kerberized login:

In analogous fashion to printing, Athena enhanced login to do user
account administration centrally.  Additional administrative work
beyond the UNIX norm became required on Athena systems to enable local
accounts to supersede the centrally administered one.

In a disconnected mode, there is a failure mode quite similar to that
of network printing: login will look first in the local passwd file,
and if there is no such person, it will use the network to
authenticate the user.  If there is no network, there will be a pause.

Athena users normally expect that when they log on, kerberos tickets
are acquired for them automatically, and that no further action need
be taken to access secure services.  Doing this by default on a
sometimes disconnected system is still possible.

Recommendation 4: Get tickets when you can.

Login should be tuned up to be as graceful as possible about
recovering from network outages.  It would not be difficult to have a
user demon listening for network status events that would be aware of
the user's tickets.  This is a variation on the old "kerberometer"
theme where a window would pop up warning the user that ticket renewal
was required.  The two use cases would be:

    Network comes up, and user has no tickets.
    Network comes up and tickets are near renewal time.
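
The two use cases can be sketched as a small decision function that
such a demon might call on each network event.  The 30 minute renewal
window, the function name, and the action labels are assumptions made
for illustration; expired tickets are treated the same as no tickets.

```python
from datetime import datetime, timedelta
from typing import Optional

# How close to expiry counts as "near renewal time" (an assumption).
RENEWAL_WINDOW = timedelta(minutes=30)


def ticket_action(
    network_up: bool,
    expires: Optional[datetime],
    now: datetime,
) -> str:
    """Decide what to do about the user's tickets.

    `expires` is None when the user holds no tickets at all.
    """
    if not network_up:
        return "wait"        # nothing can be fetched without the network
    if expires is None or expires <= now:
        return "acquire"     # network is up and there are no usable tickets
    if expires - now <= RENEWAL_WINDOW:
        return "renew"       # network is up and tickets are about to expire
    return "none"
```

How the "acquire" and "renew" actions are presented to the user is
deliberately left open here.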

Security and usability testing should answer the question:

Question 2: Should users be prompted for a password to renew their
tickets, or should they be prompted to run the kerberos ticket
fetching utility?

The Windows and Mac Leash user interface should probably set the
standard of usability here.

Time synchronization:

Having a background task keep the clock in sync with an external
source is a real convenience feature.  Kerberos will not authenticate
if the client host is more than five minutes out of time sync with
the kerberos host. But if there is no network, then there is no way to
communicate with the time synchronization service.

Simply setting the time by hand is problematic.  UNIX systems do NOT
gracefully handle time going backwards.  The ntp demon deals with this
case by slowing the clock slightly over a long period, so that a
system clock that has crept ahead gets back in sync.
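
To give a feel for the numbers, the sketch below checks a clock
against the five minute kerberos tolerance mentioned above and
estimates how long a gradual correction would take.  The slew rate of
500 parts per million is an assumed figure used only for illustration,
and both function names are hypothetical.

```python
from datetime import datetime, timedelta

MAX_SKEW = timedelta(minutes=5)   # tolerance described in the text
SLEW_RATE_PPM = 500               # assumed correction rate


def within_kerberos_skew(client: datetime, server: datetime,
                         max_skew: timedelta = MAX_SKEW) -> bool:
    """True if the client clock is close enough to authenticate."""
    return abs(client - server) <= max_skew


def slew_duration_s(offset_s: float, rate_ppm: int = SLEW_RATE_PPM) -> float:
    """Seconds of wall-clock time needed to absorb `offset_s` by slewing.

    At `rate_ppm` parts per million the clock gains or loses
    rate_ppm microseconds per second, so time never runs backwards.
    """
    return abs(offset_s) * 1_000_000 / rate_ppm
```

Under that assumed rate, a clock that is one minute ahead needs
120000 seconds, roughly 33 hours, to drift back into sync.  A laptop
that is online only briefly may therefore never finish a gradual
correction.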

Recommendation 5: Enable time synchronization.

This too is a case where appropriate demons can be created that act
appropriately on network events.

Auto update:

One of the great values of Athena is that updates take place
automatically, while nobody is trying to get work done.  A hook in
UNIX Athena login takes care of this.

With portable computing, it would be a real nuisance if, moments before
someone wanted to disconnect and change locations, a time consuming
update were begun by the system.  So here too, the principle of
disabling automatic use of network services applies.

Recommendation 6: Disable auto update by default on laptops.

There are already explicit commands that a user can use to perform a
manual update.

Requiring users to explicitly update means that there will need to be
strong incentive for users to take this action.  Otherwise it is to be
expected that users will not do it.  (There are documented cases of
students losing doctoral dissertations owing to failure to take steps
to back up their files.)

Recommendation 7: Pop-up announcement of new versions.

Although pop-ups are annoying to users, one way to create an incentive
to update is with annoying pop-ups every time the network comes back
alerting the user that an update is available and that running the
update_ws utility is strongly recommended.  Producing this is low
effort.

Recommendation 8: Create a hot-fix tier of updates.

There are system attacks that can result in harm well beyond the
single system attacked.  A mechanism should be created to allow
automatic update with fixes to block such attacks.  This requires a
disciplined set of fixes that are carefully written so that they take
little time to install, and extra rigorously tested to make sure that
nobody is disrupted by a bug in the fix.  Such an auto-update system
would be driven like the kerberos ticket acquisition system:

       When the net comes up check for required fixes.
       Fetch the required fixes.
       If and only if the fetch completes, install the fixes.
       If a reboot is required, warn the user that a security
	  fix has been installed requiring same.
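
The steps above can be sketched as follows.  The four callables stand
in for mechanisms that do not exist yet, so their names and return
conventions are assumptions.  The point of the sketch is the ordering:
nothing is installed unless every required fix was fetched completely.

```python
from typing import Callable, Iterable


def apply_hotfixes(
    check: Callable[[], Iterable[str]],   # returns IDs of required fixes
    fetch: Callable[[str], bool],         # True if the fix was fully fetched
    install: Callable[[str], bool],       # True if the fix needs a reboot
    warn_reboot: Callable[[], None],      # tells the user a reboot is needed
) -> str:
    """Run one hot-fix cycle; intended to be called when the net comes up."""
    required = list(check())
    if not required:
        return "nothing-to-do"

    # Fetch everything first.  A laptop may lose the network mid-fetch,
    # and a partial set of fixes must never be installed.
    for fix in required:
        if not fetch(fix):
            return "fetch-incomplete"

    reboot_needed = False
    for fix in required:
        if install(fix):
            reboot_needed = True

    if reboot_needed:
        warn_reboot()
        return "installed-reboot-needed"
    return "installed"
```

A cycle that ends in "fetch-incomplete" changes nothing on the system
and can simply be retried the next time the network comes up.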

Software delivery:

Early UNIX Athena systems had very small local disks. Major portions
of the operating system were kept on file servers.  For such systems
it was impossible to do any work when disconnected from the network.
With the advent of big local storage, Athena is shifting the "system
packs" back to the local disk.  Under Linux Athena, there are no
"Athena System Packs".

Recommendation 9: Disconnected operation is only for Athena platforms
that put the "system packs" on the local disk.

To make it easy to maintain software, and to enable the largest number
of customers to get a working configuration, Athena chose to deliver
application software in "lockers" -- an explicitly named collection of
files, that made no assumption about what the underlying technology
was.  A user would explicitly 'attach' a locker to bring specific
functionality online and explicitly 'detach' it to take that
functionality offline.  This arrangement proved to be a real boon when
different versions of application software were needed by users at
different times.  Instead of being stuck with the "version everyone
gets" the user could ask for the specific version required.

Lockers were implemented by network filesystems, so if there were no
network, the contents of the locker would be unavailable.

Recommendation 10: Make sure there are locally installable versions of
important application software that are currently available in
lockers.

Zephyr instant messaging:

Zephyr under Windows already gracefully handles network disconnection
and reconnection by enqueuing outgoing messages, and renewing
subscriptions when the network comes back up.  This sort of
functionality could be added to the UNIX platform, leveraging off the
ticket renewal recommendation #4.

Internally, UNIX Zephyr was designed with strong assumptions that the
network was always on.  Unless explicit communication to the zephyr
server takes place, sudden loss of connectivity is treated as a
network outage or a system crash, with the expectation that the user's
subscriptions and messages should be held onto for a while in case the
user comes back.

Recommendation 11: Create appropriate zephyr start-up and shut-down
scripts for the disconnected use cases.

It may be that outgoing messages cannot be enqueued with the present
implementation of UNIX Zephyr.  If so, then we should only implement
the queuing if there is strong user demand.  Auto-renewing of
subscriptions without user intervention should be possible without a
lot of effort.

AFS File service:

AFS is the network file service that provides the lion's share of
Athena data delivery and exchange.  Its ability to handle large
numbers of users, and the ease with which logically grouped files can
be replicated on multiple servers, and migrated from one server to
another have proven crucial to Athena's success.  In fact, when an
easy, secure method for publishing web pages was needed, the easiest
solution was to export AFS to the web and tell Athena users, "just
copy it into your www subdirectory".

The use of AFS in Athena has been based on the always connected
assumption.  More so than Zephyr, the AFS internals assume the network
is always connected.  For a while, AFS did not gracefully handle
shutdowns as part of the usual system shutdown process, but that
problem has finally been fixed.

If a user had his or her home directory in AFS, and unplugged the
network, the user would experience some pretty prolonged hanging.

Recommendation 12: Craft appropriate AFS shut-down and start-up
scripts for the disconnected use cases.

Recommendation 13: Set the explicit expectation that the user's home
directory is on the local disk on a laptop system.

Recommendation 14: Lobby for continued improvement in how gracefully
AFS handles network outages.

This is an effort akin to the DNS work called for in recommendation
#1: it is something we may not be in control of.  Setting user
expectations may be the only recourse we have here for a while.

Recommendation 15: Write a decent tool to synchronize a local
file hierarchy with an AFS file hierarchy.

Such a tool might take functionality clues from diff3, voldump,
synctree, track, and other command line and graphical file hierarchy
reconciliation tools.
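
As a starting point, the first job of such a tool is to report how
the two hierarchies differ.  The sketch below does only that, using
plain file comparison; it does not copy anything, resolve conflicts,
or attempt a three-way merge.  The function names and the report
layout are assumptions, and the "AFS" side is just a second directory
path.

```python
import filecmp
import os
from typing import Dict, List, Set


def list_files(root: str) -> Set[str]:
    """Return every file under `root`, as paths relative to `root`."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def compare_trees(local_root: str, afs_root: str) -> Dict[str, List[str]]:
    """Classify files as local-only, AFS-only, or differing in content."""
    local = list_files(local_root)
    remote = list_files(afs_root)
    differ = sorted(
        path for path in local & remote
        if not filecmp.cmp(os.path.join(local_root, path),
                           os.path.join(afs_root, path),
                           shallow=False)   # compare contents, not just stat
    )
    return {
        "only_local": sorted(local - remote),
        "only_afs": sorted(remote - local),
        "differ": differ,
    }
```

A real tool would also need to decide which side wins for each
differing file, which is where the reconciliation tools named above
become relevant.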

Backup:

Because of its focus on use of network services by default, Athena was
very slow to offer backup of files on the local disk of an Athena
workstation.  With the advent of large local disks, and now with the
expected need to restore large datasets to repaired or replaced
laptops, backup is a CRUCIAL service.

Recommendation 16: Make sure backup is usable and functional.

We should think very carefully about user needs and expectations, and
how they balance against infrastructure capabilities.  Consider
offering a collection of pre-defined scripts for backup and restore of
system software, centrally licensed application software,
user-purchased application software, and user datasets.  Consider
using a combination of AFS volume synchronization, and remote backup
solutions such as TSM.

