[3375] in Release_7.7_team


Rewrite of Disconnected Operation doc.

daemon@ATHENA.MIT.EDU (Bill Cattey)
Fri Jun 28 19:55:28 2002

From: Bill Cattey <wdc@MIT.EDU>
To: source-developers@MIT.EDU, release-team@MIT.EDU, sly@MIT.EDU
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Date: 28 Jun 2002 19:54:44 -0400
Message-Id: <1025308484.10561.38.camel@tokata.mit.edu>
Mime-Version: 1.0

I've rewritten the Disconnected Operations doc to remove the speculative
and fluff-ridden parts and to focus on what we're going to DO.

Heather Anne:  Pay particular attention to section 4.  It's a draft of
what to tell users happens with disconnected operation.

Share and enjoy,

-wdc

	Disconnected Operation for UNIX Athena
	by Bill Cattey
	Last updated: $Date: 2002/06/28 22:35:45 $

1. Introduction:

UNIX Athena has traditionally been deployed on desktop systems that
were assumed always to be connected to the network.  In that
environment, loss of connectivity was considered an error condition,
with the expectation that no work would get done until connectivity
was restored.  The networking infrastructure evolved to a level of
robustness where that error was infrequent enough that no real effort
was expended in coping with disconnected operation.

In response to requests to offer Athena Linux on laptop systems, this
paper describes the issues in providing graceful operation on laptops
across network disconnects, and system sleep or hibernation.  A series
of refinements to the Linux Athena release is presented along with
their rationale and impacts.

When used in a sometimes disconnected mode, many old and well
entrenched development assumptions become invalid.  Providing graceful
operation in this mode involves some compromise.  The goal is to
strike a balance between user expectations and reasonable development
scope.  The approach has been:  Work with what we have, and leave room
to improve subsequently in incremental steps.


2. Service Levels:

The baseline milestone for Athena UNIX Disconnected operation is:

    Attempts to do ordinary things will not hang the system.

Beyond the baseline are enhancements to provide more functionality
with less user intervention.

The following are the use cases that define disconnected operation:

	System power up with or without network present.
	Network turned on or off.
	System suspend and resume, with or without network.
	System entry or exit from hibernate mode with or without
	    network present.

Handling these cases may involve:

    creation of automated housekeeping scripts hooked into the
	laptop startup, shutdown, suspend, resume, hibernate,
	re-animate, network start, and network stop routines.
    delivery of tools that give users the ability to take specific
	actions. 
    changes to the operation of certain Athena services with explicit
	expectation setting for the difference in the different
	environment.
    documentation and training to help customers and support personnel
	recognize and deal with non-obvious behavior.


3. Services:

The following services are affected by the shift to disconnected
operation:

    1. Name service
    2. Email delivery
    3. Network printing
    4. Kerberized login
    5. Time synchronization
    6. Auto update
    7. Software delivery
    8. Zephyr instant messaging
    9. AFS File service
    10. Backup

Each of these services is discussed in its own subsection.

One over-arching recommendation guides all services:

Over-arching recommendation: Prefer local to remote.

Shift explicitly from the traditional Athena paradigm of using
network-based services by default, or in preference to local
services, to a portable paradigm of using local system capabilities
first, and going to the network only when the user specifically asks
for it.


3-1. Name service:

Many network services rely on resolution of names via a network
service.  It is frequently the case that the name service takes half a
minute real-time to decide that there will be no response to a name
resolution request.  Higher level services like web browsing, file
service, etc. tend to pause while waiting for this name service
timeout.  

Recommendation 1:  Investigate name service improvement.

Producing a configuration that reliably resolves names, but does not
incur this delay is not obvious.  If such a configuration becomes
possible, it will be incorporated.  Until then, documentation should
warn users that sometimes there will be a 30 second pause if an
attempt is made to do something with the network when no network is
present.

Another possibility would be modifying the name service implementation
to listen to the proper ICMP error reports.  A rewrite of the name
service seems like too large a development task at this time.
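
Short of a rewrite, a caller can bound how long it is willing to wait
on a lookup.  The following is a minimal Python sketch of that idea,
not part of the Athena release: resolve_with_deadline is a
hypothetical helper, and the two second deadline is an assumed value.
The lookup itself still runs to completion in the background; only the
caller stops waiting.

```python
import socket
import threading


def resolve_with_deadline(name, deadline_s=2.0, resolver=socket.gethostbyname):
    """Return the resolved address, or None if no answer arrives in time.

    resolver is injectable so the behavior can be exercised without a
    network; by default the system resolver is used.
    """
    result = []

    def worker():
        try:
            result.append(resolver(name))
        except OSError:
            pass  # lookup failed outright; treat it like "no answer"

    # A daemon thread does not keep the process alive if the lookup
    # is still stuck in its own timeout when the program exits.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(deadline_s)
    return result[0] if result else None
```

A program using this would report "network unavailable" after the
deadline instead of appearing hung for the full resolver timeout.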


3-2. Email delivery:

Email delivery enqueues outgoing mail if the network is not present.
However email delivery exhibits two problematic cases:  

1. Some email clients attempt to validate the outgoing host name, and
   the name service hangs for 30 seconds timing out in that validation
   request.

2. The sendmail daemon runs periodically rather than explicitly
   noticing the re-activation of the network connection.  This means
   that either the user experiences a delay, or the user must
   explicitly run the daemon to empty the queue.

It may be that users will just have to live with 30 second timeouts
for a while, but some thought should be given to making the name
service behave more gracefully.

Recommendation 2: Implement automatic email queue emptying.

It should be easy to detect network state changes and to have that
automatically empty the email queue.


3-3. Network printing:

Athena enhanced printing to utilize a network-based service, Hesiod,
to supplement the hard-coded file of printers and their capabilities.
The lprng subsystem that implements printing on Athena will look first
in the local file before attempting a Hesiod query, but if the user
misspells the printer name, the recovery is less than perfectly
graceful: Seeing no entry in the local printcap file, lprng will then
make a Hesiod query.  The name service, upon which Hesiod is based,
will take 30 seconds to time out unless we get an implementation of
recommendation 1.

There are two approaches to dealing with network printing in the
absence of the network:  fail the job, or enqueue it.

Failing the job is easy to implement, and conceptually simple.  But it
may be that users experience significant inconvenience if it took
significant work to produce the print job.  Usability testing should
answer the question:

Question 1: Do users expect to produce a file to print, and expect
instant feedback on success of the job, or would they prefer a
persistent queue like the one used for outgoing email?

It has been reported that under Windows, users expect print jobs to be
queued and have a graphical interface to see pending job status.  It
has been reported that under UNIX, it is commonplace for users to
craft somewhat elaborate command lines to print rather than crafting a
file which is later printed.

Recommendation 3:  Queue the job in version 2.

To minimize development effort and time to deploy, the first version
will be configured and validated so that the print job fails simply
and clearly.  We will measure user response to this implementation,
and plan as a follow-on a queuing system that also presents the user
with a GUI status display, if demand justifies the cost.

The implementation of such a queuing system might be something like
this: We would write our own little mini-queuing software.  (Stock
lpr could not handle the queuing in the kerberized case.) lpr would
write to our little queue when the net was off.  When the net is back
up, and the user has tickets, a little GUI would confirm that the user
still wanted to print the queued jobs, or offer the ability to remove
them from the list, and then send them off to the real queue.
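
The flow just described can be sketched in a few lines of Python.
This is an illustration under assumptions, not a design: MiniQueue and
HeldJob are hypothetical names, submit stands in for handing a job to
the real queue, and confirm stands in for the little GUI.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HeldJob:
    printer: str
    path: str


@dataclass
class MiniQueue:
    """Holds print jobs while the net is off (hypothetical helper)."""

    submit: Callable[[HeldJob], None]  # hands a job to the real queue
    held: List[HeldJob] = field(default_factory=list)

    def lpr(self, job, net_up):
        """Send the job now if the net is up; otherwise hold it."""
        if net_up:
            self.submit(job)
            return "sent"
        self.held.append(job)
        return "held"

    def release(self, net_up, has_tickets, confirm):
        """Offer each held job to the user once net and tickets are back."""
        if not (net_up and has_tickets):
            return 0
        sent = 0
        for job in self.held:
            if confirm(job):  # the user still wants this job printed
                self.submit(job)
                sent += 1
        self.held.clear()  # declined jobs are dropped from the list
        return sent
```

Nothing is released until both conditions hold, so a job is never
handed to the real queue without tickets.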


3-4. Kerberized login:

In analogous fashion to printing, Athena enhanced login to do user
account administration centrally.  Additional administrative work
beyond the UNIX norm became required on Athena systems to enable local
accounts to supersede the centrally administered one.

In a disconnected mode, there is a failure mode quite similar to that
of network printing: login will look first in the local passwd file,
and if there is no such person, it will use the network to
authenticate the user.  If there is no network, there will be a pause.

Athena users normally expect that when they log on, kerberos tickets
are acquired for them automatically, and that no further action need
be taken to access secure services.  Doing this by default on a
sometimes disconnected system is still possible, though.

Recommendation 4: Get tickets when you can.

Login should be configured to be as graceful as possible about
recovering from network outages.  It would not be difficult to have a
user daemon listening for network status events that would be aware of
the user's tickets.  Some versions of UNIX Athena pop up a warning
alerting users of impending kerberos ticket expiration.  (This was
done by Dash in older releases, and will be done by authwatch in UNIX
Athena 9.1.)  The three use cases would be:

    Network comes up, and user has no tickets.
    Network comes up and tickets are near renewal time.
    Network comes up with a different IP address invalidating kerberos
        4 tickets.
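
The three use cases above reduce to a small decision made each time
the network comes up.  The Python sketch below shows one way to write
it down.  The function name, the returned labels, and the 15 minute
renewal window are assumptions made for illustration; the document
does not fix any of them.

```python
from typing import Optional


def ticket_action(has_tickets: bool, seconds_left: float,
                  old_ip: Optional[str], new_ip: str,
                  renew_window: float = 900.0) -> Optional[str]:
    """Pick what to do when the network comes up; None means nothing."""
    if not has_tickets:
        return "acquire"    # network comes up, and user has no tickets
    if old_ip is not None and old_ip != new_ip:
        return "reacquire"  # new address invalidates kerberos 4 tickets
    if seconds_left <= renew_window:
        return "renew"      # tickets are near renewal time
    return None
```

The address check comes before the expiry check because tickets tied
to the old address are useless however much lifetime they have left.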

Security and usability testing should answer the question:

Question 2: Should users be prompted for a password to renew their
tickets, or should they be prompted to run the kerberos ticket
fetching utility?


3-5. Time synchronization:

Having a background task keep the clock in sync with an external
source is a real convenience feature.  Kerberos will not authenticate
if the client host is more than five minutes out of time sync with
the kerberos host. But if there is no network, then there is no way to
communicate with the time synchronization service.

Simply setting the time by hand can be problematic.  UNIX systems
don't like it if time goes backward.  The ntp daemon deals with this
case by slowing the clock slightly over a long period, so that a
system clock that has crept ahead comes back into sync.

Recommendation 5: Enable time synchronization.

This too is a case where appropriate daemons can be created that act
appropriately on network events.

In the cases of resume from suspend, and re-animate from hibernate,
the clock often drifts a lot, and needs correction, sometimes to an
earlier time.  Since UNIX has been on hiatus, it does not notice that
time went backward.  Setting the time rather than backing up slowly
is fine.


3-6. Auto update:

One of the great values to Athena is that updates take place
automatically, while nobody is trying to get work done.  

With portable computing, it would be a real nuisance if, moments before
someone wanted to disconnect and change locations, a time consuming
update were begun by the system.  Disabling auto update is one way to
handle this situation, but it has been observed that as soon as
updates require user intervention, they stop happening.  Users, for
the most part, don't keep their systems up to date.

We propose to enhance Athena Linux update in phases:

Recommendation 6: Cache auto update locally for laptops.

Amend the current update to fetch all the new packages to local disk
before doing any update.  If there is insufficient room, the update
fails with error reports. Use a new rc.conf variable to make this
behavior apply only to DISCONNECTABLE configurations.

A further refinement would be to perform updates in phases to allow
continuing an interrupted update and to reduce the amount of local
disk required to perform an update.

Recommendation 7: Create a hot-fix tier of updates.

There are system attacks that can result in harm well beyond the
single system attacked.  A mechanism should be created to allow
automatic update with fixes to block such attacks.  This calls for a
disciplined set of fixes that are carefully written so that they take
little time to install, and tested with extra rigor to make sure that
nobody is disrupted by a bug in the fix.  Such an auto-update system
would be driven like the kerberos ticket acquisition system:

       When the net comes up check for required fixes.
       Fetch the required fixes.
       If and only if the fetch completes, install the fixes.
       If a reboot is required, warn the user that a security
	  fix has been installed requiring same.
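
The four steps above can be sketched as a single pass in Python.  This
is a sketch only: apply_hotfixes and its four callables are
hypothetical stand-ins for whatever the update system would actually
provide.

```python
def apply_hotfixes(check, fetch, install, warn):
    """Run one hot-fix pass when the net comes up.

    check()      -> list of fix names still required
    fetch(fix)   -> True if the fix was fully downloaded
    install(fix) -> True if the fix requires a reboot
    warn(msg)    -> tell the user something
    """
    needed = check()
    if not needed:
        return "nothing to do"
    # Fetch everything first; stop at the first failure so that a
    # partial fetch can never lead to a partial install.
    if not all(fetch(fix) for fix in needed):
        return "fetch incomplete"
    reboot = False
    for fix in needed:
        reboot = install(fix) or reboot
    if reboot:
        warn("A security fix has been installed that requires a reboot.")
    return "installed"
```

If the network drops during the fetch, the pass ends without touching
the installed system and can simply be retried at the next network-up
event.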

A slightly different approach would be to organize the Athena updates
into separate tracks with required hot-fixes being small and quick and
required, and other tracks being available to opt into.  Possible
example tracks: OS updates, Application updates, new feature updates.

Recommendation 8: Consider Pop-up announcement of new versions.

Although pop-ups are annoying to users, one way to create an incentive
to update is with annoying pop-ups every time the network comes back
alerting the user that an update is available and that running the
update_ws utility is strongly recommended.  Producing this is low
effort.

We will need to monitor the update situation and determine:

Question 3: Do a significant number of users keep their laptops too full to
permit updates?

Question 4: Do a significant number of users stay logged in and active on
their laptops such that auto update is not useful?

Question 5: Do users get the appropriate sorts of notification and incentive
for keeping their systems up to date?


3-7. Software delivery:

Early UNIX Athena systems had very small local disks. Major portions
of the operating system were kept on file servers.  For such systems
it was impossible to do any work when disconnected from the network.
With the advent of big local storage, Athena is shifting the "system
packs" back to the local disk.  Under Linux Athena, there are no
"Athena System Packs".

Recommendation 9: Disconnected operation is only for Athena platforms
that put the "system packs" on the local disk.

To make it easy to maintain software, and to enable the largest number
of customers to get a working configuration, Athena chose to deliver
application software in "lockers" -- an explicitly named collection of
files, that made no assumption about what the underlying technology
was.  A user would explicitly 'attach' a locker to bring specific
functionality online and explicitly 'detach' it to take that
functionality offline.  This arrangement proved to be a real boon when
different versions of application software were needed by users at
different times.  Instead of being stuck with the "version everyone
gets" the user could ask for the specific version required.

Lockers were implemented by network filesystems, so if there were no
network, the contents of the locker would be unavailable.

Recommendation 10: Create installers for locker software.

Make sure there are locally installable versions of
important application software that are currently available in
lockers.


3-8. Zephyr instant messaging:

Zephyr under Windows already gracefully handles network disconnection
and reconnection by enqueuing outgoing messages, and renewing
subscriptions when the network comes back up.  This sort of
functionality could be added to the UNIX platform, leveraging off the
ticket renewal recommendation #4.

Internally, UNIX Zephyr was designed with strong assumptions that the
network was always on.  Unless explicit communication to the zephyr
server takes place, sudden loss of connectivity is treated as a
network outage or a system crash, with the expectation that the user's
subscriptions and messages should be held onto for a while in case the
user comes back.

Recommendation 11: Create appropriate zephyr start-up and shut-down
scripts for the disconnected use cases.

It may be that outgoing messages cannot be enqueued with the present
implementation of UNIX Zephyr.  If so, then we should only implement
the queuing if there is strong user demand.  Automatic renewal of
subscriptions, with no user intervention, should be possible without
a lot of effort.

3-9. AFS File service:

AFS is the network file service that provides the lion's share of
Athena data delivery and exchange.  Its ability to handle large
numbers of users, and the ease with which logically grouped files can
be replicated on multiple servers, and migrated from one server to
another have proven crucial to Athena's success.  In fact, when an
easy, secure method for publishing web pages was needed, the easiest
solution was to export AFS to the web and tell Athena users, "just
copy it into your www subdirectory".

The use of AFS in Athena has been based on the always connected
assumption.  More so than Zephyr, the AFS internals assume the network
is always connected.  

If a user had his or her home directory in AFS, and unplugged the
network, the user would experience some pretty prolonged hanging.

Recommendation 12: Craft appropriate AFS shut-down and start-up
scripts for the disconnected use cases.

Recommendation 13: Set the explicit expectation that the user's home
directory is on the local disk on a laptop system.

Recommendation 14: Lobby for continued improvement in AFS handling
network outages gracefully.

This is an effort akin to the name server work called for in
recommendation #1:  it is something we may not be in control of.
Setting user expectations may be the only recourse we have here for a
while.

Recommendation 15: Document for, train for, and encourage use of a
local home directory on laptops, and sensible approaches users can use
to keep them in sync.


3-10. Backup:

Because of its focus on use of network services by default, Athena was
very slow to offer backup of files on the local disk of an Athena
workstation.  With the advent of large local disks, and now with the
expected need to restore large datasets to repaired or replaced
laptops, backup is a CRUCIAL service.

Recommendation 16: Make sure backup is usable and functional.

We should think very carefully about user needs and expectations, and
how they balance against infrastructure capabilities.  Consider
offering a collection of pre-defined scripts for backup and restore of
system software, centrally licensed application software,
user-purchased application software, and user datasets.  Consider
using a combination of AFS volume synchronization, and remote backup
solutions such as TSM.

4. Documentation and Training:

The following behavior needs to be mentioned in Athena documentation
and training:

If you are using Athena on a system that is not always connected to
the network and/or changes network address, you should be aware of the
following:

If you attempt an action that requires the network, some programs may
take as long as half a minute before they're sure that the network
really is not present.  During that time such programs will appear
hung.  Be patient for the first half a minute or so before worrying
that there is a serious problem.

When you move your machine, your network address will probably change.
If it does, you will be asked to renew your kerberos tickets.

If you attempt printing to a network printer while you are
disconnected from the network, your job will not be saved for
later. It will be rejected, and you will receive an alert telling you
so.

If you are a power zephyr user, and use zctl sub to set subscriptions
only for the duration of your login session, those subscriptions will
be lost when you change location.

Remember that you won't have access to AFS so you need to:

Be careful when you run locker software.  When you disconnect from the
network, software started out of an AFS locker will most likely just
die.  Save your work and shut down the programs yourself before you
put your laptop to sleep or shut down your network connection.

Keep your home directory local on your machine, and copy files between
it and your AFS directories.

If you try to do something in AFS while the network is down, that
activity will hang for about two minutes until various bits of AFS
convince themselves that AFS really is off the network.  We recognize
this is not ideal, and are working to improve the situation in the
future.

5. Implementation

Here is the implementation as of June 2002.

Infrastructure:

The following infrastructure bits are to be added to the Athena
release to help implement the configurations described in the specific
areas below.

A new flag is added to /etc/athena/rc.conf: DISCONNECTABLE
This flag is used to tailor the standard Athena release for graceful
operation in a disconnected mode.  It is false by default, and must be
set to true on laptops.
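
Scripts that need to consult the flag could do so along the following
lines.  This Python sketch assumes rc.conf holds shell-style
NAME=value settings, possibly followed by "; export NAME" on the same
line; that format is an assumption here, and read_flag is a
hypothetical helper.

```python
def read_flag(text: str, name: str, default: bool = False) -> bool:
    """Return the boolean value of NAME from rc.conf-style text."""
    value = default
    for line in text.splitlines():
        line = line.split("#", 1)[0]      # drop comments
        for part in line.split(";"):      # allow "X=true; export X"
            key, sep, raw = part.strip().partition("=")
            if sep and key == name:
                value = raw.strip().strip("\"'").lower() == "true"
    return value
```

A missing flag yields the default, which matches the rule that
DISCONNECTABLE is false unless set.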

A new system daemon, athstatusd, has been added to run levels 2, 3, 4,
and 5.  This daemon is responsible for listening for network startup
and shutdown events, and dispatching root-level scripts to take
appropriate action.  athstatusd also signals a user-level daemon
called neteventd.

A user's login init files will start neteventd.  This daemon listens
for news from athstatusd and dispatches user-level scripts to take
appropriate action.

Having these two cooperating daemons makes it easy to do things "as the
user" and "as the root" where appropriate, and neatly finesses tricky
issues that would ordinarily arise if there were multiple users logged
in, or if the root needed things in the user's environment like the
AFS PAG.
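
The division of labor between the two daemons can be pictured with the
following Python sketch.  It models only the fan-out, in a single
process: StatusDaemon and UserListener are hypothetical stand-ins for
athstatusd and neteventd, and the real signalling mechanism between
them is not shown.

```python
class StatusDaemon:
    """Stand-in for the root-level daemon: sees network events first."""

    def __init__(self):
        self.root_hooks = []   # callables run "as the root"
        self.listeners = []    # one per logged-in user

    def network_event(self, state):
        for hook in self.root_hooks:
            hook(state)
        for listener in self.listeners:
            listener.notify(state)


class UserListener:
    """Stand-in for the per-user daemon: runs hooks in the user's context."""

    def __init__(self, user):
        self.user = user
        self.user_hooks = []   # callables run "as the user"

    def notify(self, state):
        for hook in self.user_hooks:
            hook(self.user, state)
```

Each logged-in user gets a separate listener, so a hook that needs
that user's environment never has to be run by the root.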


5-1. Name service:

There are currently no changes to name service.  We just wait for the
timeouts.

We will investigate using athstatusd to switch to a resolv.conf that
redirects name service internally when the network is off.


5-2. Email delivery:

The athstatusd will, when it sees the network has come up, run
sendmail -q and empty any enqueued mail.
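
The point of interest is that the queue run is tied to the transition
from down to up, not to the network merely being up.  A minimal Python
sketch follows, with QueueFlusher as a hypothetical name and the
command runner made injectable so the logic can be tried without
sendmail installed.

```python
import subprocess


class QueueFlusher:
    """Run the mail queue once each time the network goes from down to up."""

    def __init__(self, run=subprocess.call):
        self.run = run
        # Start as "down" so that coming up at boot also flushes the queue.
        self.was_up = False

    def observe(self, is_up: bool) -> bool:
        """Record the current state; flush if the network just came up."""
        came_up = is_up and not self.was_up
        self.was_up = is_up
        if came_up:
            self.run(["sendmail", "-q"])  # process the queue now
        return came_up
```

Repeated "up" reports do not trigger repeated queue runs; only a fresh
transition does.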

We are considering running a local sendmail daemon if DISCONNECTABLE is
true.  If users configured their email clients to send mail to
localhost instead of the usual config of outgoing.mit.edu, there would
be no risk of pauses or lost outgoing email if the user's clients are
ill-behaved.

Testing with netscape will be done to confirm that reasonable
behavior results.  Evolution, pine, mh, and emacs mail all use
sendmail rather than contacting a defined outgoing mail host.


5-3. Network printing: 

Testing is to be done to confirm that jobs are rejected in a timely
manner with a comprehensible alert.

We will monitor user feedback to get insight into:

Question 1: Do users expect to produce a file to print, and expect
instant feedback on success of the job, or would they prefer a
persistent queue like the one used for outgoing email?


5-4. Kerberized login: 

Testing will confirm that logging in with network connected will get
tickets, and will not take extra time to timeout with network
disconnected.

athstatusd, when it sees the network has come up, will prompt the user
to renew tickets if they are expired, or if the system's IP address
has changed.  The prompting will be done by calling out to grenew, a
program in Athena 9.1 that will pop up a window if X is running, or
use the login tty to prompt for a password if it is not.

We understand there is an open question to continue to ponder:

Question 2: Should users be prompted for a password to renew their
tickets, or should they be prompted to run the kerberos ticket
fetching utility?

It is recognized that having users type a password into a pop-up is
non-ideal policy.  It is preferable in this case because it makes for
a simpler user experience as well as being easier to code.  There are
ticket-requiring activities like refreshing zephyr subscriptions that
will block until tickets are renewed.  Allowing the user to
asynchronously renew tickets and then having those activities fire off
may create confusion.

If it is determined that asynchronous renewal of tickets is required,
then we'll implement the policy by having renew and grenew signal
athstatusd or neteventd.



5-5. Time synchronization: 

athstatusd, when it sees the network has come up, will reset the time
if the clock is more than 240 seconds off from the time reported by
time.mit.edu.

ntpd will run as it does on regular Athena UNIX systems.
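
The 240 second rule amounts to a one-line comparison.  The Python
sketch below states it explicitly; clock_action and its return labels
are hypothetical names, and the inputs are assumed to be seconds on a
common time scale.

```python
def clock_action(local_s: float, reference_s: float,
                 threshold_s: float = 240.0) -> str:
    """Return "step" to reset the clock outright, else "leave".

    "leave" means the offset is small enough to be left to ntpd.
    """
    offset = abs(local_s - reference_s)
    return "step" if offset > threshold_s else "leave"
```

Because 240 seconds is inside the five minute limit mentioned in
section 3-5, a clock that passes this check is also close enough for
kerberos to authenticate.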


5-6. Auto update: 

AUTOUPDATE will be set to true by default.  The Athena update script
will have different behavior if DISCONNECTABLE is true:  It will
verify that adequate space is available on the local disk, and fetch
all new RPMs of the update, and confirm they have been copied in
correctly before beginning an update.
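
The order of operations matters here: space check, then fetch, then
verification, and only then the update.  The following Python sketch
shows that order.  The function names are hypothetical, and the use of
SHA-256 checksums to confirm a correct copy is an assumption; the
document does not say how the copy is to be confirmed.

```python
import hashlib


def enough_space(package_sizes, free_bytes, reserve_bytes=0):
    """True if every package fits on the local disk at once."""
    return sum(package_sizes.values()) + reserve_bytes <= free_bytes


def copied_correctly(data: bytes, expected_sha256: str) -> bool:
    """True if the fetched bytes match the published checksum."""
    return hashlib.sha256(data).hexdigest() == expected_sha256


def stage_update(package_sizes, free_bytes, fetch, checksums):
    """Fetch every package, verify each, and only then approve the update.

    package_sizes maps package name to size in bytes, fetch(name)
    returns the package bytes or None, and checksums maps package name
    to the expected SHA-256 hex digest.
    """
    if not enough_space(package_sizes, free_bytes):
        return "insufficient space"
    for name in package_sizes:
        data = fetch(name)
        if data is None or not copied_correctly(data, checksums[name]):
            return "fetch failed: " + name
    return "ready"
```

Only a "ready" result would let the update proper begin, so a laptop
unplugged partway through the fetch is left running its old release.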

We will monitor the update situation and try to gain insight into the
questions:

Question 3: Do a significant number of users keep their laptops too full to
permit updates?

Question 4: Do a significant number of users stay logged in and active on
their laptops such that auto update is not useful?

Question 5: Do users get the appropriate sorts of notification and incentive
for keeping their systems up to date?


5-7. Software delivery: 

At the present time there are no explicit plans for installers for
local software.  We will monitor demand.  For software in demand on
off-net laptops for which licenses permit, we will produce installers.
This activity will be coordinated through the Software Release Team.


5-8. Zephyr instant messaging: 

zhm startup will be modified so that, on systems with DISCONNECTABLE
set to true, zhm will be started by athstatusd rather than the
ordinary init files.

athstatusd will signal neteventd when the network comes up to use zctl
load to reload subscriptions for the user.


5-9. AFS File service: 

Athena 9.1 will ship with OpenAFS using the new dynroot and afsdb
features which enable AFS to start gracefully without network present,
and prevent long hangs when file browsers decide they want to list
/afs.

The support is not perfect.  When users try to do something in AFS
while the network is offline, the user's shell will hang for about two
minutes until AFS times out.


5-10. Backup: 

TSM will be tested to make sure that documentation and performance are
adequate to the task of working with user datasets on laptops.


