
Re: Athena Disconnected Operation White Paper Draft 2.


To: Bill Cattey <wdc@MIT.EDU>
Cc: source-developers@MIT.EDU, release-team@MIT.EDU
From: Derek Atkins <warlord@MIT.EDU>
Date: 24 May 2002 18:37:53 -0400
In-Reply-To: <1022277980.1310.80.camel@tokata.mit.edu>
Message-ID: <sjmsn4hjhoe.fsf@kikki.mit.edu>


Speaking of which, here is my proposal.  The proposal still has
a bunch of questions, and is still limited in scope.  But I'd
like to hear feedback on the architecture and discussion on the
key open issues.

Thanks,

-derek


Attachment: Disconnected.txt (Disconnected Operations Draft Proposal)


Ideas on disconnected operations

Derek Atkins
2002-05-21
2002-05-24 (v1.2)

0. Introduction and Assumptions:

This document explores a framework for handling network services on a
machine that supports disconnected operations.  In particular it
focuses on status notifications to enable both the system and user
environments to handle the transition between different operational
states.  Specifically, we explore methods to start and stop both
system and user processes when the system changes state (see below for
a list of states).

While what is described herein is necessary for a clean, disconnected
Athena environment, it is specifically NOT sufficient for such an
environment.  This document does NOT explore all the ramifications of
disconnected operations, nor all the requirements for disconnected
operations.  For that discussion please see Bill Cattey's Disconnected
Operations white paper.  Herein we limit discussion solely to handling
network services cleanly in a situation where the network may go away.

We make the following assumptions about the architecture of the
service signaling framework:

* multiple users could be logged in per machine.  While this is an
  unlikely occurrence, one never knows how people will use machines,
  and it is better to engineer for the tougher assumption than to
  under-engineer and have the unlikely situation actually occur.

* each user on the host can/will run a "network daemon" in their
  environment.

* we can insert a script (or set of scripts) that will be run at each
  of the four state transitions:
  - suspend/hibernate
  - resume/wakeup
  - net-up
  - net-down (note that you cannot assume a clean net-down operation,
    as the user may have pulled the plug from the card or ejected a PC
    Card by hand rather than performing a soft shutdown).

The following sections describe the various pieces of the proposed
architecture to signal state transitions and handle service activity.


1. User Network Daemon:

The User Network Daemon process runs in the user's environment and
"subscribes" to state transition events.  When an event occurs, the
network daemon runs appropriate user-defined actions based upon the
type of event.  For example, when the network comes back online (a
transition to net-up), the network daemon can ask the user to renew
their tickets and then re-subscribe to zephyr, push out a print queue,
etc.
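
As a rough illustration, the daemon's dispatch might look like the
following Python sketch.  The event names are the four transitions
listed above; the action commands are made-up placeholders for
user-defined actions, not real programs:

    # Sketch of a user network daemon's event dispatch.  The event
    # names are the four state transitions; the commands are made-up
    # placeholders for user-defined actions.
    import subprocess

    ACTIONS = {
        "net-up":   ["renew-tickets", "zephyr-resubscribe", "flush-queue"],
        "net-down": ["zephyr-unsubscribe"],
        "suspend":  [],
        "resume":   [],
    }

    def handle_event(event):
        for action in ACTIONS.get(event, []):
            subprocess.call([action])   # run each user-defined action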

Question 1: should this application do the Kerberos renewal itself
            (including asking the user for their password), or should
            it let the user do it themselves (and wait for an
            acknowledgment)?

Question 2: What kind of UI should it have?  Should it have a GUI or
            command-line?  If command-line, how do you make sure that
            it has a controlling TTY for user input?


2. Architecture Options:

Standard scripts for going on-net and off-net are run in root's
environment, generally from the PCMCIA card manager daemon.  The
question is: how do you signal all the users' "network daemon"
applications that a network event has occurred?

Three obvious approaches come to mind:

 1) Poll.  Have a world-readable "state file" that gets re-written by
    the various startup/shutdown scripts; each network daemon
    periodically stat()s this file and "wakes up" when it changes.
    Unfortunately, there are a number of problems with this approach.

    First, the daemons must constantly poll the file to see if it
    changed.  For any kind of responsiveness, this poll must happen at
    least once every few seconds.  This means that the application is
    sitting in a sleep loop and always waking up to stat the
    state-file, even when there is nothing to do (which would be most
    of the time).

    Second, depending on the poll interval, there can be a delay
    between the time the network is brought up and the time the daemon
    notices.  This can be offset by reducing the time between polls,
    but that increases the workload of the application.

    Third, there can be a race condition.  If the daemon reads the
    file while it is being re-written, it is possible for the daemon
    to see a partially written state.  Granted, this is a rare case,
    and there are a number of programming workarounds (such as
    writing to a temporary file and renaming it into place), but it
    is still worth mentioning (if for no other reason than to make
    sure the issue is covered).

 2) Use "killall".  Make sure the network daemon has a special, unique
    application name and use "killall <appname> USR1" to send a wakeup
    signal to all the applications.  Using this approach in
    combination with a state file solves both the polling and race
    condition issues.  However, if a user has another application
    similar in name to our network daemon, it is possible that the
    "killall" will signal the wrong processes.

 3) Use a "status daemon".  Create a "network status daemon" which
    sits between the network scripts and the user network daemons.
    This daemon can record its PID in a well-known place (because
    there is only one per system), so the network scripts can signal
    this daemon directly.  The user network daemons connect to the
    status daemon via a well-known unix-domain socket.

    When the network status changes, the network scripts update the
    status file and signal the status daemon.  The status daemon sends
    a message to each attached user network daemon to wake them all
    up.  Finally the network daemons read the status file and perform
    the necessary operations for the user.

Question 3: Which architectural approach should we use?
	    [ Current proposed answer: option #3 ]
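
To make option #3 concrete, here is a minimal Python sketch of the
status daemon.  The socket, state-file, and PID-file paths are all
hypothetical, and a real implementation would also need to handle
permissions, cleanup, and error cases.  The net-up and net-down
scripts would write the new state to a temporary file, rename() it
into place (sidestepping the partial-write race noted under option
#1), and then send SIGHUP to the recorded PID:

    # Minimal sketch of the network status daemon (option #3).
    # All paths are hypothetical.  The net-up/net-down scripts write
    # the state file atomically (write a temp file, then rename it
    # into place) and send SIGHUP to the PID recorded below.
    import os, signal, socket

    SOCK_PATH  = "/var/run/netstatus.sock"    # hypothetical path
    STATE_FILE = "/var/run/netstate"          # hypothetical path
    PID_FILE   = "/var/run/netstatus.pid"     # hypothetical path

    clients = []

    def broadcast(signum, frame):
        # On SIGHUP, re-read the state file and wake every attached
        # user network daemon with a one-line status message.
        state = open(STATE_FILE).read().strip()
        for c in clients[:]:
            try:
                c.sendall((state + "\n").encode())
            except OSError:
                clients.remove(c)             # client went away

    open(PID_FILE, "w").write("%d\n" % os.getpid())
    signal.signal(signal.SIGHUP, broadcast)

    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    srv.bind(SOCK_PATH)
    srv.listen(16)
    while True:
        conn, _ = srv.accept()    # each user network daemon connects
        clients.append(conn)      # here and then blocks reading events

A user network daemon then simply connects to the socket and blocks
in recv(), waking only when an event actually arrives; this avoids
both the polling load and the notification delay of option #1.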


3. Scripts

The most powerful way to handle multiple services is to provide
scripts that run during the various phases (startup, shutdown,
suspend, and resume) to handle the operations of the various
services.

The current plan calls for the user network daemon to run a
system-wide state-change script at every event, and that script will
perform the rest of the work required to change state.  The system
script is supposed to be relatively simple.  It reads the user's
configuration, then processes the event by executing other actions.

There are a few ways to approach the action execution:

 1) One mongo script.  One large script handles all known services and
    all four event conditions.  While this is probably the easiest to
    write, it makes plugging in new services challenging.

 2) One script per state.  Four scripts, each of which handles all
    known services.  This doesn't really solve any problems beyond
    removing a case statement from the script.  Each script still
    needs to know about all services, and adding new services
    requires changing the scripts.

 3) One script per service.  Each service has a single script that
    handles all four states.  This is exactly how init scripts work.
    As you add new services, you just add new scripts.

 4) One script per service per state.  Each service has four scripts,
    each script handling a particular state.  The major benefit of
    this option is the removal of the case statement to choose the
    action based on the event.

The other open issue regarding scripts is how users can override or
control the actions of the global scripts.  There are a few known
ways to handle this:

 a) hooks for user scripts.  If a user script exists, run it from a
    particular place in the global script.  For example, ~/.cshrc.mine
    is called from the system cshrc at a particular place in the
    login, and ~/.environment is called from a different place in the
    login sequence.

    i) A user would have a single file that contains all the override
       variables and choices.  This single file would be called from
       the global system script near the beginning of the process.

    ii) A user would maintain a directory of scripts that get run
        after the system script.  For example, a user could maintain a
        script to do something special with the zephyr service, which
        would happen after the system's zephyr service script was run.

 b) user overrides.  If the user sets a particular variable, then the
    global script will not perform a particular action.  For example,
    a user can set skip_lpr to have the system skip the printer setup.

For options 3 and 4 there is the obvious order-of-operations issue:
How do you make sure that the scripts are run in the proper order for
the particular event?  One possible approach would be similar to init
scripts where the scripts are named with SXX<scriptname> to enforce an
order.  In this case, user scripts would just be named <scriptname>
and would have the same ordering as the system script.
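
For instance (with made-up service names), the run order falls
straight out of a lexicographic sort, and a user's <scriptname> hook
inherits the position of the matching system script.  A tiny Python
sketch:

    # Sketch: run order derived from SXX<scriptname> names (the names
    # are made up).  A user script is named without the SXX prefix
    # and runs in the same slot as the matching system script.
    system_scripts = ["S30lpr", "S10krb", "S20zephyr"]
    for name in sorted(system_scripts):   # S10krb, S20zephyr, S30lpr
        base = name[3:]                   # strip the "SXX" prefix
        print("run %s, then user hook %s (if present)" % (name, base))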


Question 4: which script approach should be used?  If 3 or 4, how do
            we handle the ordering problem?

Question 5: what is the best approach for user extensibility?  A
            related question is how would the system call user scripts
            that don't match system scripts?  Similarly, how can you
            get system and user scripts to intertwine, in particular
            if the user wants to provide hooks before or after the
            system script?  (Perhaps the user's event-handler script
            can define the user scripts?)


Assuming option 3, here is how I envision this working.  A network
event occurs.  The card manager scripts save the network state to the
state file and then signal the status daemon.  The status daemon wakes
up all the user network daemons, which handle the user environment.

Next, each network daemon wakes up and runs the system script
/usr/athena/lib/network/event-handler.  This script looks for
~/.network/event-handler and loads it if it exists.  Next, it runs
through each of the scripts in the system-wide script directory
/usr/athena/lib/network/scripts, which perform the required actions.
Each script calls ~/.network/<scriptname> (if it exists) to get the
user extensions for that service.
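
A sketch of that flow in Python follows.  The two directory paths are
the ones proposed above; everything else is illustrative.  In
particular, the runner here invokes each user extension on the system
script's behalf, and a shell implementation would source (rather than
execute) ~/.network/event-handler so that override variables such as
skip_lpr take effect in the handler itself:

    # Sketch of the proposed event-handler flow.  The two directories
    # come from the proposal; the rest is illustrative.
    import os, subprocess

    SYS_DIR  = "/usr/athena/lib/network/scripts"
    USER_DIR = os.path.expanduser("~/.network")

    def event_handler(event):
        # Load user overrides first.  (A shell implementation would
        # "."-source ~/.network/event-handler so that its override
        # variables take effect in this process.)
        user_handler = os.path.join(USER_DIR, "event-handler")
        if os.path.exists(user_handler):
            subprocess.call([user_handler, event])

        # Run each system service script in SXX order, then the
        # matching user extension (named without the SXX prefix).
        for name in sorted(os.listdir(SYS_DIR)):
            subprocess.call([os.path.join(SYS_DIR, name), event])
            user_script = os.path.join(USER_DIR, name[3:])
            if os.path.exists(user_script):
                subprocess.call([user_script, event])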


4. Recommendations

The scalable approach is to create the status daemon, which is
started at boot time, and have each user connect via their own
network daemon.  While this approach does require an extra daemon to
sit around, it scales readily to multiple users and will be easier to
implement than the alternatives.

For scripts, the easiest approach from an implementation point of view
is a single monolithic script, but from an operational point of view
it is probably best to have either one script per service or one
script per service per state.  Whichever choice is made, the same
choice should apply to user scripts.  The question remains exactly how
this will work and how it will all interact.

Athena can supply the main script and a set of standard scripts for
each of the different services, but still allow users to create their
own scripts as well and use the various override variables to control
the exact operation.  The main system script will execute the user's
"main-script," if it exists, to grab any override variables, and then
will execute the various service scripts (both system and user),
based upon the existence of the scripts and the status of the
overrides.


5. Conclusion

An extensible support framework for disconnected operations is easily
created using the aforementioned architecture.  In the future,
additional services may be added to the system easily, and users can
extend the system themselves.  Similarly, the network daemon could be
augmented to provide a "shut down network" icon for the user, to
simplify the interface to Linux's "cardctl", and to encourage the user
to cleanly shut down network services.

In order to build such a system, a few more requirements must be
detailed.  In particular, the various questions included herein must
be answered, and the logistics of the scripting system need to be
better understood.

This document is a work in progress, and feedback is encouraged.



-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       warlord@MIT.EDU                        PGP key available

