[1790] in Hotline Meeting


The turnin server, eos, went down today.

daemon@ATHENA.MIT.EDU (Bill Cattey)
Thu Sep 20 22:08:48 1990

Date: Thu, 20 Sep 1990 22:08:07 -0400 (EDT)
From: Bill Cattey <wdc@ATHENA.MIT.EDU>
To: hotline@ATHENA.MIT.EDU
Cc: lwvanels@ATHENA.MIT.EDU, dot@ATHENA.MIT.EDU, lavin@ATHENA.MIT.EDU,

The host eos.mit.edu, an RT running the Athena turnin service, got hung
with "the 94 bug".  That's when an RT decides it's not going to do
anything except put "94" on its LED display.

Lou took the call and was ready to come over to e40 once I told him that
there was an Athena service called turnin, and that it was running on a
host named eos, and that that host appeared to be hung.

Ron Hoffman let me in downstairs, and I rebooted the machine.
The machine came back up but had a software problem, which I was able to
correct.  (We'll fix it so that software problem doesn't occur again.)

The turnin service is back up.

----  (I don't know if the rest of this is, strictly speaking,
appropriate for hotline, but I want the information to be widely
disseminated, and I want lots of people thinking about it.) ----

There are some things that we should do to make this service more reliable:

1. Brief the people who will be answering hotline about the turnin server:
    a.  Athena is offering a new kind of network service: turnin.
    b.  It is served from exactly one host, eos.mit.edu.
    c.  This host is an RT gigabox in the basement of e40.
    d.  If the machine hangs, rebooting it will probably be sufficient, but
    one of the following people should be notified so they can check it:
        Bill Cattey X3-0140 926-5571
        Bruce Lewis
        Anne Lavin
        Dorothy Bowe
    (I'll have the others of us get our phone numbers to hotline as
    appropriate)
    e. If the machine explodes, there's a hot spare, eos2, which can be
    renamed eos and have its /etc/srvtab swapped:  Move /etc/srvtab to
    /etc/srvtab.eos2; move /etc/srvtab.eos to /etc/srvtab.
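
The srvtab swap in 1.e could be scripted so nobody has to remember the
order of the two moves at 3am.  What follows is only a sketch, not an
existing Athena tool: the function name swap_srvtab and the etc_dir
parameter are made up for illustration, and it does not cover renaming
the host itself.  It assumes /etc/srvtab.eos is already sitting on eos2.

```python
import os


def swap_srvtab(etc_dir="/etc"):
    """Swap in the eos srvtab on the hot spare (hypothetical helper).

    Mirrors step 1.e: srvtab -> srvtab.eos2, then srvtab.eos -> srvtab.
    """
    current = os.path.join(etc_dir, "srvtab")
    spare_copy = os.path.join(etc_dir, "srvtab.eos2")
    eos_copy = os.path.join(etc_dir, "srvtab.eos")

    # Check everything up front so a half-finished swap never
    # overwrites or loses a srvtab.
    for needed in (current, eos_copy):
        if not os.path.exists(needed):
            raise FileNotFoundError(needed)
    if os.path.exists(spare_copy):
        raise FileExistsError(spare_copy)

    os.rename(current, spare_copy)  # move /etc/srvtab to /etc/srvtab.eos2
    os.rename(eos_copy, current)    # move /etc/srvtab.eos to /etc/srvtab
```

Run as root on the spare, swap_srvtab() with no arguments works on /etc.
The up-front checks are there so that running it twice by accident fails
loudly instead of clobbering the saved eos2 srvtab.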

2.  We should find out if there's any way to keep "the 94 bug" from
happening too often.

3.  Is it possible to have some sort of monitor set up so that some of
us who know about turnin would receive a z-gram if the server went down?
 (I'll talk this over with sysdev people...)
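
To make item 3 concrete, here is a rough sketch of what such a monitor
might look like.  It is an illustration under assumptions, not a design:
it supposes the turnin server answers TCP connections on some known port
(the port number is not given here and would have to be supplied), and
the notify argument is a hypothetical callback; a real one might wrap
the Zephyr zwrite command.

```python
import socket


def server_is_up(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts
        # (a hung machine may simply never answer).
        return False


def check_and_notify(host, port, recipients, notify, timeout=5.0):
    """Probe the server once; call notify(person, message) if it is down.

    notify is a caller-supplied (hypothetical) function that delivers
    the message, for example by sending a z-gram.
    """
    if server_is_up(host, port, timeout):
        return True
    message = f"{host} is not answering on port {port}"
    for person in recipients:
        notify(person, message)
    return False
```

Something like check_and_notify("eos.mit.edu", port, ["wdc", "lavin"],
notify) run every few minutes from another machine would do it.  The
timeout matters: a machine showing "94" probably won't refuse the
connection, it just won't answer.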

4. There are software and procedural things that don't relate to
hotline that I will also initiate.  For example, the error message that
the user got did not really indicate the problem that occurred, nor
that the right action was to call hotline about a down server.  I'll
also get some info into cfyi so that we will more quickly know if there
is a problem with turnin.

----

Special thanks to Lucien Van Elsen, who took the olc question, and knew
to come to me to get it resolved.

-wdc
