[277] in Project_DB


Fwd: Needless disruption of Project DB last week by O&S.

daemon@ATHENA.MIT.EDU (Bill Cattey)
Mon Jan 12 15:32:25 1998

Date: Mon, 12 Jan 1998 15:32:14 -0500 (EST)
From: Bill Cattey <wdc@MIT.EDU>
To: project-db@MIT.EDU

I wanted to share with the PDB team the strongly but professionally
worded note I sent to O&S chastising them for breaking the Project
Database.  I plan to keep confidential which member of the team
suggested pulling Arachne out of W91.  (Although I'm inclined to
agree...)

-wdc

Date: Mon, 12 Jan 1998 15:20:04 -0500 (EST)
From: Bill Cattey <wdc@MIT.EDU>
To: o&s-facmgt@MIT.EDU
Subject: Needless disruption of Project DB last week by O&S.
Cc: rar@MIT.EDU, rferrara@MIT.EDU

Last week, on Thursday 8 January at 14:52, Brenda Gillingham reported
that the Project Database was not working.

I put Bruce Lewis onto finding out the cause, and after half a day of
effort, he reported the next day:

    As of 10:55 a.m. today, access is restored.

    For those curious about the technical reason for the outage: Someone
    (neither miki nor I) made /usr/local non-world-readable.  This caused
    scheme to fail to run, so the BRL process kept exiting and being
    restarted.

The consequences of this were serious: one of my primary developers lost
half a day, the Project Database was offline for a whole day, and during
part of that time an important external customer could not use it.

It is NOT considered usual UNIX system management practice to make
/usr/local unreadable to the world.
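
For anyone who wants to guard against a repeat, a check along these
lines could catch the condition before a service fails.  This is only a
minimal sketch: the function name world_accessible is hypothetical, and
it assumes that "world-readable" for a directory means the "other"
permission class has both the read and the search (execute) bit set.

```python
import os
import stat


def world_accessible(path):
    """Return True if `path` is a directory that "other" users can
    both list (read bit) and traverse (execute/search bit)."""
    mode = os.stat(path).st_mode
    needed = stat.S_IROTH | stat.S_IXOTH
    return stat.S_ISDIR(mode) and (mode & needed) == needed
```

Calling world_accessible("/usr/local") from a periodic job and raising
an alert when it returns False would be one way to use it; where and how
to alert is left open here.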

One of the Project Database developers suggested to me that this outage
was the inevitable result of too many cooks touching Arachne, and that
the solution was to take the Servicing of the OS and of Oracle away from
the Service Process personnel currently assigned to it, and instead let
the Delivery Personnel manage the whole system, perhaps even to site the
hardware in E40.

I think such a response is perhaps too strong.  But this outage
represents to me another example of the gap between what I expect and
what I get from W91.  Again, I fear that the solution will be to have
long meetings going through individual items of history, that never
correct the systemic fault: It seems like the way folks in E40 design
services just never aligns with the way folks in W91 take care of
services. SOMEBODY has got to do a paradigm shift!

-wdc


