[15] in Athena_Backup_System


Technology issues.....

daemon@ATHENA.MIT.EDU (Theodore Ts'o)
Fri Aug 26 20:26:49 1994

Date: Fri, 26 Aug 94 20:26:30 EDT
From: tytso@MIT.EDU (Theodore Ts'o)
To: athena-backup@MIT.EDU


Thinking over the design forum, I don't believe I did a good job
in presenting my objections to the technologies in use.  While we may
have stopped progress in a direction which I believe to be a mistake, we
also were not able to move forward in a productive fashion after the
meeting.  If anyone was hurt by my blunt statements, I apologize.  I let
myself get drawn into a technical flaming style which was not the most
productive way of moving this project along.  (I hope, though, that my
objections weren't a surprise.  Many of my comments I had made in
private discussions previously, especially those regarding the threads
and RDBMS package.)  In any case, let me try to do better here.

What we are facing are some fundamental software design issues; and it
is fair to say that with new technologies come risks.  Unfortunately,
our past track record with trying new technologies has simply not been
good.  Whether it be with Apollo NCS, or Ingres, or C++ (once, after the
original developer left, we were unable to recompile a working binary
from source, due to a mismatch of C++ libraries, so we have been reduced
to emacs'ing that binary in order to change pathnames), our initial
flirtations with those technologies have not always been smooth sailing.

So, when the proposal is to try mixing three new technologies together,
with which we as a group have had little to no experience, this should raise
some caution flags.  Especially, as I will note below, there may be
cases where we cannot mix them together at all.


	* Sun/ONC RPC

Although we may have spent the most time on this, I don't have as much
of a problem with this.  I believe that it would be better to define a
simple, extensible protocol by hand, and then code it; it takes up less
time, and if you're careful about how you code it, it can be more
maintainable in the end.  On the other hand, I don't believe that using
a simple RPC package, such as Sun RPC, will cause grievous harm.
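
As a rough illustration of what a simple, extensible hand-coded protocol
could look like, here is a minimal sketch.  It is not taken from any
existing Athena code; the names (msg_encode, msg_decode, MSG_HDR_LEN) and
the wire layout (a 2-byte type and a 4-byte length, both big-endian,
followed by the payload) are assumptions made for the example.  The
length field is what gives extensibility: a receiver that does not
recognize a type can still skip over the message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MSG_HDR_LEN 6   /* assumed layout: 2-byte type + 4-byte length */

/* Encode one message into out[].  Returns the number of bytes written,
   or 0 if out[] is too small to hold the header plus payload. */
size_t msg_encode(uint16_t type, const void *payload, uint32_t len,
                  unsigned char *out, size_t outlen)
{
    if (outlen < MSG_HDR_LEN || outlen - MSG_HDR_LEN < len)
        return 0;
    out[0] = (unsigned char)(type >> 8);
    out[1] = (unsigned char)(type & 0xff);
    out[2] = (unsigned char)(len >> 24);
    out[3] = (unsigned char)(len >> 16);
    out[4] = (unsigned char)(len >> 8);
    out[5] = (unsigned char)len;
    if (len > 0)
        memcpy(out + MSG_HDR_LEN, payload, len);
    return MSG_HDR_LEN + (size_t)len;
}

/* Decode the message at the start of in[].  On success, fills in the
   type, the payload length, and a pointer to the payload inside in[],
   and returns the total number of bytes consumed.  Returns 0 if in[]
   does not yet hold a complete message. */
size_t msg_decode(const unsigned char *in, size_t inlen,
                  uint16_t *type, uint32_t *len,
                  const unsigned char **payload)
{
    uint32_t n;

    if (inlen < MSG_HDR_LEN)
        return 0;
    n = ((uint32_t)in[2] << 24) | ((uint32_t)in[3] << 16) |
        ((uint32_t)in[4] << 8) | (uint32_t)in[5];
    if (inlen - MSG_HDR_LEN < n)
        return 0;
    *type = (uint16_t)((in[0] << 8) | in[1]);
    *len = n;
    *payload = in + MSG_HDR_LEN;
    return MSG_HDR_LEN + (size_t)n;
}
```

Because every byte is placed explicitly, code in this style does not
depend on the host's byte order or structure padding, which is a large
part of why such code tends to port easily.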

However, I believe that we will have to maintain that package as a first
class object, above and beyond a mere subroutine package which is used
by the backup system.  Whenever a new platform becomes available, we
should port the RPC to that new platform, even if it's not immediately
necessary.  We should make sure that we have at least one or two people
intimately familiar with the details of the RPC package, and do some
maintenance work on it to make sure that it remains portable and that it
actually works on all platforms.  We should have test suites that can be
automated, to make sure that it actually works.  

This is where it takes more work to use an RPC than to write some
specialized code --- it takes more work to maintain and run test suites
for a general piece of code.  And, as many software engineering texts
will tell us, most of the work in a software project is involved in
maintaining the code, not writing it the first time.  So, the time saved
by using an RPC package that is already available to us, rather than
writing specialized code for the backup protocol, is not the important
part.  The real killer is the amount of effort to maintain
(and port, in the future) the whole RPC package compared to some
well-written, specialized code.  (Why is it that we had no problems
trying to port the kadmin marshalling and unmarshalling code; yet Apollo
NCS has proven quite difficult to port?)

On the other hand, if we're actually going to use this RPC for other
projects, then perhaps it'd be worth it to actually put in the work to
adopt the RPC for use in our various subsystems in the future.  But
let's not kid ourselves --- it will end up being more work for the
backup system.  We can treat it as an investment, but it will be more
up-front work.

	* Threads

I don't believe threads would be a good idea for these reasons:

	1.  It's not a standard yet.  The Posix threads standard is at
draft 8 (or later; it's hard to keep track) at this point; the DCE
threads are at draft 4, and I believe that the OSF has indicated that
they will eventually move to whatever Posix standardizes on.  So if we
code to something that's based on draft 4, we may be in for a certain
amount of recoding work when the Posix threads standard finally settles
down.  Another possibility is that we don't recode; but that only works
if we are using a package that comes with source code, and we are
willing to support this legacy version of the threads interface
ourselves.

	2.  None of our libraries (Kerberos, Hesiod, com_err, dbm, etc.)
are thread-safe; when they were written, there was no assumption that
they had to be re-entrant.  If we just blindly call these libraries from
within a threaded environment, all sorts of strange bugs will crop up.
The same is true of the SunRPC library; it's almost certainly not
thread-safe either.

	3.  The run-time library for a relational database, if we use
one, is also almost certainly not thread-safe.  Consider that
information regarding transaction locking has to be stored in the
per-thread structure, which means that if the run-time library did use
threads, it would be tied to a particular threads package.

	4.  Threaded programs can be harder to maintain; especially by
people who don't know what they are doing.  The issues that come up with
threaded programs are identical to those that come up with kernel
programming --- and not everyone in our group has experience with kernel
code.  

	5.  Threaded programs are much harder to debug; now when a
program crashes, you have not just one PC and stack to look through, but
many.  And when you single-step through a threaded program, the
race conditions that were causing the program to crash no longer are a
problem, since running the threaded program under a debugger changes the
relative timing of the threads.  More important, we don't have debugging
tools that can deal with the multiple stacks of a threaded environment.
So picking through the carcass of a threaded program that has core dumped
may be a major chore.
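
To make point 2 concrete, here is a minimal sketch of the kind of
interface that is fine in a single-threaded program but unsafe once
threads are involved.  The functions host_label and host_label_r are
hypothetical, invented for this example; they are not taken from
Kerberos, Hesiod, or any other library named above.  The first returns a
pointer to a static buffer, so a second call (from the same thread or
another one) silently overwrites the first caller's result.  The second
has the caller supply the storage, which is the usual re-entrant fix.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical non-reentrant routine: the result lives in a single
   static buffer shared by every caller in the process. */
const char *host_label(int id)
{
    static char buf[32];

    snprintf(buf, sizeof buf, "host-%d", id);
    return buf;
}

/* Hypothetical re-entrant variant: the caller owns the buffer, so two
   threads can each format a label without interfering.  Returns 0 on
   success, or -1 if the label did not fit in buf. */
int host_label_r(int id, char *buf, size_t len)
{
    int n = snprintf(buf, len, "host-%d", id);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Converting a library from the first style to the second changes its
public interface, which is why making existing libraries thread-safe is
real work rather than a recompile.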

All of this does not mean that we should never do threads --- but it
does mean that if we want to do it, we will need to make a major
investment in (a) declaring one threads package as The Threads Package
for our group; (b) converting all of our libraries to be thread-safe
(which will tie them to that particular threads package); (c) acquiring
appropriate development tools (such as debuggers) for threaded programs,
and so on.

There are also alternatives to using threads; if there are only a
limited number of threads that we'd need, we can simply spawn some new
processes.  Under Unix, the text space is going to get shared anyway, so
it's only the data space which is not shared.  Another alternative is to
use a select loop, which is the traditional way to get around the
blocking I/O problem.  It's what Moira, Zephyr, and the X Toolkit use
instead of threads, and while it is more work to program in this
paradigm, it works just fine.
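
For readers who have not seen the select-loop style, here is a minimal
sketch of one pass of such a loop.  It is only an illustration, not how
Moira, Zephyr, or the X Toolkit are actually written; the names
service_once, count_bytes, and bytes_seen are made up for the example.
A single process waits on all of its descriptors at once and services
whichever ones are ready, so no single read can block the others.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

long bytes_seen = 0;    /* running total kept by the example handler */

/* Example handler: read whatever is available and count the bytes. */
void count_bytes(int fd)
{
    char buf[256];
    ssize_t k = read(fd, buf, sizeof buf);

    if (k > 0)
        bytes_seen += k;
}

/* Block until at least one of fds[0..n-1] is readable, then call
   on_readable for each ready descriptor.  Returns the number of
   descriptors serviced, or -1 if select() fails. */
int service_once(const int *fds, int n, void (*on_readable)(int fd))
{
    fd_set rset;
    int i, maxfd = -1, serviced = 0;

    FD_ZERO(&rset);
    for (i = 0; i < n; i++) {
        FD_SET(fds[i], &rset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }
    if (select(maxfd + 1, &rset, NULL, NULL, NULL) < 0)
        return -1;
    for (i = 0; i < n; i++) {
        if (FD_ISSET(fds[i], &rset)) {
            on_readable(fds[i]);
            serviced++;
        }
    }
    return serviced;
}
```

A server would call service_once inside a `for (;;)` loop, rebuilding
the descriptor list as connections come and go.  The extra programming
work mentioned above comes from having to keep per-connection state in
explicit data structures instead of on a thread's stack.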

	* A Relational Database

The big question here is the costs associated with the relational
database.  There were assertions made at the design forum that our
experiences with Ingres are atypical.  That may be so, or it may not.
Either way, I think it is clear we need to gather more facts regarding
this.

When Jeff talked about putting the backup system through an Acceptance
process, I think one of the things that point should denote is that
from then on the project ceases to require Development resources, at
least as far as day-to-day maintenance goes.  Given the size of
our group, it seems clear to me that once running, it should not require
any assistance from us to keep it running and stable.

The problem is that large scale relational databases are aimed at large
shops which can afford to have a full-time database manager, and a team
of database programmers to keep the database up and running.  The
question is whether any of the database vendors can deliver us a system
which we can set up and install, and then not require a programmer to
keep tabs on it for periodic maintenance.  I'm worried that all of the
databases will have something like the "RS/6000 disease", where IBM
assumes that each RS/6000 has a maintainer, just like in the mainframe
world.  We need to make sure that Oracle and/or Sybase have not made
similar assumptions regarding the need for a database manager to keep
watch over their databases.

For example, from talking to Gerry Issacson, for the CAO database, ASD
has 2 or 3 people in their database group who maintain it and one
other database; and the customer, the Comptroller's office, employs one
full-time programmer that works only on the CAO database.  We can check
with the Comptroller and ASD to see if this is really the case, or if
I've received an over-simplified description from Gerry; but it is
certainly clear that we do not want to run under such a model.

One alternate solution is that we could find out how much ASD would be
willing to charge (yes, charge us) for keeping the database under
professional care and feeding.  This might include updating to the
latest version of Oracle each time Oracle releases a new version, every
six months or so --- if that is the only way we can get support --- and
dealing with new bugs that crop up in each new release.  (Maybe Ingres
is atypical, but ask Miki about all of the problems she's had with each
new jumbo kernel patch from Solaris fixing some old bugs, and
creating some new ones; if the only way we can get support is to run the
latest version, we'd be really stuck, and be forced to always upgrade.)



The bottom line isn't really a technical one, but one of trying to
understand the costs associated with this design choice.  One of the
things which I think is really broken about our development process
(such as it is) is that it doesn't take into account the costs associated
with the system.  We ask our "customers" for their requirements, but we
don't tell them how much their requirements will cost; and often, the
"customers" who are specifying the requirements aren't the ones footing
the bill, anyway.  This is a broken accountability loop that has all
sorts of problems.  The way it affects us here is that (having nowhere
else to be expressed) it pops up in our design discussions, as

	(a) we want to pursue a path that results in the least amount of
		work for us, as an implicit part of the design.  In no
		case do we want to do an "unreasonable" amount of work.
		(Yes, what is reasonable and unreasonable is not well
		defined; each of us has different ideas of what is
		reasonable.)

	(b) Jeff wants a design which doesn't result in an
		implementation which requires constant care and feeding.
		(After all, a design that requires constant care and
		feeding isn't a problem, as long as you can afford
		the necessary EFT's to do said care and feeding.)
		
An unanswered question is --- what if these desires (which are really
driven by cost containment issues) are incompatible with the set of
requirements which Kim and her group articulated?  What requirements get
dropped, and what requirements must be implemented?

In any case, it is important that the design address both the short term
costs (i.e., our implementation time) and the long term costs (our
maintenance strategy; what does Oracle or Sybase really require for
day to day maintenance?).  Fortunately, people in ASD should be able to
help us gather data on the long-term costs issue.  Since this is my main
objection with using a relational database, if we can get assurance that
in fact Oracle really is different:

	* that updating to new versions is painless (even if we are
		using Embedded SQL; Ingres has changed the ESQL
		interface on us before); 

	* or, that we don't need to upgrade to the latest version every
		single time to get support; 

	* that Oracle will actually be able to give us rational support
		if we have a problem with our database;

	* that the yearly maintenance and upgrade fees are reasonable;

	* that it really is reasonable to develop a system using Oracle
		and once tested and accepted, expect it to require no
		(or at most minimal) developer attention;

then my objections will be satisfied.  

Of course, an alternate solution path is if Operations is
willing to pay what it will really cost to run this system on a
long-term basis, either by paying ASD, or creating a new database
programmer slot in their group, if that is what is necessary.  

Or yet again, if we get information from ASD that it really does require
someone to keep watch over the database, we could be willing to bet that
it really won't cost that much to run the system.  If we are wrong, the
cost to us as the Development group is that we might be stuck forever
maintaining the backup system, the way we maintain Moira.

							- Ted
