
How to design a specialized database access routine


Date: Sun, 28 Aug 94 14:17:36 -0400
From: Theodore Ts'o <tytso@MIT.EDU>
To: athena-backup@MIT.EDU


In one or two of the discussions after the design forum, I heard more
than one developer express the opinion that the requirement that the
backup system recover from an unclean shutdown or power-off without
intervention by a wizard meant that we had to use a full-bore,
commercial relational database package.

This is not true, and I'm going to demonstrate this by outlining a
brief design showing how it might be done.  It's really not that hard!

1)  For each "table" that you might have designed in a relational
database, create a C structure.  (In the case of the backup system,
there will probably be roughly three such tables: one for keeping
track of volumes, one for keeping track of dumpsets, and one for
keeping track of tapes.)  This is something you would probably have
anyway if you were using a relational database.
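
Purely as an illustration (the structure layouts and field names here
are invented for this example, not an agreed-upon schema), the three
tables might turn into something like:

/* One C structure per "table".  These are hypothetical layouts. */

struct volume_rec {
    char  name[64];         /* primary key: volume name */
    char  server[64];       /* fileserver currently holding it */
    long  last_dump_time;   /* time of the most recent dump */
    int   dumpset_id;       /* dumpset the last dump went into */
};

struct dumpset_rec {
    int   dumpset_id;       /* primary key */
    long  created;          /* creation time */
    int   tape_serial;      /* tape the dumpset was written to */
};

struct tape_rec {
    int   serial;           /* primary key: tape serial number */
    char  label[32];        /* human-readable tape label */
    int   use_count;        /* number of times written */
};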


2) For each of these C structures, create encoding and decoding routines
that encode and decode the C structures into a binary blob of data that
can be stuffed into a DBM database.  The binary blob of data should have
a major/minor version number at the beginning.  The minor version
number will allow you to extend the C structure later, and will let
the decoding routines recognize whether a given blob contains the
extended fields or not.

(This is actually better than a professional database, where the only
way you can extend a table is to create a temporary table and copy all
the records from the old table to the temporary table.  Then, since
SQL gives you no rename-table operation, you must drop the old table,
recreate it with the new fields, and copy all the records from the
temporary table back into the new table.  By putting in version
numbers, we will be able to add a new field to the database without
doing all of this needless copying, as we would have to with Ingres or
Oracle or Sybase.)
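
To make this concrete, here is a rough sketch of an encoder/decoder
pair for the hypothetical volume_rec structure from step 1.  It is
only a sketch: it assumes fixed-size fields, does no byte-swapping,
and simply appends the field introduced by a later minor version at
the end of the blob.

#include <string.h>

#define VOL_MAJOR 1
#define VOL_MINOR 1             /* minor 1 added dumpset_id */

/* Encode a volume_rec into buf; returns the number of bytes used. */
int encode_volume(const struct volume_rec *v, char *buf)
{
    char *p = buf;

    *p++ = VOL_MAJOR;
    *p++ = VOL_MINOR;
    memcpy(p, v->name, sizeof(v->name));        p += sizeof(v->name);
    memcpy(p, v->server, sizeof(v->server));    p += sizeof(v->server);
    memcpy(p, &v->last_dump_time, sizeof(v->last_dump_time));
    p += sizeof(v->last_dump_time);
    /* Field added in minor version 1: */
    memcpy(p, &v->dumpset_id, sizeof(v->dumpset_id));
    p += sizeof(v->dumpset_id);
    return p - buf;
}

/* Decode a blob; returns 0 on success, -1 on an incompatible blob. */
int decode_volume(const char *buf, int len, struct volume_rec *v)
{
    const char *p = buf;
    int major, minor;

    if (len < 2)
        return -1;
    major = *p++;
    minor = *p++;
    if (major != VOL_MAJOR)
        return -1;                              /* incompatible format */
    memset(v, 0, sizeof(*v));
    memcpy(v->name, p, sizeof(v->name));        p += sizeof(v->name);
    memcpy(v->server, p, sizeof(v->server));    p += sizeof(v->server);
    memcpy(&v->last_dump_time, p, sizeof(v->last_dump_time));
    p += sizeof(v->last_dump_time);
    if (minor >= 1) {                           /* blob has the new field */
        memcpy(&v->dumpset_id, p, sizeof(v->dumpset_id));
        p += sizeof(v->dumpset_id);
    }
    return 0;
}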

Of course, these encoding and decoding routines should be rigorously
tested.  Since what they do is take a C structure and turn it into a
binary blob of data, it is relatively easy to design both black box and
glass box tests to make sure the encoding and decoding routines really
are robust.  

Note that it is much, much easier to test these specialized routines for
correctness than it is to test routines that try to be general (such as
those used in professional databases).  Hence, it is quite reasonable
for us to expect our specialized routines to be at least as robust
and bug-free as a professional general-purpose database, since they
will likely be much smaller and much easier to verify for correctness.


3)  Next, create subroutines that are used to delete, insert, and update
records for each C structure.  For each data structure, pick one field
which is guaranteed to be unique (e.g., volume name, tapeset name,
tape serial number).  That field will be the primary key for the
data structure, and it will also be the key for the DBM database.  There
will also be secondary indexes for those fields that require them.  Each
secondary index will be implemented by a DBM database, where the key is
the secondary key, and the data field is the primary key.

All of the routines that modify the on-disk DBM database will do the
following.  First, they will write a record to the journal log.  The
journal log is a flat-ascii text file, where each record is a line.
The database routines open the journal log in append and O_SYNC mode.
The record describes the change which the add/delete/modify routine is
about to make to the DBM databases.  The general form of the record is:

ADD	tapeset		"M23"  12  "string field 2" 23 "string field 3" ....
MODIFY	tapeset		"M23"  12  "string field 2" 23 "string field 3" ....
DELETE	tapeset		"M23"

As soon as this record is written to the journal log, it is
immediately flushed to disk; this flush is the commit point for the
modification.

The routines then proceed to modify all of the relevant database files.
In the case of a record addition, this means adding the structure to
the primary DBM database, and then adding the appropriate entries to
each of the secondary index databases.
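
Here is a sketch of what the journaled add routine for volumes might
look like, assuming the classic ndbm interface and the hypothetical
volume_rec structure and encoder sketched above (the journal line
format and the database handles are likewise made up for the example):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <ndbm.h>

extern int journal_fd;          /* journal log file descriptor */
extern DBM *vol_db;             /* primary DBM, keyed by volume name */
extern DBM *vol_by_server_db;   /* secondary index: server -> name */

/* Open the journal so that every write() is synchronous; once
 * write() returns, the record is on the disk. */
int open_journal(const char *path)
{
    return open(path, O_WRONLY | O_APPEND | O_CREAT | O_SYNC, 0600);
}

int add_volume(const struct volume_rec *v)
{
    char line[512], blob[512];
    datum key, val;
    int n;

    /* 1.  Write the journal record; this is the commit point. */
    n = snprintf(line, sizeof(line), "ADD\tvolume\t\"%s\" \"%s\" %ld %d\n",
                 v->name, v->server, v->last_dump_time, v->dumpset_id);
    if (write(journal_fd, line, n) != n)
        return -1;

    /* 2.  Update the primary DBM database, keyed by volume name. */
    key.dptr = (char *) v->name;
    key.dsize = strlen(v->name) + 1;
    val.dptr = blob;
    val.dsize = encode_volume(v, blob);
    if (dbm_store(vol_db, key, val, DBM_REPLACE) < 0)
        return -1;

    /* 3.  Update the secondary index (server name -> volume name).
     * A real index would have to cope with several volumes living
     * on one server; this sketch ignores that. */
    key.dptr = (char *) v->server;
    key.dsize = strlen(v->server) + 1;
    val.dptr = (char *) v->name;
    val.dsize = strlen(v->name) + 1;
    return dbm_store(vol_by_server_db, key, val, DBM_REPLACE);
}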


4)  Periodically, the database is checkpointed as follows.  First, all
database modifications are prohibited during the checkpointing process.
Second, all of the modifications to the DBM databases are flushed to
disk.  (Most DBM implementations cache information in memory, and we
need to make sure that the disk copies are stable.)   Next, all of the
relevant database files are copied to a new directory.  Then, a pointer
file is created which records the current position in the journal log;
this reflects the fact that the checkpoint covers all of the changes up
to that point in the journal log.

Finally, we execute the checkpoint swap operation; if we crash while a
checkpoint swap is happening, human intervention will be required to
complete the checkpoint swap.  The swap operation is simply one of
renaming the current checkpoint to "checkpoint.old", and renaming the
new checkpoint as the current checkpoint.  
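
The swap itself can be a handful of lines; this sketch assumes the new
checkpoint was built in a directory called "checkpoint.new" and the
live one is called "checkpoint" (the names are arbitrary):

#include <stdio.h>

/* If we die between the two renames, an operator has to finish the
 * swap by hand; everywhere else, recovery is automatic. */
int swap_checkpoint(void)
{
    if (rename("checkpoint", "checkpoint.old") < 0)
        return -1;
    return rename("checkpoint.new", "checkpoint");
}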

If the backup system ever suffers from a catastrophic crash, it doesn't
matter if the main DBM files are completely trashed.  The recovery
process (which can happen automatically) is to copy the checkpoint DBM
files over the main DBM files.  The recovery process then runs over
the journal log, starting from the place indicated by the pointer file,
to bring the database up to the last committed operation (remember, an
operation is defined as committed as soon as it is written out to the
journal file; this replay is exactly why the commit point is defined
that way).
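
The replay loop itself is simple.  This sketch assumes a helper that
reads the saved offset out of the pointer file, and a routine,
apply_request(), which parses one journal line and performs the
corresponding DBM updates (the same routine the live server uses for
incoming modification requests; see step 6).  Both names are invented
for the example:

#include <stdio.h>

extern long read_pointer_file(void);          /* offset saved at checkpoint */
extern int  apply_request(const char *line);  /* shared with the live server */

int replay_journal(void)
{
    FILE *jf = fopen("journal", "r");
    char line[1024];

    if (jf == NULL)
        return -1;
    if (fseek(jf, read_pointer_file(), SEEK_SET) != 0) {
        fclose(jf);
        return -1;
    }
    while (fgets(line, sizeof(line), jf) != NULL) {
        if (apply_request(line) < 0) {
            fclose(jf);
            return -1;
        }
    }
    fclose(jf);
    return 0;
}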

(For an example of how this works, observe that if the Moira database
ever crashes, we can take the previous night's backup and reload it
into Ingres.  We can then take the journal log file from Moira,
massage it a little, and feed the result straight into mrtest to
update the database with everything since the last full backup.
Unfortunately, Moira's journal recovery process is not fully automatic,
but then again it was not designed as the primary journal recovery
system; Ingres supposedly has a journalling system at a lower layer,
which is supposed to be automatic.  Too bad it doesn't always work. :-) )


5)  Periodically, we will want to roll over the journal log.  To do
this, we simply do a checkpoint operation, but instead of setting the
pointer file to the current location, we set it to zero, and move the
journal log out of the way.
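
In code, the rollover is just a variation on the checkpoint; this
sketch assumes a checkpoint() routine, as in step 4, which takes the
journal offset to record in the pointer file:

#include <stdio.h>

extern int checkpoint(long journal_offset);   /* hypothetical, from step 4 */

int roll_journal(void)
{
    if (checkpoint(0L) < 0)                   /* pointer file gets offset 0 */
        return -1;
    return rename("journal", "journal.old");  /* keep the old log around */
}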

We will likely want to keep the old journal logs, since they can be
extremely handy for all sorts of operations.  For example, the journal
log can serve as a wonderful source of test cases.  It is also a
compact record of all changes made to the database.  Despite the fact
that Moira runs on a full relational database (which keeps its own
internal journal log), Moira keeps its own application-level journal
log, simply because having an accessible journal log opens up all
sorts of useful analysis and test-case possibilities.


6)  Export your set of database access routines as their own process.
I would recommend not relying on the RPC compiler (if you use an RPC
system) to marshal and unmarshal the data structures to be stored;
instead, write versions of the encoders and decoders which translate
the C structure to an ASCII text line, with each field separated by a
space, and strings surrounded by double quotes (with appropriate
backslash quoting where necessary).  The advantage of doing this is
that you can use the identical code for running the journal log as you
use for accepting database modification requests.  This way, you only
have to test this one interface extremely well.  The code that decodes
a database modification request will be one of the more heavily
exercised portions of the system; since the same path is used to run
the journal log, you can be that much more confident that the
checkpoint recovery code will work well, without having to do extra
testing.
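
For instance, the formatter for an "add volume" request might be
nothing more than this (again using the hypothetical volume_rec from
step 1, and the same line format as the journal sketch in step 3); the
one routine produces both the request line sent to the database
process and the journal line:

#include <stdio.h>

/* A real version would also backslash-escape any embedded quotes. */
int format_add_volume(const struct volume_rec *v, char *buf, int buflen)
{
    return snprintf(buf, buflen, "ADD\tvolume\t\"%s\" \"%s\" %ld %d\n",
                    v->name, v->server, v->last_dump_time, v->dumpset_id);
}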


--------------------------------------------------

Anyway, this is a basic design outline of how you might design a
fully-recoverable, specialized database system, without needing to
resort to a full-bore Ingres, Oracle, or Sybase system.  I include it
here not because I think that this is the definitive way that it can be
done --- I've made various design choices in the above demonstration
that perhaps should be done differently --- but to prove that it CAN be
done, and without adding *that* much more complexity.  If the necessary
routines are isolated behind an appropriate abstraction barrier, the
testing and development costs really aren't that bad.

What do you gain by doing this?  Well, a couple of things.  First of
all, I would guess that a specialized database system is probably
smaller, in terms of lines of code, than a fully generalized database
system, by at least two orders of magnitude.  The reason for this?
The specialized system requires more work up-front by the programmer;
a generalized database system has to redo all of that work inside the
computer, every time you execute a SQL query.  And because the
computer is trying to do the work of the human programmer, the
generalized database system carries a great deal of extra complexity.

The problem is that this complexity leads to slower execution, bugs,
and maintenance headaches.  The theory is that these maintenance
headaches aren't our problem, since the database vendor is supposed to
take care of them for us.  In practice, however, this doesn't work
out, for two reasons.

First, the number of bugs grows as at least the square or the cube of
the complexity, so there are a lot more bugs for the vendor to fix,
even if they can afford to put two orders of magnitude more people to
work on the database.

Secondly, due to marketplace pressures, the vendors are constantly
pressured to release new versions of the database.  These new versions
of the database may fix some bugs, but marketplace pressures dictate
that the vendors must constantly be adding new features to the database.
These features may not be ones that we need, at least not for our
application.  (For example, the latest buzzword feature is
object-oriented databases.  Previously, the latest buzzword was
distributed databases, which basically meant databases sending passwords
stored on hard disks over the network in the clear to authenticate the
equivalent of an rsh connection to an SQL monitor.  Sigh...)  The
problem is that due to the inherent complexity of these commercial
databases, it is extremely easy for the introduction of these new
features to cause unintentional bugs in other parts of the system.
Hence, even if we don't need the new features, and don't use them, those
features may cause new bugs in portions of the database which we *do*
use.

Now, one solution might be simply not to upgrade the database
software, and to keep running the old version.  But if we have a
problem, the vendor is not likely to ship us a corrected version of
the database which fixes only that one bug and keeps the rest of the
database unchanged.  Instead, they will likely demand that we upgrade
to their latest database.  If 2 or 3 years have elapsed since we last
upgraded, this may be quite painful, since they may have made various
API changes since then.

Alternatively, if we control the sources ourselves, we can freeze
development on the database routines once the basic functionality we
require has been implemented.  The argument could be made that this is
no different from freezing on a certain version of a commercial database
system --- in both cases, we have no support: our homebrew solution,
since we wrote it ourselves, and the commercial version, because we are
using an old version of it.  This argument is flawed, though, since the
difference is that (a) our homebrew version will be significantly simpler
than the commercial, general database, and (b) we have sources to our
homebrew version, so we can fix it ourselves.  Since the specialized
version is much simpler, we will likely *be* able to fix it ourselves,
if a bug crops up.

The bottom line is that if we want to develop a system which we can
write, and then essentially forget about, it is best if we do it using
specialized routines coded in C.  If we are willing to work
continuously on the system as the vendor comes up with new versions,
then we can use a solution provided by a vendor.  This observation is
generally true whether it applies to databases or operating systems.
Sometimes, it is
appropriate to choose a vendor-supplied solution.  For example, in the
OS case, we are using vendor-supplied OS's, instead of trying to write
our own.  This was obviously the right choice.  One of the prices we pay
is that we are always forced to upgrade to the new OS version if we want
support, or if we want specific bugs fixed.  Yet when we upgrade, legacy
software which we haven't touched in three years might break, thanks to
gratuitous vendor changes.   

In the OS case, the deciding factor was the extra constraint that we
need to be able to support third party packages.  However, in the
database case, we don't have this extra constraint.  So, we have to ask
whether the costs of using the vendor-provided solution are worth it.
Doing it ourselves does mean more up-front work; but I
strongly believe that it will significantly reduce the long-term costs,
both in terms of support contracts and our own required programming
maintenance, as we move into the future.

						- Ted


