[95] in Information Retrieval


Library 2000 Annual Report

daemon@ATHENA.MIT.EDU (Tim McGovern)
Thu Jun 18 11:13:51 1992

Date: Thu, 18 Jun 92 11:15:26 EST
From: tjm@Eagle.MIT.EDU (Tim McGovern)
To: elibdev@MIT.EDU


------- Forwarded Message

From: ganderso@Athena.MIT.EDU
To: dlicc@MIT.EDU
Date: Tue, 16 Jun 92 11:43:31 EDT

To:   Library Steering Committee
cc:   DLICC

From:  Greg
Date:  16 June 92
Subject:  Annual Report on Library 2000
-------------------------------------------------
Jerry copied Marilyn and me for the annual report which he
submitted for his Library 2000 project.

Greg
-------------------------------------------------
From: Jerome H Saltzer <Saltzer@MIT.EDU>
Subject: Library 2000 annual progress report
Date: Fri, 12 Jun 92 11:37:01 -0700
Sender: saltzer@src.dec.com

Greg and Marilyn,

Each research group at the Lab for Computer Science
prepares an annual progress report to coincide with an
annual meeting to be held next week.  I'm going to miss
the meeting, but for your information, here is the
contribution we are making to the report.

If there is anyone else on campus who would be
interested in seeing this report, please feel free to
pass it along.

                                Jerry

------------------------------------------------------
Library 2000
Annual Progress Report, July 1, 1991--June 30, 1992


Academic Staff

    Jerome H. Saltzer

Research Staff

    Mitchell N. Charity

Undergraduate Students

    Jeremy A. Hylton
    Robert C. Miller
    Arthur W. Min
    Manish D. Muzumdar
    Ronald Weiss (Brandeis University)

Reading Room Staff

    Paula Mickevich
    Carol A. Nicolora
    Maria Sensale
    Rebecca J. Soble

Support Staff

    Lisa M. Kelly


*********************************************************************

LIBRARY 2000

Library 2000 is a new research project, exploring the system
implications of large on-line storage.  The method of the project is
pragmatic: to develop a prototype of an on-line electronic library
using the technology and approaches expected to be feasible in the
year 2000.  Initial support for Library 2000 has come from grants from
the Digital Equipment Corporation and the IBM Corporation, and
uncommitted funds of the Laboratory for Computer Science.

The basic hypothesis of the project is that the technology of on-line
storage, display, and communications will, by the year 2000, make it
economically possible to place the entire contents of a library
on-line and accessible from computer workstations located anywhere.
The goal of Library 2000 is to understand and try out the system
engineering required to fabricate such a future library system.  The
project's vision is that one will be able to browse any book, journal,
paper, thesis, or report in the library using a standard office
desktop computer, and follow citations by pointing--the thing selected
should pop up immediately in an adjacent window.

The key technology required to realize this vision is low-cost disk
storage.  In the last decade, mass-produced disk storage has fallen in
cost/bit by a factor of 1000.  Projections of current research and
development activity suggest that disks in volume production by the
year 2000 will fall in cost/bit by at least another factor of 100.
This change of five orders of magnitude calls for a complete
rethinking both of what is economically feasible, and also of
engineering for effective use.  A back-of-the-envelope estimate
suggests that the cost of the magnetic disk storage needed to hold
scanned images of all the books and serials in a large library will,
in the year 2000, be about equal to the annual budget for that
library.  When floor space is considered, the cost of holding images
on magnetic disk will be substantially less than the cost of holding
the paper form.
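
The five-orders-of-magnitude figure follows directly from the two
factors quoted above.  A minimal check, using only those two numbers
(the variable names are illustrative, not part of the project):

```python
import math

# Both factors are the ones quoted in the paragraph above.
past_decade_factor = 1000   # drop in cost/bit over the last decade
projected_factor = 100      # further drop projected by the year 2000

combined_factor = past_decade_factor * projected_factor
orders_of_magnitude = round(math.log10(combined_factor))

print(combined_factor, orders_of_magnitude)
```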

Three other technologies are also advancing at a pace that is likely
to provide effective support to an on-line electronic library system:

- Data communications.  Medium-bandwidth networks already in place, in
the form of campus networks and the NREN, provide the bandwidth needed
to allow access to a library from a distance.  Planned
higher-bandwidth networks should change transmission of scanned images
from the occasional to the routine.

- Display technology.  Multi-plane megapel displays are becoming
common on medium-cost engineering workstations.  By the year 2000,
these displays will probably be standard on the lowest-cost personal
computers.  (The grey-scale capability that comes with a multi-plane
display is critical to make reading of scanned images acceptable.)

- System organization.  The client/server model of organizing
distributed computation has matured sufficiently that it appears to be
the method of choice in designing an electronic library system.

Our research problem is not to invent or develop any of these four
technologies, but rather to work out the system engineering required
to harness them in a usable form.  The engineering and deployment of
large-scale systems is always accompanied by traps and surprises
beyond those apparent in the component technologies.  Finding things,
linking things together (especially across the network), keeping the
whole system reliable, and making the system last for a period of
decades will all require new ideas.



ARCHITECTURE

The overall system architecture of Library 2000 consists of a
workstation client that is responsible for matters of presentation,
user interaction, and usage coordination, together with a multiplicity
of storage servers and of index servers.  Storage servers hold the raw
information, in at least two forms: scanned bitmap and ASCII text.
Index servers provide indices of the ASCII text to allow searching.
The overall paradigm of use is deceptively simple: a user expresses
interest in some item to the workstation client, the client dispatches
a query to one or more index servers, and if the query is successful,
uses the information returned by an index server to request items
from one or more of the storage servers.
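
The paradigm of use described above can be sketched in a few lines.
This is only an illustration of the flow from client to index server
to storage server, not the project's code; the class names, the method
names, and the identifier "tr-001" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IndexServer:
    """Maps search words to identifiers of the items that contain them."""
    postings: dict = field(default_factory=dict)

    def add(self, item_id, text):
        for word in text.lower().split():
            self.postings.setdefault(word, set()).add(item_id)

    def query(self, word):
        return sorted(self.postings.get(word.lower(), set()))


@dataclass
class StorageServer:
    """Holds the raw items, keyed by identifier."""
    items: dict = field(default_factory=dict)

    def fetch(self, item_id):
        return self.items.get(item_id)


def client_lookup(word, index_servers, storage_servers):
    """Query every index server, then request each hit from storage."""
    results = {}
    for index in index_servers:
        for item_id in index.query(word):
            for store in storage_servers:
                item = store.fetch(item_id)
                if item is not None:
                    results[item_id] = item
                    break  # first storage server that holds the item wins
    return results


# Hypothetical sample data.
index = IndexServer()
store = StorageServer()
store.items["tr-001"] = "scanned pages of a technical report"
index.add("tr-001", "distributed library storage")

found = client_lookup("library", [index], [store])
missing = client_lookup("compilers", [index], [store])
print(found, missing)
```

Because the index server returns only identifiers, the index and
storage services in this sketch can be replaced independently, which
is the modularity the architecture is after.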

This architecture is appealing because it allows modular, independent,
competitive design and replacement of the user interface, the index
services, and the storage services.  It also permits a uniform user
interface to a wide variety of different library collections as well
as to personal files, mail, or other non-library databases.  The
architecture is simple and the functional boundaries match both
physical and administrative boundaries.  Finally, traditional library
back office operations such as cataloguing, circulation, acquisition,
journal control, etc., fit in gracefully as additional clients and
servers.

Two technology observations interact to produce an interesting
architectural consequence.  For many years the cost of magnetic disk
memory has been lower than that of random-access memory by a factor of
between 10 and 100.  Similarly, the amount of space required for
(compressed) scanned images is between 10 and 100 times larger than
that required for the corresponding ASCII text.  The architectural
consequence is that if one spends about the same number of dollars on
RAM as on disk, there will be space in RAM for a complete index of the
words stored in page-image form on the disk.  This observation leads
to an interest in index-preparation and searching algorithms that are
optimized to operate directly in large random-access memory, even for
very large databases.
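
The way the two ratios cancel can be made concrete.  The sketch below
assumes that the index is about the size of the ASCII text it covers;
the budget and the RAM price are made-up illustrative figures, and only
the 10-to-100 ranges come from the text.

```python
def index_fits_in_ram(budget, ram_cost_per_mb, disk_cost_ratio, image_to_text_ratio):
    """Spend `budget` on RAM and the same amount on disk.

    disk_cost_ratio: how many times less a megabyte of disk costs than RAM.
    image_to_text_ratio: how many times larger page images are than their text.
    Returns (ram_mb, text_mb, fits).
    """
    ram_mb = budget / ram_cost_per_mb
    disk_mb = ram_mb * disk_cost_ratio        # same dollars buy more disk
    text_mb = disk_mb / image_to_text_ratio   # text for the images on that disk
    return ram_mb, text_mb, text_mb <= ram_mb


# Hypothetical figures: 10,000 dollars each way, RAM at 50 dollars per megabyte.
balanced = index_fits_in_ram(10000, 50, 30, 30)
disk_heavy = index_fits_in_ram(10000, 50, 100, 10)
print(balanced, disk_heavy)
```

The text fits whenever the disk/RAM cost ratio does not exceed the
image/text size ratio; since both lie in the same range, the two
roughly cancel.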

The network protocol that connects the presentation client with the
index and storage servers is stateless, to achieve robustness in the
face of network and server failures.  It uses unique identifiers, to
allow separation of the index and the storage services.  Storage
servers are replicated with wide geographical diversity, to counter
threats to persistence over time periods measured in decades.
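
A minimal sketch of why statelessness and unique identifiers help: if
every request carries everything needed to answer it, a client can
simply retry the same request against another replica.  The request
layout, class names, and identifier below are hypothetical and are not
the project's protocol.

```python
class ReplicaUnavailable(Exception):
    """Raised when a replica cannot be reached (hypothetical)."""


class StorageReplica:
    def __init__(self, name, items, up=True):
        self.name = name
        self.items = items
        self.up = up

    def handle(self, request):
        # Stateless: the reply depends only on the request itself,
        # never on earlier requests from the same client.
        if not self.up:
            raise ReplicaUnavailable(self.name)
        return self.items[request["id"]]


def fetch(item_id, replicas):
    """Send the same self-contained request to each replica in turn."""
    request = {"op": "fetch", "id": item_id}
    for replica in replicas:
        try:
            return replica.handle(request)
        except ReplicaUnavailable:
            continue  # no session to rebuild; just try the next replica
    raise ReplicaUnavailable("no replica reachable for " + item_id)


# Hypothetical sample data: the first replica is down.
items = {"doc-42": "page image bytes"}
replicas = [StorageReplica("site-a", items, up=False),
            StorageReplica("site-b", items)]
result = fetch("doc-42", replicas)
print(result)
```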


THE PROTOTYPE

The general research strategy of Library 2000 is to build a small but
extendable prototype system, stock it with live data, see how it
works, and then iterate the design at successively larger scales.  The
first prototype was placed in service during the summer of 1991.  It
involves a very simple presentation client accessible by telnet from
anywhere in the Internet, and a combined index/storage server that
contains the catalog card records of the joint reading room of the
M.I.T. Laboratory for Computer Science and the M.I.T. Artificial
Intelligence Laboratory.  This card catalog has about 15,000 records.
The prototype also contained a second collection of 16,000 library
card records, on the subject of computer science, extracted from the
M.I.T. library catalog.  During the fall, Mitchell Charity added a
third collection, consisting of all available abstracts of about 2000
LCS and AI technical reports and memos.

This online catalog has now become the primary catalog of the LCS/AI
reading room; the paper catalog was closed in January, 1991.  Several
support systems, including a simple emacs-based cataloguing system,
were also developed, so that operations of the reading room could be
transferred entirely to the new online catalog.  A shelf-reading
project is underway, and as of the end of the reporting year all
proceedings, all books received since 1980, and all journals are
fully catalogued and checked.  All technical reports received since
January 1991 are also catalogued; about half of those received before
that date are also represented in the catalog.  The staff of the
reading room reports that circulation of technical reports has
increased markedly since the on-line catalog became available.

During the fall, the entire system migrated from a borrowed VAX server
to a set of three replicated IBM RS-6000 servers, with a view towards
scaling up to much larger quantities of data.  The index was recoded
from Perl (an interpreted language) to C, and it is now in the midst
of a complete design revision.  By taking advantage of the large
RAM available on the RS-6000s, this revision is intended to provide
the high performance needed to allow indexing of the 700,000-record
M.I.T. library catalog.  During the spring, this new index server was
tested on a 25% sample of the M.I.T. catalog with good results; the
complete index system should be ready to deploy some time this summer.
Mitchell Charity did most of the work on the prototype design and
implementation.

Three related projects were completed this year.  First was a parser
for library standard (MARC) records, by undergraduate Art Min.  Second
was a preliminary prototype of an X-based user interface to the online
catalog, initiated by Mitchell Charity with further development by
Jeremy Hylton.  The third was a completely different user interface
based on the Organization Engine, an information organizing system in
use at the Digital Equipment Corporation Cambridge Research
Laboratory.  This interface was developed as an undergraduate thesis
at Brandeis University by Ron Weiss.
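
For readers unfamiliar with MARC, the sketch below shows the general
shape of the interchange layout such a parser has to handle: a 24-byte
leader, a directory of 12-byte entries, and variable-length fields.  It
is not the parser written for the project; the function names and the
sample record are hypothetical, and character-set and error handling
are omitted.

```python
FIELD_END = b"\x1e"    # terminates the directory and every field
RECORD_END = b"\x1d"   # terminates the record
SUBFIELD = b"\x1f"     # introduces each subfield code


def parse_marc(record):
    """Return a list of (tag, value) pairs from one MARC record."""
    base = int(record[12:17])            # leader bytes 12-16: start of field data
    directory = record[24:base - 1]      # drop the directory terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])         # includes the field terminator
        start = int(entry[7:12])         # offset relative to `base`
        data = record[base + start:base + start + length - 1]
        if tag < "010":                  # control field: no indicators
            fields.append((tag, data.decode("ascii")))
        else:                            # skip 2 indicators, split subfields
            chunks = data[2:].split(SUBFIELD)[1:]
            fields.append((tag, [(c[:1].decode("ascii"), c[1:].decode("ascii"))
                                 for c in chunks]))
    return fields


def build_marc(fields):
    """Assemble a record from (tag, raw field bytes) pairs, for testing."""
    directory, body = b"", b""
    for tag, data in fields:
        field = data + FIELD_END
        directory += b"%s%04d%05d" % (tag.encode("ascii"), len(field), len(body))
        body += field
    directory += FIELD_END
    base = 24 + len(directory)
    total = base + len(body) + 1
    leader = b"%05dnam  22%05d   4500" % (total, base)
    return leader + directory + body + RECORD_END


# Hypothetical sample record: a control number and a title.
sample = build_marc([("001", b"12345"),
                     ("245", b"10" + SUBFIELD + b"aLibrary 2000")])
parsed = parse_marc(sample)
print(parsed)
```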

Currently, Manish Muzumdar is developing a general client interface
toolkit, and Robert Miller is working on the indexer overhaul.


JOINT PROJECTS

Library 2000 is working closely with two other groups interested in
this area: the M.I.T. Distributed Library Initiative, and a group of
computer science departments that are collaborating on technical
report distribution.  These collaborations represent an opportunity to
obtain real, production library materials in electronic form, and also
an opportunity to influence the architecture of future library
systems.

The Distributed Library Initiative is a new, joint activity of the
M.I.T. Libraries and M.I.T. Information Systems.  Its purpose is to
develop and experiment with new technologies for the M.I.T. Library
system.  The DLI is installing prototypes of possible future library
electronic delivery systems, and is working with publishers who are
undertaking experiments with electronic distribution systems.  Library
2000 is working with DLI at several levels to help develop the DLI
plans, to look for opportunities to avoid unnecessary mismatches of
data formats, protocols, and programming interfaces, and to make joint
use of software where feasible.  The acquisition by Library 2000 of
the M.I.T. library catalog and its update stream is one tangible
result of this collaboration.

The technical report collaboration represents an extension of the
prototype system in another interesting direction, that of linking
indexed text with page images.  The plan of the collaborators is that
each university (initially Stanford, the University of California at
Berkeley, Cornell, Carnegie-Mellon, and M.I.T.) would place
page-images of its computer science technical reports in a local
server.  Each would also prepare electronic bibliographic records and
distribute them to the other participants.  Each participant would
then take this set of bibliographic records, place it in some local
index/access system, and work out ways of providing links from those
records to the local and distant page image servers.  This plan has
progressed to the point that the Corporation for National Research
Initiatives, acting as coordinator, has made a proposal that DARPA
fund the project.  In anticipation of funding, an online discussion
group has already developed a proposed format for exchange of the
bibliographic records.  The M.I.T. portion of the proposal is actually
a joint activity of Library 2000 and the M.I.T. Library System, with
the library system planning to do scanning and paper delivery, and a
potentially expanded version of the Library 2000 prototype as the
underlying search, storage, and presentation system.


PLANS

The general plan of the Library 2000 project is to continue the
development, iteration, and extension of the prototype until it
becomes a full-fledged on-line library system.  The immediate
extensions underway are those already mentioned, to the full M.I.T.
library card catalog, and to the page images of the technical report
collaborative project.

The addition of page images actually leads to extensions in several
different areas, each exposing a variety of new system design
problems:

  - development of an image storage service and storage service
        protocol.
  - development of an X-based client for query and image display.
  - collection of original word-processing text for full-text indexing.
  - investigation of how to relate the ASCII text to the corresponding
        scanned images.
  - development of a document linking plan.

Two other, related projects are also contemplated:

  -  implementation of a simple replication system that works well
        under geographic separation of the replicas.
  -  development of a collection discovery and rendezvous system.

These plans are actually too ambitious when compared with the intended
size of the research group; some selection will occur.

*********************************************************************

Talks:

Saltzer, J. H.  Technology, Networks, and the Library of the Future.
Lecture given at M.I.T. EECS Department Colloquium, October 28, 1991.


Publications:

Gong, Li, Lomas, T. M. A., Saltzer, J. H., and Needham, R. M.,
"Protecting Weak Secrets from Guessing Attacks," submitted to the IEEE
Journal on Selected Areas in Communications.

Saltzer, J. H., "File System Indexing, and Backup," in Operating
Systems for the 90's and Beyond, Lecture Notes in Computer Science
563, edited by Arthur Karshmer and Juergen Nehmer, Springer-Verlag,
New York, 1991, pp. 13-19.


Theses Supervised:

Weiss, R.  Integration of the Organization Engine and Library 2000.
Undergraduate honors thesis, Brandeis University Department of
Computer Science, May, 1992.



------- End of Forwarded Message

