[913] in Commercialization & Privatization of the Internet


how big is the LoC?

daemon@ATHENA.MIT.EDU (Craig Partridge)
Fri Jul 5 05:16:55 1991

To: com-priv@psi.com
From: Craig Partridge <craig@sics.se>
Date: Fri, 05 Jul 91 11:15:24 +0200


    Yesterday's International Herald Tribune carried a shortened article
by John Markoff on information retrieval over data networks.

    At one point the article mentions that the Library of Congress has
80 million items which represent an estimated 25 terabytes of data.
While that estimate is much higher than some I've heard previously
(which were in the gigabytes range), I think that number is still
woefully low, by orders of magnitude.

First I did a bit of reverse engineering of the number:

    25 terabytes divided by 80 million items gives an average of
    312,500 bytes per item. If we estimate the average page contains
    about 35 lines of text with 50 chars per line, or 1,750 bytes per
    page (both numbers I think are conservative), we find that each
    item has about 180 pages in it.  So it sounds like the terabytes
    number assumes everything is character data.
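
The reverse engineering above can be checked in a few lines. This is only a sketch in Python, assuming decimal units (1 terabyte = 10^12 bytes) and one byte per character; the variable names are illustrative and do not come from the article.

```python
# Back out the per-item size implied by the article's figures.
# Assumptions: decimal terabytes (10**12 bytes), one byte per character.
TOTAL_BYTES = 25 * 10**12      # 25 terabytes
ITEMS = 80 * 10**6             # 80 million items

bytes_per_item = TOTAL_BYTES / ITEMS        # 312,500 bytes
bytes_per_page = 35 * 50                    # 35 lines of 50 chars = 1,750 bytes
pages_per_item = bytes_per_item / bytes_per_page

print(f"{bytes_per_item:,.0f} bytes per item")
print(f"{pages_per_item:.1f} pages per item")   # 178.6, i.e. about 180
```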

Problem is, a lot of stuff isn't character data.  The LoC has an
extensive manuscript collection (including several sets of presidential
papers), maps, and a fine rare book collection.  Also, some of the
books are more than just bytes -- they are fine examples of the
printing art.  Those kinds of items probably have to be stored and
retrieved as images if researchers are to properly use them.  So
the next problem is how big an image is.  I decided to try to
be conservative:

    Let's assume each page is a 5-1/2 by 8-1/2 inch page containing a
    5 * 7 inch image, and that we are satisfied with Apple LaserWriter
    image quality (300 pixels to the inch).   Note we probably aren't
    really satisfied with LaserWriter quality, nor, in fact, with
    black and white images (I'd like to see the hand drawn capital letters
    in the Gutenberg bible in their full color), and many images are
    far larger than 5 * 7, but it is a start.

    Anyway, a 5 * 7 image at one bit per pixel is about 390,000 bytes:
    (5 * 7 * 300 * 300)/8.

    So one 180 page item, as a set of images, is 70,200,000 bytes,
    or about 200 times larger than is consistent with a 25 terabyte
    estimate. [Some might argue that manuscripts are typically shorter --
    that's true, but some are also gigantically larger -- like Washington's
    papers -- which might be cataloged as one item].

    If we add 32-bit color pixels to the equation, we're about 6,000
    times larger...  Improving bit resolution for the black and white
    images also boosts size a lot.
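
The image arithmetic above can be sketched the same way. This assumes uncompressed bitmaps, the 180 pages per item derived earlier, and the 312,500 bytes per item implied by the 25 terabyte figure; the function and constant names are hypothetical. Without rounding, the ratios come out a little above the rough "200" and "6,000" quoted above (about 227 and about 7,258).

```python
# Size of a scanned item under the note's assumptions (no compression).
DPI = 300                         # LaserWriter-quality resolution
WIDTH_IN, HEIGHT_IN = 5, 7        # image area in inches
PAGES_PER_ITEM = 180              # from the text-based estimate
TEXT_BYTES_PER_ITEM = 312_500     # 25 terabytes / 80 million items

pixels_per_page = WIDTH_IN * HEIGHT_IN * DPI * DPI    # 3,150,000 pixels

def page_bytes(bits_per_pixel):
    """Bytes needed for one uncompressed page image."""
    return pixels_per_page * bits_per_pixel / 8

for label, bpp in [("1-bit black and white", 1), ("32-bit color", 32)]:
    per_page = page_bytes(bpp)
    per_item = per_page * PAGES_PER_ITEM
    ratio = per_item / TEXT_BYTES_PER_ITEM
    print(f"{label}: {per_page:,.0f} bytes/page, "
          f"{per_item:,.0f} bytes/item, {ratio:,.0f}x the text estimate")
```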

So the question is, how much of the stuff in the LoC must be stored as
images? I don't know -- I haven't had a chance to call the LoC and ask for
an estimate of the number of manuscript items in their collection and
I'm about to go on vacation for a week or so and wanted to pound this
out before I forgot the subject again.  However, I'll bet the LoC manuscript
holdings are in the millions.  By the rough back-of-the-envelope calculation
above, every million items will consume in excess of 70 terabytes...
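
To see how the total scales with the number of image-only items, here is a small sketch using the 70,200,000 bytes per item figure from above. The holdings counts in the loop are placeholders for illustration, not LoC figures, and a terabyte is taken as 10^12 bytes.

```python
# Storage for image-only items, using the per-item figure from the note.
BYTES_PER_ITEM = 70_200_000       # one 180-page item as 1-bit page images
TERABYTE = 10**12                 # decimal terabyte (an assumption)

def image_terabytes(items):
    """Terabytes needed to hold `items` items as uncompressed page images."""
    return items * BYTES_PER_ITEM / TERABYTE

for millions in (1, 5, 10):       # hypothetical holdings, not LoC figures
    print(f"{millions} million items: "
          f"{image_terabytes(millions * 10**6):,.1f} terabytes")
```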

One might also consider the fun of sending the contents of the US National
Archives, or the British Public Record Office, both of which have holdings
in the millions.

Craig Partridge (craig@sics.se)
on sabbatical at SICS

PS: One key motivation for this note is all the comments about a gigabit
network being so fast that one can send "the total of human knowledge"
in a few minutes or hours.  The accumulated detritus of history is not
that small...
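
For a rough sense of scale, the sketch below estimates transfer times over an idealized link. It assumes a sustained 1 gigabit per second with no protocol overhead, which is my assumption rather than a figure from any particular network, and it reuses the byte counts from the note.

```python
# Idealized transfer time: sustained 1 gigabit/s, no protocol overhead.
LINK_BITS_PER_SECOND = 10**9

def transfer_hours(num_bytes):
    """Hours needed to push num_bytes through the idealized link."""
    return num_bytes * 8 / LINK_BITS_PER_SECOND / 3600

text_estimate = 25 * 10**12               # the article's 25 terabytes
image_million = 70_200_000 * 10**6        # one million items as page images

print(f"25 terabytes: {transfer_hours(text_estimate):.1f} hours")
print(f"one million image items: {transfer_hours(image_million) / 24:.1f} days")
```

Under these assumptions even the article's 25 terabytes takes more than two days, and each million image items adds about six and a half days.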
