[913] in Commercialization & Privatization of the Internet
how big is the LoC?
daemon@ATHENA.MIT.EDU (Craig Partridge)
Fri Jul 5 05:16:55 1991
To: com-priv@psi.com
From: Craig Partridge <craig@sics.se>
Date: Fri, 05 Jul 91 11:15:24 +0200
Yesterday's International Herald Tribune carried a shortened article
by John Markoff on information retrieval over data networks.
At one point the article mentions that the Library of Congress has
80 million items which represent an estimated 25 terabytes of data.
While that estimate is much higher than some I've heard previously
(which were in the gigabytes range), I think that number is still
woefully low (orders of magnitude).
First I did a bit of reverse engineering of the number:
25 terabytes divided by 80 million items gives an average of
312,500 bytes per item. If we estimate the average page contains
about 35 lines of text with 50 chars per line (both numbers I
think are conservative) we find that each item has about 180
pages in it. So it sounds like the terabytes number assumes
everything is character data.
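
The reverse engineering above can be checked with a short Python sketch. It assumes a terabyte is 10**12 bytes and that one character occupies one byte; the variable names are illustrative only.

```python
# Reverse-engineer the 25 terabyte estimate for the Library of Congress.
# Assumptions: 1 terabyte = 10**12 bytes, 1 character = 1 byte.
TOTAL_BYTES = 25 * 10**12      # quoted size of the collection
ITEMS = 80 * 10**6             # quoted number of items
LINES_PER_PAGE = 35            # page layout used in the text
CHARS_PER_LINE = 50

bytes_per_item = TOTAL_BYTES // ITEMS               # 312,500
bytes_per_page = LINES_PER_PAGE * CHARS_PER_LINE    # 1,750
pages_per_item = bytes_per_item / bytes_per_page    # roughly 178.6

print(f"{bytes_per_item:,} bytes per item")
print(f"{bytes_per_page:,} bytes per page")
print(f"{pages_per_item:.1f} pages per item")
```

The result is just under 179 pages, which the text rounds to 180.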
Problem is, a lot of stuff isn't character data. The LoC has an
extensive manuscript collection (including several sets of presidential
papers), maps, and a fine rare book collection. Also, some of the
books are more than just bytes -- they are fine examples of the
printing art. Those kinds of items probably have to be stored and
retrieved as images if researchers are to properly use them. So
the next problem is how big an image is. I decided to try to
be conservative:
Let's assume each page is a 5-1/2 by 8-1/2 inch page containing a
5 * 7 inch image, and that we are satisfied with Apple LaserWriter
image quality (300 pixels to the inch). Note we probably aren't
really satisfied with LaserWriter quality, nor, in fact, with
black and white images (I'd like to see the hand-drawn capital letters
in the Gutenberg Bible in their full color), and many images are
far larger than 5 * 7, but it is a start.
Anyway, at one bit per pixel a 5 * 7 image is about 390,000 bytes: (5 * 7 * 300 * 300)/8.
So one 180 page item, as a set of images, is 70,200,000 bytes,
or about 200 times larger than is consistent with a 25 terabyte
estimate. [Some might argue that manuscripts are typically shorter --
that's true, but some are also gigantically larger -- like Washington's
papers -- which might be cataloged as one item].
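
As a sketch of the image arithmetic above, the following Python reproduces the per-page and per-item sizes. It assumes one bit per pixel (plain black and white) and reuses the 312,500 bytes per item implied by the 25 terabyte estimate; the names are illustrative.

```python
# Size of a 180-page item stored as 300 dpi black and white page images.
# Assumption: 1 bit per pixel, no compression.
WIDTH_IN, HEIGHT_IN = 5, 7         # image area in inches
DPI = 300                          # LaserWriter resolution
PAGES_PER_ITEM = 180
TEXT_BYTES_PER_ITEM = 312_500      # from 25 terabytes / 80 million items

pixels_per_page = (WIDTH_IN * DPI) * (HEIGHT_IN * DPI)   # 3,150,000
exact_page_bytes = pixels_per_page // 8                  # 393,750
rounded_page_bytes = 390_000                             # figure used in the text

item_bytes = PAGES_PER_ITEM * rounded_page_bytes         # 70,200,000
ratio = item_bytes / TEXT_BYTES_PER_ITEM                 # about 225

print(f"{exact_page_bytes:,} bytes per page (exact)")
print(f"{item_bytes:,} bytes per item")
print(f"{ratio:.0f} times the text-only size")
```

With the rounded page size the ratio comes out near 225, consistent with the "about 200 times" figure.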
If we add 32-bit color pixels to the equation, we're about 6,000
times larger than the text-only figure (32 times the factor of 200
above)... Improving bit resolution for the black and white
images also boosts size a lot.
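
To see how pixel depth scales the estimate, here is a small Python sketch. The function name and the choice of depths (1, 8, and 32 bits per pixel) are illustrative assumptions, and it uses the unrounded page size, so the factors come out somewhat higher than the rounded ones in the text.

```python
# How bits per pixel changes the size of one 180-page image item.
# Assumptions: uncompressed images, 5 * 7 inches at 300 dpi.
TEXT_BYTES_PER_ITEM = 312_500   # implied by the 25 terabyte estimate


def item_image_bytes(bits_per_pixel, width_in=5, height_in=7, dpi=300, pages=180):
    """Bytes needed to store one item as uncompressed page images."""
    pixels_per_page = (width_in * dpi) * (height_in * dpi)
    return pages * pixels_per_page * bits_per_pixel // 8


for depth in (1, 8, 32):
    size = item_image_bytes(depth)
    factor = size / TEXT_BYTES_PER_ITEM
    print(f"{depth:2d} bits/pixel: {size:,} bytes, about {factor:,.0f}x the text estimate")
```

At 32 bits per pixel the unrounded factor is a little over 7,000, the same order of magnitude as the rough figure above.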
So the question is, how much of the stuff in the LoC must be stored as
images? I don't know -- I haven't had a chance to call the LoC and ask for
an estimate of the number of manuscript items in their collection and
I'm about to go on vacation for a week or so and wanted to pound this
out before I forgot the subject again. However, I'll bet the LoC manuscript
holdings are in the millions. By the rough back-of-the-envelope calculation
above, every million items will consume in excess of 70 terabytes...
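
A rough Python sketch of that scaling follows. The item counts in the loop are hypothetical placeholders, since the actual size of the manuscript holdings is not known here; a terabyte is taken as 10**12 bytes.

```python
# Storage needed for manuscript items kept as black and white page images.
# Assumption: 70,200,000 bytes per item, as in the back-of-the-envelope figure.
BYTES_PER_IMAGE_ITEM = 70_200_000
TERABYTE = 10**12
QUOTED_ESTIMATE_TB = 25           # the figure for the entire collection


def manuscript_terabytes(item_count):
    """Terabytes required to hold item_count image items."""
    return item_count * BYTES_PER_IMAGE_ITEM / TERABYTE


# Hypothetical holdings sizes, in millions of items.
for millions in (1, 5, 10):
    tb = manuscript_terabytes(millions * 1_000_000)
    print(f"{millions:2d} million items: {tb:,.1f} TB "
          f"({tb / QUOTED_ESTIMATE_TB:.1f}x the 25 TB estimate)")
```

Even a single million image items comes to nearly three times the 25 terabytes quoted for the whole library.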
One might also consider the fun of sending the contents of the US National
Archives, or the British Public Record Office, both of which have holdings
in the millions.
Craig Partridge (craig@sics.se)
on sabbatical at SICS
PS: One key motivation for this note is all the comments about a gigabit
network being so fast that one can send "the total of human knowledge"
in a few minutes or hours. The accumulated detritus of history is not
that small...