[923] in Commercialization & Privatization of the Internet


how big is the LoC?

daemon@ATHENA.MIT.EDU (Barry Shein)
Tue Jul 9 02:47:38 1991

Date: Tue, 9 Jul 91 02:15:37 -0400
From: bzs@world.std.com (Barry Shein)
To: jqj@duff.uoregon.edu
Cc: craig@sics.se, com-priv@uu.psi.com
In-Reply-To: jqj@duff.uoregon.edu's message of Mon, 08 Jul 91 10:29:01 MDT <9107081728.AA10409@phloem.uoregon.edu>


>From: jqj@duff.uoregon.edu
>I don't think your calculations really address the question.  The LoC does
>not store its data as bitmaps, so why should a raster bitmap be the
>appropriate way to measure the size of a graphical item in the LoC?  For
>example, we might do the calculation assuming some reasonable compression
>algorithm.  If we figure 1:2 for text and 1:20 for 8 bit images, we get 1
>million graphical items taking only 4 TB, and 80 million 300KB text items
>taking 12 TB.  So 25 TB total might in fact be reasonable (or might not).

I have done some thinking on this issue and would like to comment.

It would seem to me foolhardy to scan 80M texts, an enormous job, and
not save high-quality bitmaps. Storing only OCR output, or anything
other than a bitmap compression that returns the original, would quite
possibly force future generations to redo much of the work, either as
better algorithms appear (any bulk OCR is error-prone, at a rate of at
least one error per page) or as new questions arise.

For example, suppose you have stored the text of a book as OCR'd
ASCII, and my research is on the fonts and typographical conventions
used by that particular publisher or era?

We would have thrown away enormous amounts of information.

I think the only safe storage method, other than for small collections
(e.g. reference) which will inevitably be redone, is a bitmap scan at a
resolution equal to or higher than that of the original document,
basically a high-quality digital photo. I believe we can reasonably
calculate the highest scan resolution that is necessary.

We might well then store a highly compressed and processed version
(certainly OCR'd) for on-line use with current technologies, and
archive the bitmaps for posterity elsewhere, off-line. Since they
are only archived, not kept on-line, they can be stored fairly cheaply.

Even your high figure of 1000TB is about 500,000 Exabyte tapes, not
really that many objects for someone like the LoC to store. And we can
assume density improvements of 100 or 1000 fold in the not too distant
future, certainly long before the scanning is finished. By then it
might be only 500 tapes, a mere bagatelle for archivists.
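
As a rough check of the tape arithmetic, here is a minimal sketch in
Python. The 2 GB per-tape capacity is an assumption, not a figure from
this message; it is simply what 1000TB over 500,000 tapes implies. The
name tapes_needed is made up for illustration, and decimal units are
used throughout.

```python
# Rough check of the tape counts quoted in the message.
# Assumption: about 2 GB per Exabyte tape, the capacity implied by
# 1000 TB / 500,000 tapes. Decimal units (1 TB = 10**12 bytes).

TB = 10**12
GB = 10**9


def tapes_needed(archive_bytes, tape_bytes):
    """Whole tapes needed to hold archive_bytes (ceiling division)."""
    return -(-archive_bytes // tape_bytes)


archive = 1000 * TB   # the "high figure" from the quoted message
tape = 2 * GB         # assumed capacity of one tape today

# Tape counts today and after 100-fold and 1000-fold density gains.
for factor in (1, 100, 1000):
    print(f"{factor:>5}x density: {tapes_needed(archive, tape * factor):,} tapes")
```

Under that assumption the counts come out to 500,000 tapes today, 5,000
after a 100-fold density improvement, and 500 after a 1000-fold one.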

Perhaps my comments are a bit orthogonal to yours, but I believe they
are important considerations, particularly where cultural artifacts,
whose mere appearance is likely to be of value, are concerned.

        -Barry Shein

Software Tool & Die    | bzs@world.std.com          | uunet!world!bzs
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD
