[731] in SIPB-AFS-requests

lq-text project volume

daemon@ATHENA.MIT.EDU (Calvin Clark)
Sun Aug 2 21:49:05 1992

Date: Sun, 2 Aug 92 21:48:41 -0400
From: Calvin Clark <ckclark@mit.edu>
To: sipb-afsreq@Athena.MIT.EDU
Reply-To: ckclark@mit.edu

I created a volume for working on the lq-text package.  By "working" I
mean porting to Athena platforms.  I believe it's a useful package, and
I managed to get an earlier version to work on the DECstation, using it
to index portions of the Shakespeare locker.


Here is a description from the README file:

Lqtext is a text retrieval package.

That means you can tell it about lots of files, and later you can ask
it questions about them.
The questions have to be
        which files contain this word?
        which files contain this phrase?
but this information turns out to be rather useful.

Lqtext has been designed to be reasonably fast.  It uses an inverted
index, which is simply a kind of database.  The index tends to be smaller
than the data it indexes, but more than half as large.  You still need to
keep the original data.
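
To illustrate the idea of an inverted index, here is a minimal sketch in
Python.  It is not lq-text's actual implementation or on-disk format; the
function names (tokenize, build_index, files_with_word, files_with_phrase)
and the in-memory dictionary layout are assumptions made up for this
example.  It answers the same two questions: which files contain this
word, and which files contain this phrase.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(docs):
    """Map each word to a list of (filename, word position) postings.

    `docs` is a hypothetical dict of filename -> file contents.
    """
    index = defaultdict(list)
    for name, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].append((name, pos))
    return index


def files_with_word(index, word):
    """Return the sorted names of files that contain `word`."""
    return sorted({name for name, _ in index.get(word.lower(), [])})


def files_with_phrase(index, phrase):
    """Return the sorted names of files containing the words of `phrase` in order."""
    words = tokenize(phrase)
    if not words:
        return []
    # Start with every occurrence of the first word, then keep only those
    # occurrences that are followed by each later word at the right offset.
    candidates = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        postings = set(index.get(word, []))
        candidates = {(n, p) for (n, p) in candidates if (n, p + offset) in postings}
    return sorted({name for name, _ in candidates})
```

With an index built from a few small files, files_with_word(index, "wept")
lists every file containing the word, while files_with_phrase(index,
"wept bitterly") keeps only the files where the two words are adjacent.
Storing a position with every posting is what makes phrase lookup possible,
and it is also one reason an index of this kind is not tiny compared with
the data.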

Commands are
        lqword -- information about words
        lqphrase -- look up phrases
        lqaddfile -- add files to the database (at any time)
        lqshow -- show the matches on the screen (uses curses)
        lqtext -- curses-based front end.
        lq -- shell-script front end
        lqkwik -- creates keyword-in-context indexes (this is fun!)

There are about 11,000 lines of C in total, of which 8,000 are the
text database and 3,000 are the curses front end (lqtext).  Well, last time
I counted, anyway.

Here are some examples, based mostly on the (King James) New Testament,
simply because that is what I have lying around.

$ time lqphrase 'wept bitterly'
0000017 0000032 NT/Matthew/matt26.kjv
0000013 0000027 NT/Luke/luke22.kjv
real        0.2
user        0.0
sys         0.1

$ time lqword 'jesus' > /dev/null
real        1.0
user        0.6
sys         0.2
$ time lqword 'jesus' > XXX
real        1.0
user        0.6
sys         0.3
$ wc XXX
    986   6896  59907 XXX
$ cat XXX
       WID | Where   | Total   | Word
===========|=========|=========|============================================
       308 |    4736 |     983 | jesu
               Jesus |     0/  2 F=99 | NT/Matthew/matt01.kjv
               Jesus |     2/ 41 F=3  | NT/Matthew/matt01.kjv
               Jesus |     3/ 14 F=99 | NT/Matthew/matt01.kjv
(and so on for 983 lines)

So there are nine hundred and eighty-three matches.  The rest of
the listing shows, for each match, the block in the file, the word
within the block, a flags field and the filename.
The "Where" in the header shows the address in the database, and
WID is the word's unique identifier.
The above timings were on a 16 MHz 386.
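
As a rough illustration of the match-line layout described above, the
following Python sketch splits one line of lqword output into its fields.
The regular expression is inferred only from the three sample lines shown
here, so it is an assumption about the format, not a specification; the
name parse_match_line is made up for this example, and the flags value is
kept as a string because its encoding is not documented in this message.

```python
import re

# Pattern inferred from the sample lines above (an assumption):
#   <word> | <block>/<word in block> F=<flags> | <filename>
MATCH_LINE = re.compile(
    r"^\s*(?P<word>\S+)\s*\|"
    r"\s*(?P<block>\d+)\s*/\s*(?P<word_in_block>\d+)\s+F=(?P<flags>\S+)\s*\|"
    r"\s*(?P<filename>.+?)\s*$"
)


def parse_match_line(line):
    """Return a dict of fields for a match line, or None for any other line."""
    m = MATCH_LINE.match(line)
    if m is None:
        return None
    return {
        "word": m.group("word"),
        "block": int(m.group("block")),
        "word_in_block": int(m.group("word_in_block")),
        "flags": m.group("flags"),  # kept as text; meaning not documented here
        "filename": m.group("filename"),
    }
```

Header and separator lines do not fit the pattern, so the function returns
None for them, and a caller can simply skip those lines when reading the
whole listing.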