[185] in Information Retrieval
interesting stuff at Hahvard
daemon@ATHENA.MIT.EDU (Tom Owens)
Mon Jul 26 14:58:25 1993
To: elibdev@MIT.EDU
Date: Mon, 26 Jul 93 14:55:31 EDT
From: Tom Owens <owens@Athena.MIT.EDU>
------- Forwarded Message
Harvard University
Computer Science Systems Seminar
SCATTER/GATHER: A NEW BROWSING PARADIGM FOR INFORMATION RETRIEVAL
David Karger
` Thursday, July 29, 1993
4PM
Aiken Computation Laboratory 101
Tea at 3:30 in Aiken Main Lobby
With the ongoing explosion in the amount of information available
on-line, the question of how to actually retrieve useful information
has become increasingly important. Classical information retrieval
techniques have focused on keyword approaches: the user gives a query
by linking relevant keywords with boolean operators, for example
"communist AND dictator AND Russia AND NOT Brezhnev."
Keyword based approaches suffer from several drawbacks. They require
that the user have a good understanding of the corpus he is querying
("computer" might be a good search term in an encyclopedia, but would
be useless in a database of ACM abstracts); and also that he have a
narrowly focused query (faced with a corpus of New York Times
articles, how might keyword search help determine "what happened this
year?"). In a certain sense, keyword based queries provide the
functionality of the index in the back of a book: useful if you are
familiar with the contents and have a specific information need;
useless if you have not seen the book before or are examining a
broader question.
We have proposed and implemented a new retrieval paradigm called
scatter/gather. It is based on automated clustering of documents by
subject. It presents the user with a table of contents for the corpus
under examination, and allows the user to home in on interesting
topics. This provides a browsing tool which complements and extends
traditional search techniques. This tool is fully automated,
requiring no human preprocessing to assign keywords or subject labels
to the documents. Instead, it is the computer which determines the
topics; the user need only determine which topics appear relevant to
his information need. Keyword searches require that the user come up
with the right question; with scatter/gather it is the computer which
asks the questions, while the user has the much easier task of
answering them.
Scatter/gather also serves as an effective companion to keyword
search. It can be used before searching to help formulate a good
query, and it can be used after keyword search on those many occasions
when a query retrieves too many documents to examine individually.
To support scatter/gather, several algorithmic questions had to be
answered. Classical document clustering algorithms required time
quadratic in the number of documents in the collection. This is
prohibitively expensive, even as a preprocessing step, when corpora
can contain billions of documents. We therefore developed new
clustering algorithms which run in linear time. While linear time is
feasible off-line, once the user arrives an information retrieval
system must respond in constant time (typically under 10 seconds)
regardless of the size of a corpus. We therefore developed linear
time preprocessing algorithms and data structures which allow us to
support scatter/gather in constant time on arbitrarily large document
collections.
Joint work with Doug Cutting, Jan Pedersen, and John Tukey.
Host: Cliff Young
------- End of Forwarded Message