[185] in Information Retrieval

home help back first fref pref prev next nref lref last post

interesting stuff at Hahvard

daemon@ATHENA.MIT.EDU (Tom Owens)
Mon Jul 26 14:58:25 1993

To: elibdev@MIT.EDU
Date: Mon, 26 Jul 93 14:55:31 EDT
From: Tom Owens <owens@Athena.MIT.EDU>


------- Forwarded Message

                          Harvard University
                   Computer Science Systems Seminar



  SCATTER/GATHER: A NEW BROWSING PARADIGM FOR INFORMATION RETRIEVAL

                             David Karger


`                      Thursday, July 29, 1993
                                 4PM
                   Aiken Computation Laboratory 101
                   Tea at 3:30 in Aiken Main Lobby

With  the ongoing  explosion  in the amount   of information available
on-line, the  question of how  to actually retrieve useful information
has become  increasingly important.   Classical  information retrieval
techniques have focused on keyword approaches: the  user gives a query
by  linking relevant keywords  with   boolean  operators, for  example
"communist AND dictator AND Russia AND NOT Brezhnev."

Keyword based approaches suffer from  several drawbacks.  They require
that the user  have a good understanding  of the corpus he is querying
("computer" might be a good search term in an encyclopedia,  but would
be useless  in a  database of ACM abstracts); and  also that he have a
narrowly  focused query    (faced with  a corpus   of   New York Times
articles, how might keyword  search help determine "what happened this
year?").  In a   certain  sense,  keyword based  queries   provide the
functionality of the  index in  the back of a  book: useful if you are
familiar  with the  contents and have  a  specific  information  need;
useless if you  have not  seen  the book  before  or  are examining  a
broader question.

We have proposed  and   implemented a new  retrieval  paradigm  called
scatter/gather.  It is based on  automated clustering of documents  by
subject.  It presents the user with a table of contents for the corpus
under  examination,  and allows  the   user to home  in on interesting
topics.  This provides a  browsing tool which  complements and extends
traditional  search  techniques.   This   tool  is   fully  automated,
requiring no human preprocessing  to assign keywords or subject labels
to  the documents.  Instead, it  is  the computer which determines the
topics; the  user need only determine which  topics appear relevant to
his information need.  Keyword searches require that  the user come up
with the right question; with scatter/gather it is  the computer which
asks the  questions,  while the   user  has the  much easier  task  of
answering them.

Scatter/gather   also serves as    an  effective companion  to keyword
search.   It can be  used before searching  to   help formulate a good
query, and it can be used after keyword search on those many occasions
when a query retrieves too many documents to examine individually.

To  support scatter/gather, several algorithmic  questions   had to be
answered.  Classical  document   clustering algorithms  required  time
quadratic in  the number  of  documents  in  the collection.  This  is
prohibitively expensive,  even as  a preprocessing  step, when corpora
can contain   billions of  documents.    We  therefore developed   new
clustering algorithms which run in  linear time.  While linear time is
feasible off-line,  once  the  user arrives an  information  retrieval
system must respond  in   constant time (typically under  10  seconds)
regardless of the  size of  a corpus.  We   therefore developed linear
time  preprocessing algorithms and data  structures  which allow us to
support scatter/gather in constant  time on arbitrarily large document
collections.

Joint work with Doug Cutting, Jan Pedersen, and John Tukey.


Host: Cliff Young



------- End of Forwarded Message


home help back first fref pref prev next nref lref last post