[5498] in www-talk@info.cern.ch


Re: Searching (was Re: Lotus Notes -- Too much Hype !!!)

daemon@ATHENA.MIT.EDU (Paul Everitt)
Thu Sep 8 15:23:00 1994

Date: Thu, 8 Sep 1994 21:21:07 +0200
Errors-To: listmaster@www0.cern.ch
Reply-To: paul@cminds.com
From: Paul Everitt <paul@cminds.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>


On Thu, 8 Sep 1994, Nick Arnett wrote:

> Does anyone know of an engine that does this yet?  We're talking with some
> SGML experts so that we can figure out the right technical strategy and
> we're talking with everyone on this list about HTML...

Check out Harvest:
	http://rd.cs.colorado.edu/harvest/

It has a gatherer that uses Essence to do customized extraction.  There 
is a lot of theory behind it, all of which makes me dizzy.  I'm beta 
testing Harvest now, and it is pretty powerful stuff, with strong 
indexing, replication, caching, etc.

> Clearly the potential is tremendous, but the search engines have to have a
> document model that mirrors the right aspects of structured text.

The Harvest system first uses heuristics to determine what kind of file 
it has -- HTML, FAQ, etc. -- and then, since it knows about the structure 
of those file types, it can extract type-specific info.  I'm using it to 
build in bibliographic information via (as you mentioned below) META tags.
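
To make the two steps concrete, here is a rough sketch in Python.  It is 
not Essence's code or its actual heuristics: guess_type, MetaExtractor 
and summarize are names made up for illustration, and the rules are toy 
ones.  It guesses a type from the file name and the first few hundred 
characters, then, for HTML only, collects the TITLE and any META 
name/content pairs.

```python
from html.parser import HTMLParser


def guess_type(name, text):
    """Toy heuristic (hypothetical): file name first, then content."""
    head = text[:512].lower()
    lowered = name.lower()
    if lowered.endswith((".html", ".htm")) or "<html" in head or "<title" in head:
        return "HTML"
    if "faq" in lowered or "frequently asked questions" in head:
        return "FAQ"
    return "Text"


class MetaExtractor(HTMLParser):
    """Collects <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        attributes = dict(attrs)
        if tag == "meta":
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.fields[name.lower()] = content
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] = self.fields.get("title", "") + data


def summarize(name, text):
    """Return (type, fields); only HTML gets type-specific extraction here."""
    kind = guess_type(name, text)
    fields = {}
    if kind == "HTML":
        parser = MetaExtractor()
        parser.feed(text)
        parser.close()
        fields = parser.fields
    return kind, fields
```

Calling summarize("notes.html", page_text) on a page that carries a 
TITLE and a META tag named Author would give back "HTML" plus a 
dictionary with "title" and "author" keys.  A real gatherer would have 
an extractor per file type rather than the single HTML case shown here.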

> FYI, most engines at best just have the notion of "zones" -- phrases,
> sentences and paragraphs -- and attributes, which would be fielded
> meta-information.  We are designing a generic capability to take header
> information from an HTML document and put it into our attribute fields.
> I'm digging up the old "META" discussion to see what, if anything, we
> decided is a minimal standard.

It is pretty interesting how Harvest does this.  It builds what is known 
as a SOIF object (like an IAFA template) that summarizes the document.  
Lots of things happen after that, but the SOIF summary is what allows 
for sophisticated queries.
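
For a feel of what such a summary object looks like, here is a small 
Python sketch that writes one out.  It assumes the SOIF layout as I 
understand it: a "@TYPE { URL" header line, then one 
"Name{size}:<TAB>value" line per attribute with the size given in bytes, 
then a closing brace.  Check the Harvest documentation before relying 
on the details.  The function name to_soif and the sample attributes 
are made up for illustration.

```python
def to_soif(template_type, url, fields):
    """Serialize attribute/value pairs in an assumed SOIF-style layout."""
    lines = ["@%s { %s" % (template_type, url)]
    for name, value in fields.items():
        # The size in braces is the byte length of the value (assumed UTF-8).
        size = len(value.encode("utf-8"))
        lines.append("%s{%d}:\t%s" % (name, size, value))
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample attributes are illustrative, not taken from a real gatherer run.
    sample = {"Title": "Home", "Author": "Paul Everitt"}
    print(to_soif("FILE", "http://www.cminds.com/", sample), end="")
```

Because every value carries its own length, a reader of the stream can 
take values that contain newlines or braces without any escaping, which 
is presumably part of why the format suits fielded queries.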

Paul Everitt             V 703.785.7384  Email Paul.Everitt@cminds.com
Connecting Minds, Inc.   F 703.785.7385  WWW   http://www.cminds.com/ 

