[107644] in Cypherpunks
Re: document similarity digest functions
daemon@ATHENA.MIT.EDU (Adam Back)
Wed Jan 20 08:18:41 1999
Date: Thu, 21 Jan 1999 11:45:02 GMT
From: Adam Back <aba@dcs.ex.ac.uk>
To: jim@acm.org
CC: cypherpunks@cyberpass.net
In-reply-to: <36A50232.54C31524@acm.org> (message from Jim Gillogly on Tue, 19
Jan 1999 14:07:46 -0800)
Reply-To: Adam Back <aba@dcs.ex.ac.uk>
Jim Gillogly <jim@acm.org> writes:
> Adam Back suggests:
> > A stateful method of doing this would be to use one of the letter
> > triple document similarity metrics, to compare the given message for
> > closeness to previously posted documents.
> >
> > But then you need to keep around the previous documents and/or the
> > statistics derived from them to do the comparison. What would be
> > nicer would be a stateless hash function characterising the document
> > and retaining the ability to make reasonably accurate document
> > similarity estimations based on the outputs only.
>
> I don't see how the hash suggestion is more stateless than keeping
> around statistics derived from previously posted documents.
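For concreteness, one plausible reading of the letter-triple metric mentioned above (my illustrative sketch, not a specific proposal from the thread) is Jaccard overlap between the two documents' trigram sets -- note this is the stateful approach, since the trigram sets themselves must be kept around:

```python
# Illustrative sketch of a letter-triple (trigram) similarity metric:
# Jaccard overlap of the two documents' trigram sets.  Stateful -- the
# trigram sets (or the documents) must be retained for later comparison.
def trigrams(text):
    """Return the set of overlapping 3-letter substrings of text."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def trigram_similarity(a, b):
    """|A & B| / |A | B| over letter trigrams (1.0 = identical sets)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0
```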
I tend to agree.  One distinction is that the digest comparison
function can be very simple, say counting the number of matching
bits.  However, it seems harder to design digest functions with such
simple comparison functions.
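A digest with that kind of bit-count comparison might look something like the following sketch (my own illustration of the idea, assuming a simhash-style construction over letter trigrams -- not the scheme anyone in the thread actually proposed):

```python
# Hypothetical sketch: a fixed-size digest built from letter trigrams,
# where each trigram's hash "votes" on every digest bit.  Two digests
# are then compared simply by counting the bits on which they agree.
import hashlib

def trigram_digest(text, bits=64):
    """Build a bits-wide digest from per-trigram bit votes."""
    votes = [0] * bits
    t = text.lower()
    for i in range(len(t) - 2):
        # Hash each trigram to a 64-bit integer (md5 used for convenience).
        h = int.from_bytes(hashlib.md5(t[i:i + 3].encode()).digest()[:8], "big")
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    # A digest bit is set where the positive votes outnumber the negative.
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def bit_similarity(d1, d2, bits=64):
    """Fraction of digest bits on which the two digests agree."""
    return (bits - bin(d1 ^ d2).count("1")) / bits
```

Because similar documents share most of their trigrams, their vote totals -- and hence their digest bits -- mostly coincide, so the simple bit count serves as the similarity estimate.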
The example digest function I gave had a more complex comparison
function (involving diff etc.), but was more a digested representation
of the document than statistical information about the document.
An interesting problem is that the comparison function (whether based
on statistics or digests) is going to be used to pair-wise compare
100s or 1000s of documents, so you ideally want a parallel
comparison algorithm that can efficiently compare multiple
digests/sets of statistics in parallel.
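With fixed-size bit digests the pairwise comparison does vectorise cheaply; a sketch (assuming 64-bit digests and numpy, purely as an illustration of the batching) might be:

```python
# Hypothetical sketch: compare N fixed-size (64-bit) digests pairwise
# in one shot, XOR-ing and popcounting whole arrays instead of looping
# over each pair of documents.
import numpy as np

def pairwise_bit_similarity(digests, bits=64):
    """Return an NxN matrix of per-pair bit-agreement fractions."""
    d = np.asarray(digests, dtype=np.uint64)
    x = d[:, None] ^ d[None, :]                      # all pairwise XORs
    # Popcount each 64-bit word by unpacking its bytes into bits.
    by = x.view(np.uint8).reshape(len(d), len(d), 8)
    differing = np.unpackbits(by, axis=-1).sum(axis=-1)
    return (bits - differing) / bits
```

The same idea extends to comparing one new document's digest against the stored digests of all previously posted documents in a single vector operation.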
Adam