[107618] in Cypherpunks
Re: document similarity digest functions
daemon@ATHENA.MIT.EDU (Jim Gillogly)
Tue Jan 19 17:28:00 1999
Date: Tue, 19 Jan 1999 14:07:46 -0800
From: Jim Gillogly <jim@acm.org>
To: cypherpunks@cyberpass.net
Reply-To: Jim Gillogly <jim@acm.org>
Adam Back suggests:
> A stateful method of doing this would be to use one of the letter
> triple document similarity metrics, to compare the given message for
> closeness to previously posted documents.
>
> But then you need to keep around the previous documents and/or the
> statistics derived from them to do the comparison. What would be
> nicer would be a stateless hash function characterising the document
> and retaining the ability to make reasonably accurate document
> similarity estimations based on the outputs only.
I don't see how the hash suggestion is more stateless than keeping
around statistics derived from previously posted documents. If the
total length of the statistics is less than the total hashage Adam
suggests (32 bytes), then stats would be better, since they could
distinguish copies that were purposely slightly altered. As we've
seen, spam and spam prevention are not static: if the spammers need
to insert a random number into each fifth of the message to get the
messages through, they will.
I'd envision something like Arthur Samuel did for his "signature table"
checkers learning program back in the -- umm, was it -- 60's: pick a
bunch of metrics that look plausible and collect them, then do a
multi-dimensional distance comparison (suitably weighted) from the
target sample to the candidates. This is a bit more work than the
straight compares, but not much more.
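A minimal sketch of that weighted multi-dimensional comparison (the
feature vectors and weights here are made-up illustrations, not
anything from Samuel's paper):

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two numeric feature vectors.

    a, b    -- equal-length lists of feature values
    weights -- per-feature weights (larger = feature matters more)
    """
    return math.sqrt(sum(w * (x - y) ** 2
                         for x, y, w in zip(a, b, weights)))

# Hypothetical: find the candidate closest to the target message.
target = [3, 1, 5]
candidates = {"msg1": [3, 1, 5], "msg2": [9, 0, 2]}
weights = [1.0, 0.5, 2.0]
closest = min(candidates,
              key=lambda k: weighted_distance(target, candidates[k], weights))
```

The straight compare is just a special case: identical messages come
out at distance zero.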
I think a relatively small number of "features" could be used to
pretty well identify the seventh (say) copy of most repeated messages.
Off the top of my head:
- length of body
- number of digits in body
- number of lower case in body
- number of upper case in body
- number of spaces in body
- number of words in all upper case
- number of newlines in body
- number of other printables in body
- short hash of most recent hop
- short hash of phone number(s) (smushed together, perhaps)
- other measures seasoned to taste
The counts might be stored as the integer parts of their binary
logs, for example, to cut down the total length.
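The feature list above might be collected like this (a sketch; the
function names and the exact bucketing are my own, and I've only
included the counting features, not the hashes):

```python
import math

def log2_bucket(n):
    """Integer part of the binary log of a count, with 0 mapped to 0."""
    return 0 if n == 0 else int(math.log2(n))

def features(body):
    """Small numeric feature vector for a message body, with each
    count compressed to the integer part of its binary log."""
    return [
        log2_bucket(len(body)),                                # length of body
        log2_bucket(sum(c.isdigit() for c in body)),           # digits
        log2_bucket(sum(c.islower() for c in body)),           # lower case
        log2_bucket(sum(c.isupper() for c in body)),           # upper case
        log2_bucket(body.count(" ")),                          # spaces
        log2_bucket(sum(w.isupper() for w in body.split())),   # all-caps words
        log2_bucket(body.count("\n")),                         # newlines
    ]
```

With log bucketing, each count fits in a few bits, and a message
that grows by a handful of characters usually lands in the same
bucket as the original.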
The numeric values would be "distanced" and the hashes would be
"binaried" (with appropriate weights). In the Samuel paper the
weights were learned: during the teaching phase the tutor would
tell the program which messages were similar, and the signature
table mechanism would tune the weights to fit. I wouldn't think
anything so elaborate would be necessary here.
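One way the "distanced"/"binaried" split might look in code -- the
numeric features contribute a weighted difference, while a hash
feature contributes zero on exact match and its full weight on
mismatch. (The weighting scheme and names here are illustrative;
nothing so elaborate as Samuel's learned weights.)

```python
def similarity_distance(a, b, weights, hash_indices):
    """Mixed distance over numeric and short-hash features.

    a, b         -- equal-length feature vectors
    weights      -- per-feature weights
    hash_indices -- set of positions holding short hashes, which are
                    compared as match/no-match rather than numerically
    """
    total = 0.0
    for i, (x, y, w) in enumerate(zip(a, b, weights)):
        if i in hash_indices:
            total += 0.0 if x == y else w   # "binaried" comparison
        else:
            total += w * abs(x - y)          # "distanced" comparison
    return total
```

Messages below some tuned distance threshold would then be treated
as repeats.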
--
Jim Gillogly
Sterday, 28 Afteryule S.R. 1999, 21:52
12.19.5.15.13, 13 Ben 6 Muan, Seventh Lord of Night