[107620] in Cypherpunks
preprocessing for similarity preserving digest fns
daemon@ATHENA.MIT.EDU (Adam Back)
Tue Jan 19 18:31:51 1999
Date: Tue, 19 Jan 1999 23:15:56 GMT
From: Adam Back <aba@dcs.ex.ac.uk>
To: cypherpunks@cyberpass.net
Cc: remailer-operators@anon.lcs.mit.edu
Reply-To: Adam Back <aba@dcs.ex.ac.uk>
For email applications one might like to pre-process the text files to
ignore blank lines, ignore leading spaces, and remove common message
quoting (i.e. strip leading '> ', '>> ' etc).
Then documents such as:
======================================================================
I think blah on topic foo...
[100 lines of opinion]
======================================================================
======================================================================
me too!
> I think blah on topic foo...
> [100 lines of opinion]
======================================================================
would be correctly considered very similar. Unsolicited marketing
form letters such as:
======================================================================
Dear Fred <fred@email1>
Here is an opportunity you can not afford to miss...
[100 lines of s**m]
======================================================================
======================================================================
Dear Fred Bloggs <fred@email2>,
Here is an opportunity you can not afford to miss...
[100 lines of s**m]
======================================================================
would similarly be flagged as very similar.
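A minimal sketch of the pre-processing step in Python, for
illustration only. The function name preprocess and the quote-prefix
pattern are my own assumptions, not part of any existing tool; it
drops blank lines, strips leading (and, as a side effect, trailing)
whitespace and removes leading '>' quote markers:

```python
import re

# Assumed quote convention: one or more '>' markers, each optionally
# preceded by whitespace, at the start of the line ('> ', '>> ', '> > ').
_QUOTE_PREFIX = re.compile(r'^(?:\s*>)+\s*')

def preprocess(text):
    """Return the normalized, non-blank lines of a message body."""
    lines = []
    for raw in text.splitlines():
        # Remove quote markers first, then surrounding whitespace.
        line = _QUOTE_PREFIX.sub('', raw).strip()
        if line:  # ignore blank lines (including bare '>' lines)
            lines.append(line)
    return lines

original = "I think blah on topic foo...\n[100 lines of opinion]\n"
reply = ("me too!\n"
         "> I think blah on topic foo...\n"
         "> [100 lines of opinion]\n")

# After pre-processing, the reply is the original plus one extra line.
print(preprocess(original))
print(preprocess(reply))
```

The form letter case is not helped by this step alone: the differing
"Dear Fred" lines survive pre-processing, and it is the similarity
preserving digest that absorbs that one-line difference.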
One can share digests without breaching privacy or revealing content,
whilst still allowing recipients to recognize sufficiently similar
documents. This makes it easier to, for example, recognize form junk
mail, and to counter simple attempts to by-pass replicated posting
systems based on cryptographic message digests.
One problem is that an attacker could brute force lines of text to
reverse the hash. Using few output bits per line (say < 4) reduces this
risk, because naturally occurring collisions are then frequent and many
different lines map to the same value.
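To make the few-bits-per-line idea concrete, here is a rough sketch
in Python. It is not necessarily the digest function referred to in
this thread: the use of SHA-256, the choice of 3 bits, and the
comparison via difflib are all assumptions of mine, and the names
line_digest and similarity are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher

BITS_PER_LINE = 3  # assumed value, following "say < 4" above

def line_digest(lines, bits=BITS_PER_LINE):
    """Map each line to a small integer of `bits` bits."""
    mask = (1 << bits) - 1
    # Keep only the low bits of the first hash byte, so that many
    # different lines collide on the same value.
    return [hashlib.sha256(line.encode('utf-8')).digest()[0] & mask
            for line in lines]

def similarity(a, b):
    """Ratio in [0, 1] of matching digest values, order-sensitive."""
    # autojunk=False: with so few distinct values every value is
    # "popular", and the default heuristic would discard them all
    # on long inputs.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

doc = ["I think blah on topic foo...", "[100 lines of opinion]"]
metoo = ["me too!"] + doc

print(line_digest(doc))
print(similarity(line_digest(doc), line_digest(metoo)))
```

With only 8 possible values per line, unrelated documents will also
share some values by chance, so any threshold on the ratio would need
tuning on real mail. That noise is the price of making brute force
reversal ambiguous.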
I suspect one could drop a function such as the above into a
procmail based recipe for killing duplicates and not miss much of
interest (other than people who have a habit of breaching quoting
etiquette by quoting 100s of lines of text to give 1 liner replies
(often not a big loss anyway) -- tho' perhaps the surplus quoted
material could be removed by a pre-processing stage which looks for
quoted text with no interspersed comments that exceeds the new text).
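One possible reading of that last pre-processing stage, as a Python
sketch. The rule used here (exactly one contiguous block of quoted
lines, longer than the new text) and the name strip_surplus_quotes
are my assumptions about what "no interspersed comments" would mean
in practice:

```python
def strip_surplus_quotes(text):
    """Drop quoted lines when they form one uninterrupted block
    that is longer than the new text; otherwise keep everything."""
    lines = [l for l in text.splitlines() if l.strip()]
    quoted = [l.lstrip().startswith('>') for l in lines]
    n_quoted = sum(quoted)
    n_new = len(lines) - n_quoted
    # Count runs of consecutive quoted lines; more than one run means
    # the author interspersed comments between quoted passages.
    runs = sum(1 for i, q in enumerate(quoted)
               if q and (i == 0 or not quoted[i - 1]))
    if runs == 1 and n_quoted > n_new:
        return [l for l, q in zip(lines, quoted) if not q]
    return lines

bulk = "me too!\n> line one\n> line two\n> line three\n"
threaded = "> point one\nagreed\n> point two\nnot so sure\n> point three\n"

print(strip_surplus_quotes(bulk))      # quoted block removed
print(strip_surplus_quotes(threaded))  # left untouched
```

This would run before quote markers are stripped, so that the 1 liner
reply is digested on its own text rather than on the 100s of lines it
quotes.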
Adam