[107615] in Cypherpunks


document similarity digest functions (Re: attacks and counter-measures)

daemon@ATHENA.MIT.EDU (Adam Back)
Tue Jan 19 16:37:31 1999

Date: Tue, 19 Jan 1999 21:10:13 GMT
From: Adam Back <aba@dcs.ex.ac.uk>
To: sen_ml@eccosys.com
Cc: remailer-operators@anon.lcs.mit.edu
Cc: cypherpunks@cyberpass.net
In-reply-to: <19990119110247T.sen_ml@eccosys.com>
Reply-To: Adam Back <aba@dcs.ex.ac.uk>


On remailer-operators I proposed a specialised hash function for use
in place of md5 for detecting and rejecting duplicates:

> > 1.2.1 (proposed) counter-measure to 1.2: use specialised hash
> >       function in place of md5 which produces the same hash for
> >       similar documents to within some tunable metric of document
> >       similarity.  One would want accidental collisions to be
> >       unlikely but collisions on similar documents to be hard to
> >       avoid.

Sen writes on remop:
> sorry about straying from the main point of your post, but:
> 
> any good candidates for this specialised hash function?

An exercise for the reader :-) That is, no.  But it shouldn't be that
hard to have a stab at it.

A stateful method of doing this would be to use one of the
letter-triple document similarity metrics to compare the given message
for closeness to previously posted documents.

But then you need to keep around the previous documents and/or the
statistics derived from them to do the comparison.  Nicer would be a
stateless hash function characterising the document, retaining the
ability to make reasonably accurate document similarity estimations
from the outputs alone.

First we would need to draw up a requirements spec for the function
and then design (or hack-up) a reasonably space and time efficient
function to implement those requirements.

One comment is that we don't necessarily care whether the hash outputs
for similar documents (similar documents X and X' denoted X ~= X') are
identical or merely similar, as we can adjust our usage of the hash.

i.e. we could envisage a similarity digest function f such that if
X ~= X' it would be likely that f(X) ~= f(X').  For this type of
similarity function we can choose our threshold for similarity by
setting a threshold for the number of identical bits in f(X) and
f(X').  One could term this type of document similarity function a
stateless, similar-output DS function.
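The comparison step for a similar-output DS function might then be
sketched as follows (the threshold of 224 out of 256 bits is an
arbitrary choice for illustration):

```python
def close_enough(fx, fy, threshold=224):
    """Compare two equal-length digest bit strings; declare the
    underlying documents similar if at least `threshold` of the
    bit positions agree."""
    assert len(fx) == len(fy)
    matching = sum(a == b for a, b in zip(fx, fy))
    return matching >= threshold
```

Tuning the threshold trades false positives (distinct documents
rejected as duplicates) against false negatives (near-duplicates
slipping through).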

Or alternatively we could envisage a similarity function g taking a
threshold parameter t such that given X ~= X', the probability that
g(t,X) = g(t,X') increases with t.  One could term this type of
document similarity function a stateless, identical-output DS
function.
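One construction with exactly this shape (my suggestion, not something
from the proposal above) is Broder's min-wise hashing (MinHash):
each salted min-hash over a document's word set matches between two
documents with probability equal to the Jaccard similarity of the word
sets, so the chance that the whole signature matches grows as the
number of components k shrinks -- and t can simply select a smaller k:

```python
import hashlib

def minhash_signature(text, k):
    """k salted min-hashes over the document's word set.  Each
    component matches between similar documents with probability
    equal to the Jaccard similarity of the word sets, so the whole
    signature matches with that value to the power k."""
    words = set(text.split())
    return tuple(
        min(hashlib.sha1(f"{i}:{w}".encode()).hexdigest() for w in words)
        for i in range(k)
    )

def g(t, text, k_max=16):
    """Identical-output DS function: larger t -> fewer components ->
    higher probability that g(t,X) = g(t,X') for similar X, X'."""
    k = max(1, k_max - t)
    return minhash_signature(text, k)
```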

Say for example that we wanted to design a similar output DS function
f which would result in one being able to ask questions about the
fraction of equivalent lines of two documents X and X' given only the
outputs f(X) and f(X').

A first stab might be to decide an output size of, say, 32 bytes (256
bits).  Then compute group size = ceil(output size / #lines) (#lines is
the number of lines in the document, ceil is the round-up function,
output size is the chosen output size in bits).  Then perform a sha1
on each of the n document chunks, where n = ceil(output size / group
size).

Truncate the sha1'd document chunk outputs to group bits and
concatenate them.
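A minimal sketch of that scheme in Python (the names and the
bit-string representation are my own; one chunk per line is assumed):

```python
import hashlib
from math import ceil

def similarity_digest(text, output_bits=256):
    """Per-line sha1 digests, each truncated to `group` bits and
    concatenated into one bit string."""
    lines = text.splitlines()
    group = ceil(output_bits / len(lines))  # bits allocated per line
    digest = ""
    for line in lines:
        h = int(hashlib.sha1(line.encode()).hexdigest(), 16)
        # render the 160-bit sha1 as binary and keep the first group bits
        digest += format(h, "0160b")[:group]
    return digest
```

With a four-line document, group = ceil(256/4) = 64 bits, i.e. the
first 16 hex digits of each line's sha1.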

This would cope well enough if the documents were the same size.  Take
for example the two documents:

X
----------------------------------------------------------------------
The quick
brown fox
jumped over
the lazy dog.
----------------------------------------------------------------------

X'
----------------------------------------------------------------------
The speedy
brown fox
jumped over
the lazy dog.
----------------------------------------------------------------------

the digests would be (excluding line feeds):

sha1(The quick) =	30b59bc6c7c16222 83a23950f2984bc5cd4fdd51
sha1(brown fox) =	c6c485beff580184 3f909254c5405ab2a082ce51
sha1(jumped over) =	a59d17dccd51d18f 0eece0d07ab0c3d86dbbcc72
sha1(the lazy dog.) =	0df9e6fcb711cc6a 8267c700617adc2cb619e204
sha1(The speedy) =	cfeae2e044c80e0b 972b10a7a0b4ab56a7607574

So the digest outputs are:

f(X) =  30b59bc6c7c16222 c6c485beff580184 a59d17dccd51d18f 0df9e6fcb711cc6a
f(X') = cfeae2e044c80e0b c6c485beff580184 a59d17dccd51d18f 0df9e6fcb711cc6a

The digest function has a number of obvious shortfalls:

If X differs from X' by an inserted line, all lines after the
insertion are considered different.

Documents of slightly different sizes are not coped with well because
the digested information would be different.  The problem is that it
appears difficult to recover from differences.  This might be
simplified if the document size were considered part of the output, as
then the comparison function could try to re-sync by scanning forward
in multiples of group-size bits for the same bit sequence, and doing a
bit compare of the number of bits allocated.  (I'll try this out
next.)
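The re-sync idea might be sketched as a greedy forward scan (a
hypothetical comparison; `group` is the per-chunk bit allocation,
assumed known to both sides, and the digests below are illustrative
strings rather than real sha1 truncations):

```python
def resync_compare(digest_a, digest_b, group):
    """Fraction of a's groups that can be matched against b's groups
    in order, scanning forward in b to re-sync past inserted lines."""
    ga = [digest_a[i:i + group] for i in range(0, len(digest_a), group)]
    gb = [digest_b[i:i + group] for i in range(0, len(digest_b), group)]
    matches, j = 0, 0
    for chunk in ga:
        for k in range(j, len(gb)):
            if gb[k] == chunk:   # same group value found later in b
                matches += 1
                j = k + 1        # resume scanning after the match
                break
    return matches / max(len(ga), len(gb))
```

An inserted line then costs only its own group rather than
desynchronising everything after it.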

See if you can think of some strategies for improving on the above
document similarity functions.

Adam

