[107619] in Cypherpunks
doc. similarity fns, recovering from differences (Re: document similarity digest functions) (Re: attacks and counter-measures)
daemon@ATHENA.MIT.EDU (Adam Back)
Tue Jan 19 18:00:56 1999
Date: Tue, 19 Jan 1999 22:41:13 GMT
From: Adam Back <aba@dcs.ex.ac.uk>
To: sen_ml@eccosys.com
Cc: remailer-operators@anon.lcs.mit.edu, cypherpunks@cyberpass.net
In-reply-to: <199901192110.VAA11246@server.eternity.org> (message from Adam
Back on Tue, 19 Jan 1999 21:10:13 GMT)
Reply-To: Adam Back <aba@dcs.ex.ac.uk>
I wrote:
> The problem is that it appears difficult to recover from
> differences. This might be simplified if the document size were
> considered as part of the output, as then the comparison function
> can try to re-sync by scanning forward in in multiples of group bits
> for the same bit sequence, and doing a bit compare of the number of
> bits allocated. (I'll try this out next).
So lets consider documents X and X' which are 4 and 5 lines
respectively where the difference is that there is an inserted line.
X
----------------------------------------------------------------------
The quick
brown fox
jumped over
the lazy dog.
----------------------------------------------------------------------
X'
----------------------------------------------------------------------
Test
The quick
brown fox
jumped over
the lazy dog.
----------------------------------------------------------------------
Using the same output size as before (256 bits), the group of bits per
line for X is 64 and for X' is 52.
The line hashes are:
sha1(The quick) = 30b59bc6c7c16 222 83a23950f2984bc5cd4fdd51
sha1(brown fox) = c6c485beff580 184 3f909254c5405ab2a082ce51
sha1(jumped over) = a59d17dccd51d 18f 0eece0d07ab0c3d86dbbcc72
sha1(the lazy dog.) = 0df9e6fcb711c c6a 8267c700617adc2cb619e204
sha1(Test) = 640ab2bae07be dc4 c163f679a746f7ab7fb5d1fa
and the outputs are:
f(X) = 4, 30b59bc6c7c16222 c6c485beff580184 a59d17dccd51d18f 0df9e6fcb711cc6a
f(X')= 5, 640ab2bae07be 30b59bc6c7c16 c6c485beff580 a59d17dccd51d 0df9e6fcb711
the comparison function can then use the line count information to
work out how far to skip (64 bits and 52 bites respectively) and how
many bits to compare (52 bits). (The last output is truncated to 256
bits if it would be longer). The comparison algorithm can then do
similar analysis to Larry Wall's diff program: put the above on
separate lines and truncate all lines to the shortest line length, and
you can do things like:
% cat X
30b59bc6c7c1
c6c485beff58
a59d17dccd51
0df9e6fcb711
% cat Xp
640ab2bae07b
30b59bc6c7c1
c6c485beff58
a59d17dccd51
0df9e6fcb711
% diff X Xp
0a1
> 640ab2bae07b
which tells you that one line was added. Or:
% diff X Xp | grep -v '[<>]' | wc -l
1
which tells you how many differences there were in the default `diff'
sense of differences.
The larger the document the less accurate the comparisons that can be
made based on the digest function outputs. The above digest scheme
also allows different sized outputs to be compared so one could have a
variable output size, with larger outputs for larger files, say
together with imposing some maximum output size.
Adam