[107625] in Cypherpunks
Re: doc. similarity fns, recovering from differences (Re: document similarity digest functions) (Re: attacks and counter-measures)
daemon@ATHENA.MIT.EDU (Igor Chudov @ home)
Tue Jan 19 20:18:01 1999
To: aba@dcs.ex.ac.uk (Adam Back)
Date: Tue, 19 Jan 1999 18:54:31 -0600 (CST)
Cc: sen_ml@eccosys.com, remailer-operators@anon.lcs.mit.edu,
cypherpunks@cyberpass.net
In-Reply-To: <199901192241.WAA13629@server.eternity.org> from "Adam Back" at Jan 19, 99 10:41:13 pm
From: ichudov@Algebra.Com (Igor Chudov @ home)
Reply-To: ichudov@Algebra.Com (Igor Chudov @ home)
Adam Back wrote:
> I wrote:
>
> > The problem is that it appears difficult to recover from
> > differences. This might be simplified if the document size were
> > considered as part of the output, as then the comparison function
> > can try to re-sync by scanning forward in in multiples of group bits
> > for the same bit sequence, and doing a bit compare of the number of
> > bits allocated. (I'll try this out next).
>
> So lets consider documents X and X' which are 4 and 5 lines
> respectively where the difference is that there is an inserted line.
>
> X
> ----------------------------------------------------------------------
> The quick
> brown fox
> jumped over
> the lazy dog.
> ----------------------------------------------------------------------
>
> X'
> ----------------------------------------------------------------------
> Test
> The quick
> brown fox
> jumped over
> the lazy dog.
> ----------------------------------------------------------------------
>
> Using the same output size as before (256 bits), the group of bits per
> line for X is 64 and for X' is 52.
>
> The line hashes are:
>
> sha1(The quick) = 30b59bc6c7c16 222 83a23950f2984bc5cd4fdd51
> sha1(brown fox) = c6c485beff580 184 3f909254c5405ab2a082ce51
> sha1(jumped over) = a59d17dccd51d 18f 0eece0d07ab0c3d86dbbcc72
> sha1(the lazy dog.) = 0df9e6fcb711c c6a 8267c700617adc2cb619e204
> sha1(Test) = 640ab2bae07be dc4 c163f679a746f7ab7fb5d1fa
>
> and the outputs are:
>
> f(X) = 4, 30b59bc6c7c16222 c6c485beff580184 a59d17dccd51d18f 0df9e6fcb711cc6a
> f(X')= 5, 640ab2bae07be 30b59bc6c7c16 c6c485beff580 a59d17dccd51d 0df9e6fcb711
>
> the comparison function can then use the line count information to
> work out how far to skip (64 bits and 52 bites respectively) and how
> many bits to compare (52 bits). (The last output is truncated to 256
> bits if it would be longer). The comparison algorithm can then do
> similar analysis to Larry Wall's diff program: put the above on
> separate lines and truncate all lines to the shortest line length, and
> you can do things like:
>
> % cat X
> 30b59bc6c7c1
> c6c485beff58
> a59d17dccd51
> 0df9e6fcb711
> % cat Xp
> 640ab2bae07b
> 30b59bc6c7c1
> c6c485beff58
> a59d17dccd51
> 0df9e6fcb711
> % diff X Xp
> 0a1
> > 640ab2bae07b
>
> which tells you that one line was added. Or:
>
> % diff X Xp | grep -v '[<>]' | wc -l
> 1
>
> which tells you how many differences there were in the default `diff'
> sense of differences.
>
> The larger the document the less accurate the comparisons that can be
> made based on the digest function outputs. The above digest scheme
> also allows different sized outputs to be compared so one could have a
> variable output size, with larger outputs for larger files, say
> together with imposing some maximum output size.
Adam,
An obvious way to defeat this function is to insert random spaces,
or perhaps random characters/letters into every line.
Another consideration here is that a sufficiently powerful comparison
function may actually give false positives. For instance, it would
treat one-line followups as identical messages.
- Igor.