[107625] in Cypherpunks

home help back first fref pref prev next nref lref last post

Re: doc. similarity fns, recovering from differences (Re: document similarity digest functions) (Re: attacks and counter-measures)

daemon@ATHENA.MIT.EDU (Igor Chudov @ home)
Tue Jan 19 20:18:01 1999

To: aba@dcs.ex.ac.uk (Adam Back)
Date: Tue, 19 Jan 1999 18:54:31 -0600 (CST)
Cc: sen_ml@eccosys.com, remailer-operators@anon.lcs.mit.edu,
        cypherpunks@cyberpass.net
In-Reply-To: <199901192241.WAA13629@server.eternity.org> from "Adam Back" at Jan 19, 99 10:41:13 pm
From: ichudov@Algebra.Com (Igor Chudov @ home)
Reply-To: ichudov@Algebra.Com (Igor Chudov @ home)

Adam Back wrote:
> I wrote:
> 
> > The problem is that it appears difficult to recover from
> > differences.  This might be simplified if the document size were
> > considered as part of the output, as then the comparison function
> > can try to re-sync by scanning forward in in multiples of group bits
> > for the same bit sequence, and doing a bit compare of the number of
> > bits allocated.  (I'll try this out next).
> 
> So lets consider documents X and X' which are 4 and 5 lines
> respectively where the difference is that there is an inserted line.
> 
> X
> ----------------------------------------------------------------------
> The quick
> brown fox
> jumped over
> the lazy dog.
> ----------------------------------------------------------------------
> 
> X'
> ----------------------------------------------------------------------
> Test
> The quick
> brown fox
> jumped over
> the lazy dog.
> ----------------------------------------------------------------------
> 
> Using the same output size as before (256 bits), the group of bits per
> line for X is 64 and for X' is 52.
> 
> The line hashes are:
> 
> sha1(The quick) =	30b59bc6c7c16 222 83a23950f2984bc5cd4fdd51
> sha1(brown fox) =	c6c485beff580 184 3f909254c5405ab2a082ce51
> sha1(jumped over) =	a59d17dccd51d 18f 0eece0d07ab0c3d86dbbcc72
> sha1(the lazy dog.) =	0df9e6fcb711c c6a 8267c700617adc2cb619e204
> sha1(Test) =		640ab2bae07be dc4 c163f679a746f7ab7fb5d1fa
> 
> and the outputs are:
> 
> f(X) = 4, 30b59bc6c7c16222 c6c485beff580184 a59d17dccd51d18f 0df9e6fcb711cc6a
> f(X')= 5, 640ab2bae07be 30b59bc6c7c16 c6c485beff580 a59d17dccd51d 0df9e6fcb711
> 
> the comparison function can then use the line count information to
> work out how far to skip (64 bits and 52 bites respectively) and how
> many bits to compare (52 bits).  (The last output is truncated to 256
> bits if it would be longer).  The comparison algorithm can then do
> similar analysis to Larry Wall's diff program: put the above on
> separate lines and truncate all lines to the shortest line length, and
> you can do things like:
> 
> % cat X
> 30b59bc6c7c1
> c6c485beff58
> a59d17dccd51
> 0df9e6fcb711
> % cat Xp
> 640ab2bae07b
> 30b59bc6c7c1
> c6c485beff58
> a59d17dccd51
> 0df9e6fcb711
> % diff X Xp
> 0a1
> > 640ab2bae07b
> 
> which tells you that one line was added.  Or: 
> 
> % diff X Xp | grep -v '[<>]' | wc -l
> 1
> 
> which tells you how many differences there were in the default `diff'
> sense of differences.
> 
> The larger the document the less accurate the comparisons that can be
> made based on the digest function outputs.  The above digest scheme
> also allows different sized outputs to be compared so one could have a
> variable output size, with larger outputs for larger files, say
> together with imposing some maximum output size.

Adam,

An obvious way to defeat this function is to insert random spaces,
or perhaps random characters/letters into every line.

Another consideration here is that a sufficiently powerful comparison
function may actually give false positives. For instance, it would
treat one-line followups as identical messages.

	- Igor.


home help back first fref pref prev next nref lref last post