[107624] in Cypherpunks
Re: document similarity digest functions (Re: attacks and counter-measures)
daemon@ATHENA.MIT.EDU (Adam Back)
Tue Jan 19 19:08:46 1999
Date: Tue, 19 Jan 1999 23:43:28 GMT
From: Adam Back <aba@dcs.ex.ac.uk>
To: ant@notatla.demon.co.uk
Cc: remailer-operators@anon.lcs.mit.edu, cypherpunks@cyberpass.net
In-reply-to: <199901192246.WAA04229@notatla.demon.co.uk> (message from
Antonomasia on Tue, 19 Jan 1999 22:46:48 GMT)
Reply-To: Adam Back <aba@dcs.ex.ac.uk>
Ant writes on remop:
> How about this ?
>
> 1) Feed all docs being processed through a formatter to
> remove all whitespace (space,tab,nl,cr). Maybe squash cases.
>
> 2) Use some form of chunk recognition to slice the doc into
> moderate sized chunks. Something like paragraphs, but based
> on the data after step 1.
I'd guess that approach (paragraph-based selection of chunks) might
help in recovering from differences, as the boundaries are more likely
to correspond to intentional, human-inserted breaks than to line breaks.
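Steps 1 and 2 might look something like this -- a minimal sketch, with
fixed-size slicing standing in for the paragraph-like chunk recognition
(function names and the chunk size are my own):

```python
import re

def normalize(doc: str) -> str:
    """Step 1: strip all whitespace (space, tab, nl, cr) and squash case."""
    return re.sub(r"\s+", "", doc).lower()

def chunk(text: str, size: int = 64) -> list:
    """Step 2: slice the normalized text into moderate-sized chunks.
    (Fixed-size slices here; a real implementation would pick
    paragraph-like boundaries from the data itself.)"""
    return [text[i:i + size] for i in range(0, len(text), size)]
```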
> This could be delimiting chunks by
> short strings kept (secretly) in a local file. These would not
> need to be shared between remailers and could undergo gradual
> change. The strings might be a few characters long and selected
> from past postings at random to represent what is found in
> real traffic.
Interesting idea: a keyed similarity-preserving message digest
function to frustrate an attacker trying to avoid creating collisions,
with the key selecting where the chunks start. An attacker who doesn't
know where the chunks start would find it more difficult to introduce
differences.
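The keyed chunking could be sketched like this -- splitting wherever
one of the secret delimiter strings occurs (the delimiter values below
are hypothetical placeholders, not anything a real remailer would use):

```python
# Secret delimiter strings acting as the key -- in practice drawn at
# random from past traffic and kept in a local file, not shared.
SECRET_DELIMS = ("the", "ing", "ion")

def keyed_chunks(text: str) -> list:
    """Split text into chunks ending at each secret delimiter.
    An attacker who doesn't know SECRET_DELIMS can't predict the
    chunk boundaries."""
    chunks, start, i = [], 0, 0
    while i < len(text):
        for d in SECRET_DELIMS:
            if text.startswith(d, i):
                chunks.append(text[start:i + len(d)])
                i += len(d)
                start = i
                break
        else:
            i += 1
    if start < len(text):
        chunks.append(text[start:])
    return chunks
```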
Another measure might be to rate a document's repetitiveness -- for
example by checking the similarity of the first and second half of the
document, or of the first and second half of the similarity-preserving
message digest function, if its output allows this. (A person trying
to insert garbage often does a cut and paste from somewhere -- the
current document is most conveniently at hand.)
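Given per-chunk hashes, the half-against-half comparison could be as
simple as (a sketch; the scoring is my own choice):

```python
def repetitiveness(chunk_hashes: list) -> float:
    """Fraction of second-half chunk hashes already present in the
    first half -- a high score suggests the sender padded the message
    by cut-and-pasting the document into itself."""
    half = len(chunk_hashes) // 2
    if half == 0:
        return 0.0
    first = set(chunk_hashes[:half])
    second = chunk_hashes[half:]
    return sum(1 for h in second if h in first) / len(second)
```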
Another measure might be a keyboard `splurge' detector -- try to
measure whether someone is randomly dragging their fingers across the
keyboard, e.g. sequences like asdf (for touch typists :-)
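One crude way to detect such splurges -- assuming a QWERTY layout, and
scoring the fraction of letter pairs that sit on physically adjacent
keys (the threshold for flagging would be tuned against real traffic):

```python
# Same-row key adjacency on a QWERTY keyboard (an assumed, simplified
# model -- ignores cross-row neighbours like 'q'/'a').
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
ADJACENT = set()
for row in ROWS:
    for a, b in zip(row, row[1:]):
        ADJACENT.add(a + b)
        ADJACENT.add(b + a)

def splurge_score(text: str) -> float:
    """Fraction of adjacent-letter pairs that are neighbouring keys;
    finger-dragging like 'asdf' scores near 1.0, English prose low."""
    pairs = [text[i:i + 2] for i in range(len(text) - 1)]
    pairs = [p for p in pairs if p.isalpha()]
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if p.lower() in ADJACENT) / len(pairs)
```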
> 3) Hash each chunk and store the document description as the list
> of hashes, with an expiry period (say from now to now +3 weeks).
> This would allow FAQs posted monthly to get out each time, and those
> weekly to get out monthly or after significant change.
> If size of the hash collection is thought important it's probably
> safe in this case to fold or truncate the md5. False matches of
> a single chunk will be unimportant.
Folding could also be a feature if you were intending to share the
hash outputs without leaking the message content to brute force
attacks.
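Both variants -- truncating and folding the MD5 -- are easy to state
concretely (a sketch; the truncation length is my own pick):

```python
import hashlib

def chunk_digest(chunk: str, nbytes: int = 4) -> bytes:
    """Truncated MD5 of a chunk. False matches on a single chunk are
    unimportant, since rejection requires many chunks to match."""
    return hashlib.md5(chunk.encode()).digest()[:nbytes]

def fold_digest(chunk: str) -> bytes:
    """Alternative: fold the 16-byte MD5 in half with XOR, which mixes
    all output bits rather than discarding half of them."""
    d = hashlib.md5(chunk.encode()).digest()
    return bytes(a ^ b for a, b in zip(d[:8], d[8:]))
```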
> 4) Comparison of a new doc to the records of previous docs would
> result in rejection if some high-ish fraction of the chunks
> matched those of a previous doc. (Order should probably matter.)
>
> Measures by the multi-poster clearly include adding large, varying
> blocks of padding to the messages. I doubt this is beatable.
My message crossed with yours: I reckon the count of `diff'
differences would still consider messages with large chunks of
consecutive garbage inserted as similar.
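The step-4 rejection test could be sketched as follows -- the 0.8
threshold is my own guess at "high-ish", and this simple version
ignores chunk order:

```python
def is_duplicate(new_hashes: list, old_hashes: list,
                 threshold: float = 0.8) -> bool:
    """Reject a new document if a high-ish fraction of its chunk
    hashes match those recorded for a previous document."""
    if not new_hashes:
        return False
    old = set(old_hashes)
    matched = sum(1 for h in new_hashes if h in old)
    return matched / len(new_hashes) >= threshold
```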
Adam