[140768] in North American Network Operators' Group
Re: Had an idea - looking for a math buff to tell me if it's possible
daemon@ATHENA.MIT.EDU (Valdis.Kletnieks@vt.edu)
Wed May 18 19:44:30 2011
To: Heath Jones <hj1980@gmail.com>
In-Reply-To: Your message of "Thu, 19 May 2011 00:26:26 BST."
<BANLkTinBrmzFWi7y=1goAyHrT7dCi5SA8w@mail.gmail.com>
From: Valdis.Kletnieks@vt.edu
Date: Wed, 18 May 2011 19:42:43 -0400
Cc: nanog <nanog@nanog.org>
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org
--==_Exmh_1305762163_3336P
Content-Type: text/plain; charset=us-ascii
On Thu, 19 May 2011 00:26:26 BST, Heath Jones said:
> I wonder if this is possible:
>
> - Take a hash of the original file. Keep a counter.
> - Generate data in some sequential method on sender side (for example simply
> starting at 0 and iterating until you generate the same as the original
> data)
> - Each time you iterate, take the hash of the generated data. If it matches
> the hash of the original file, increment counter.
> - Send the hash and the counter value to recipient.
> - Recipient performs same sequential generation method, stopping when
> counter reached.
MD5 is a 128 bit hash.
2^128 is 340,282,366,920,938,463,463,374,607,431,768,211,456 - you're welcome
to iterate that many times to find a duplicate. You may get lucky and get a hit
in the first trillion or so attempts - but you may get unlucky and not get a
hit until the *last* few trillion attempts. On average you'll have to iterate
about half that huge number before you get a hit.
And it's lossy - if you hash all the possible 4K blocks with MD5, you'll find
that each of those 2^128 hashes has been hit about 256 times - and no
indication in the hash of *which* of the 256 colliding 4K blocks you have on
this iteration. (The only reason that companies can do block-level
de-duplication by saving a hash as an index to one copy shared by all blocks
with the same hash value is because you have a *very small* fraction of the
possibilities covered, so if you saved a 4K block of data from somebody's
system32 folder under a given MD5 hash, it's *far* more likely that another
block with that same hash is from another copy of another identical system32
folder, than it is an actual accidental collision.)
Protip: A good hash function is by definition one-way - given the data, it's
easy to generate the hash - but reversing it to find the "pre-image" (the data
that *generated* the hash) is massively difficult.
--==_Exmh_1305762163_3336P
Content-Type: application/pgp-signature
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Exmh version 2.5 07/13/2001
iD8DBQFN1FlzcC3lWbTT17ARAvJ0AJ9YA5bRvbb/6o4Ozdg7wE9I0F4dIACg4ZkA
nxDFYaY9/4gFuwoPLbHi564=
=4uhV
-----END PGP SIGNATURE-----
--==_Exmh_1305762163_3336P--