Perl-Users Digest, Issue: 3870 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sun Jan 27 21:09:17 2013

Date: Sun, 27 Jan 2013 18:09:05 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Sun, 27 Jan 2013     Volume: 11 Number: 3870

Today's topics:
    Re: Trouble with embedded whitespace in filenames using File::Find <hjp-usenet2@hjp.at>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Sun, 27 Jan 2013 23:39:30 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Trouble with embedded whitespace in filenames using File::Find
Message-Id: <slrnkgbb52.n1g.hjp-usenet2@hrunkner.hjp.at>

On 2013-01-24 15:11, Charlton Wilbur <cwilbur@chromatico.net> wrote:
>>>>>> "RW" == Rainer Weikusat <rweikusat@mssgmbh.com> writes:
>    RW> MD5 (or any other hashing algorithm) is a lot more expensive
>    RW> than a comparison and especially so if MD5 needs to process 2G
>    RW> of data while the comparison would only need 8K.
>
> You make several unfounded assumptions here.
[...]
> Two, that the number of comparisons is small.  The more comparisons you
> have, the more the advantage goes to the hashing algorithm.  If you have
> 2 files, it is best to read the first 8K of each and compare them,
> since, as you note, odds are that any differences will appear early on.
> If you have 1000 files, reading the first 8K of each file for
> comparison purposes means a great deal of seeking and reading;

It's about the same amount of seeking and a lot less reading than
computing a hash of each of the 1000 files. At least if the files are a
lot larger than 8k.
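
A rough sketch of that first pass (in Perl; @same_size is a made-up
name for one group of files of equal size):

    # One open and one 8K read per file -- nowhere near reading
    # each file in full, which is what hashing would require.
    my %by_prefix;
    for my $file (@same_size) {
        open my $fh, '<:raw', $file or die "open $file: $!";
        read($fh, my $prefix, 8192) // die "read $file: $!";
        close $fh;
        push @{ $by_prefix{$prefix} }, $file;
    }
    # Files whose first 8K differ cannot be identical; only buckets
    # with more than one entry need any further attention.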


> and then you either store the first 8K, leading to a large working set
> (and the first time you swap, you've lost anything you won by avoiding
> calculating hashes),

8k * 1000 is 8 MB. That's negligible. And you only have to store this if
there are actually 1000 files of the same size.

There is also a hybrid approach:

For each group of files of the same size, you could initially read only
the first 8k (or some other size large enough to find the first
difference with a high probability, but small enough to be dwarfed by
the overhead of open(2)), and if those are identical, switch to
computing a hash (and as Ben said, you can use something like SHA512 -
where a collision is IMHO less likely than a false positive due to a
hardware or software error).
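
A sketch of that hybrid (Perl again; %by_prefix is the 8K-prefix
bucketing from the snippet above, mapping a prefix to an arrayref of
file names, and SHA-512 comes from the Digest::SHA module):

    use Digest::SHA;

    # Only prefix buckets with more than one file are worth hashing.
    for my $maybe_dups (grep { @$_ > 1 } values %by_prefix) {
        my %by_digest;
        for my $file (@$maybe_dups) {
            # Hash the whole file; "b" means binary mode.
            my $sha = Digest::SHA->new(512);
            push @{ $by_digest{ $sha->addfile($file, "b")->hexdigest } }, $file;
        }
        for my $dups (grep { @$_ > 1 } values %by_digest) {
            print "apparently identical:\n";
            print "  $_\n" for @$dups;
        }
    }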

	hp


-- 
   _  | Peter J. Holzer    | The curse of electronic word processing:
|_|_) | Sysadmin WSR       | you keep filing away at your text until
| |   | hjp@hjp.at         | the parts of the sentence no longer fit
__/   | http://www.hjp.at/ | together. -- Ralph Babel


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3870
***************************************

