[32597] in Perl-Users-Digest
Perl-Users Digest, Issue: 3870 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sun Jan 27 21:09:17 2013
Date: Sun, 27 Jan 2013 18:09:05 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Sun, 27 Jan 2013 Volume: 11 Number: 3870
Today's topics:
Re: Trouble with embedded whitespace in filenames using <hjp-usenet2@hjp.at>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Sun, 27 Jan 2013 23:39:30 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Trouble with embedded whitespace in filenames using File::Find
Message-Id: <slrnkgbb52.n1g.hjp-usenet2@hrunkner.hjp.at>
On 2013-01-24 15:11, Charlton Wilbur <cwilbur@chromatico.net> wrote:
>>>>>> "RW" == Rainer Weikusat <rweikusat@mssgmbh.com> writes:
> RW> MD5 (or any other hashing algorithm) is a lot more expensive
> RW> than a comparison and especially so if MD5 needs to process 2G
> RW> of data while the comparison would only need 8K.
>
> You make several unfounded assumptions here.
[...]
> Two, that the number of comparisons is small. The more comparisons you
> have, the more the advantage goes to the hashing algorithm. If you have
> 2 files, it is best to read the first 8K of each and compare them,
> since, as you note, odds are that any differences will appear early on.
> If you have 1000 files, reading the first 8K of each file for
> comparison purposes means a great deal of seeking and reading;
It's about the same amount of seeking and a lot less reading than
computing a hash of each of the 1000 files. At least if the files are a
lot larger than 8k.
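Comparing the first 8k of two files is cheap to express in Perl. A minimal
sketch (the function name and the 8k default are my own; real code would
compare file sizes first, as discussed elsewhere in this thread):

```perl
use strict;
use warnings;

# Return true if the first 8k of the two files are byte-identical.
# Reads at most $len bytes from each file, so a 2G file costs no
# more than an 8k one here.
sub same_prefix {
    my ($a, $b, $len) = @_;
    $len //= 8192;
    my ($buf_a, $buf_b) = ('', '');
    for ([$a, \$buf_a], [$b, \$buf_b]) {
        my ($file, $ref) = @$_;
        open my $fh, '<:raw', $file or die "open $file: $!";
        read $fh, $$ref, $len;
        close $fh;
    }
    return $buf_a eq $buf_b;
}
```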
> and then you either store the first 8K, leading to a large working set
> (and the first time you swap, you've lost anything you won by avoiding
> calculating hashes),
8k * 1000 is 8 MB. That's negligible. And you only have to store this if
there are actually 1000 files of the same size.
There is also a hybrid approach:
For each group of files of the same size, you could initially read only
the first 8k (or some other size large enough to find the first
difference with a high probability, but small enough to be dwarfed by
the overhead of open(2)), and if those are identical, switch to
computing a hash (and as Ben said, you can use something like SHA512 -
where a collision is IMHO less likely than a false positive due to a
hardware or software error).
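The hybrid approach above can be sketched in a few lines of Perl. This is
only an illustration of the idea (the subroutine names and the 8k prefix
size are my own choices, not anyone's production code): group candidate
files by size, bucket each group by its first 8k, and only pay for a full
SHA-512 digest when the prefixes already agree.

```perl
use strict;
use warnings;
use Digest::SHA;

my $PREFIX_LEN = 8192;

# First 8k of a file, raw.
sub prefix_of {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "open $file: $!";
    my $buf = '';
    read $fh, $buf, $PREFIX_LEN;
    close $fh;
    return $buf;
}

# Full SHA-512 of a file; only called once the cheap checks agree.
sub full_digest {
    my ($file) = @_;
    return Digest::SHA->new(512)->addfile($file, 'b')->hexdigest;
}

# Returns a list of array refs, each holding files believed identical.
sub find_duplicates {
    my @files = @_;
    my (%by_size, @dups);
    # Files of different sizes cannot be identical: group by size first.
    push @{ $by_size{ -s $_ } }, $_ for @files;
    for my $group (grep { @$_ > 1 } values %by_size) {
        my %by_prefix;
        push @{ $by_prefix{ prefix_of($_) } }, $_ for @$group;
        for my $cand (grep { @$_ > 1 } values %by_prefix) {
            # Size and prefix agree: now pay for the full hash.
            my %by_digest;
            push @{ $by_digest{ full_digest($_) } }, $_ for @$cand;
            push @dups, $_ for grep { @$_ > 1 } values %by_digest;
        }
    }
    return @dups;
}
```

Most non-duplicates fall out at the size or prefix stage, so the expensive
whole-file hashing only runs on files that are quite likely identical.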
hp
--
_ | Peter J. Holzer | The curse of electronic word processing:
|_|_) | Sysadmin WSR | you keep filing away at your text until
| | | hjp@hjp.at | the sentence parts of the sentence no longer
__/ | http://www.hjp.at/ | fits together. -- Ralph Babel
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 3870
***************************************