

Perl-Users Digest, Issue: 4314 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sun Nov 23 03:09:17 2014

Date: Sun, 23 Nov 2014 00:09:02 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Sun, 23 Nov 2014     Volume: 11 Number: 4314

Today's topics:
    Re: A hash of references to arrays of references to has <hjp-usenet3@hjp.at>
    Re: A hash of references to arrays of references to has <gamo@telecable.es>
    Re: A hash of references to arrays of references to has <gravitalsun@hotmail.foo>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Sat, 22 Nov 2014 13:26:31 +0100
From: "Peter J. Holzer" <hjp-usenet3@hjp.at>
Subject: Re: A hash of references to arrays of references to hashes... is there a better way?
Message-Id: <slrnm7107n.l3u.hjp-usenet3@hrunkner.hjp.at>

On 2014-11-22 05:20, Robbie Hatley <see.my.sig@for.my.address> wrote:
> Greetings, group. I've been away several years, but I'm back. :-)
> I've changed jobs (several times), changed homes (several times).
> I've also changed programming platforms. I got rid of Windows 2000
> and djgpp, and I'm currently using the following 2 platforms:
> 1. Perl on Cygwin on Win 8.1 on notebook computer
> 2. Perl on Point Linux on desktop computer
>
> I'd gotten somewhat good at Perl, 2005-2008, stopped using it
> for a while (got distracted), but started getting back into it
> November 2014. I'm taking up a program I started writing in 2005
> but never finished. (At least not in Perl; I have a version
> written in C++, but that only works on djgpp on Win2K; not portable.)
> This program I'm writing will eventually be a duplicate file finding
> and erasing program. I've pasted what I've written so far at the end
> of this message for reference.
>
> This program uses a hash of references to arrays of references
> to hashes. (Gulp.)

Don't be afraid of nested data structures. Just think of the lower
levels as black boxes when you look at a higher level. 

Your lowest level hashes contain information about files, so think of
each of them as a file. At the next level you have lists of files, and
finally at the highest level you have a mapping from sizes to lists
(of files of the same size).

In object oriented programming you would define a class for each of
these to formalize this abstraction. You could do that here, too.
But for now let's just keep the "classes" in our head.
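A minimal sketch of the three-level structure (the field names follow the original program, but the sample data is made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# %files_by_size maps a size to a reference to a list of "file" hashes.
my %files_by_size = (
    1024 => [
        { Name => 'a.txt', Size => 1024, Type => 'File' },
        { Name => 'b.txt', Size => 1024, Type => 'File' },
    ],
);

# At each level, treat the level below as a black box:
for my $size (sort { $a <=> $b } keys %files_by_size) {  # sizes
    my $file_list = $files_by_size{$size};               # a list of files
    for my $file (@$file_list) {                         # one file
        print "$size: $file->{Name}\n";
    }
}
```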

> Seems to me there's got to be an easier way.  In C++ I just use
> "multimaps" from the C++ Standard Template Library. Maybe there's
> something like that in CPAN but I haven't looked yet.

I don't really think that helps here. On the contrary, it breaks the
nice abstractions and obscures what the program is supposed to do.


> My question is this: do the programmers here see any places in what
> I've written below where things could be expressed more briefly or
> clearly?

Yes.


> The part where I'm adding file-record hashes to arrays seems clunky
> to me. The idea is this: We're riffling through all files in the
> current directory, storing file records as hashes in arrays of
> same-file-size, with the arrays being inserted into a hash keyed
> by file size. So, if a file of same size as the current file has
> been processed already, then add record for current file to the
> appropriate array; otherwise, create a new array and insert it
> into the outer hash.

The idea is fine. That's usually the first step in finding duplicates
since it is fast.

> But the way I have this implemented below
> is kinda ugly. Is there a better way of doing this?

Yes.

Don't copy the whole array each time, use push.

So instead of 

>     if ($CurDirFiles{$Size})
>     {
>        $CurDirFiles{$Size} =
> 	  [
> 	     @{$CurDirFiles{$Size}},
> 	     {
>                  "Date" => $ModDate,
>                  "Time" => $ModTime,
>                  "Type" => $Type,
>                  "Size" => $Size,
>                  "Attr" => $mode,
>                  "Name" => $FileName
>               }
> 	  ];
>     }
>     else
>     {
>        $CurDirFiles{$Size} =
> 	  [
> 	     {
>                  "Date" => $ModDate,
>                  "Time" => $ModTime,
>                  "Type" => $Type,
>                  "Size" => $Size,
>                  "Attr" => $mode,
>                  "Name" => $FileName
>               }
> 	  ];
>     }

write

      $CurDirFiles{$Size} = [] unless $CurDirFiles{$Size};
      push @{ $CurDirFiles{$Size} },
  	   {
               "Date" => $ModDate,
               "Time" => $ModTime,
               "Type" => $Type,
               "Size" => $Size,
               "Attr" => $mode,
               "Name" => $FileName
           };

(in newer versions of perl, you can also push onto an array reference,
so instead of 
    push @{ $CurDirFiles{$Size} }, ...
you can also write 
    push $CurDirFiles{$Size}, ...
which eliminates some line noise. I think it's still considered
experimental, though.)


> And the following line looks too complicated to me; it works,
> but is there a better way to do this?
>
> foreach my $HashRef (@{$CurDirFiles{$Size}})

No. I would just add some spaces to make it more readable:

  foreach my $HashRef (@{ $CurDirFiles{$Size} })

Or you could split it into two lines:

    my $FileList = $CurDirFiles{$Size};
    foreach my $File (@$FileList) ...

(Note that I have also chosen descriptive variable names: "$HashRef" is
a terrible name: It tells you nothing about what the variable
represents, only about the implementation. Also, I've stuck with your
capitalization scheme for consistency, but starting a variable name with
an upper case letter is very unusual: That's normally only used for
package names (and constants, which are often written in all-caps). I
(and Damian Conway) prefer lower case only and underscores for variable
names. Some people use camel case starting with a lower case character.)


>     my $Type;
>
>     if ( -d _ )
>     {
>        $Type = "Dir";
>     }
>     else
>     {
>        $Type = "File";
>     }

These 10 lines can be replaced by one:

    my $Type = -d _ ? "Dir" : "File";


>     ##### Could this next line be written a better way? #####
>     foreach my $HashRef (@{$CurDirFiles{$Size}})
>     {
>        print($$HashRef{Date}, "  ");
>        print($$HashRef{Time}, "  ");
>        print($$HashRef{Type}, "  ");
>        print($$HashRef{Size}, "  ");
>        print($$HashRef{Attr}, "  ");
>        print($$HashRef{Name}, "\n");
>     }

First, don't write $$HashRef{$key}, write $HashRef->{$key}. It's one
character more to type but much clearer.

Then, either use a loop on the keys:

      foreach my $HashRef (@{$CurDirFiles{$Size}})
      {
         for my $Field (qw(Date Time Type Size Attr Name)) 
         {
             print($HashRef->{$Field}, "  ");
         }
         print "\n";
      }

(this leaves an extra blank at the end of the line, but I consider this
acceptable)

or, even better, use a hash slice and join

      foreach my $HashRef (@{$CurDirFiles{$Size}})
      {
         print join(" ",
                    @{ $HashRef }{qw(Date Time Type Size Attr Name)}),
               "\n";
      }

(no extra blank here.)

        hp


-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) |                    | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


------------------------------

Date: Sat, 22 Nov 2014 15:16:22 +0100
From: gamo <gamo@telecable.es>
Subject: Re: A hash of references to arrays of references to hashes... is there a better way?
Message-Id: <m4q5rj$rof$1@speranza.aioe.org>

On 22/11/14 at 06:20, Robbie Hatley wrote:
> # Plan: Recursively descend directory tree starting from current working     #
> # directory, and make a master list of all files encountered on this branch.  #
> # Order the list by size.  Within each size group, compare each file, from    #
> # left to right, to all the files to its right.  If a duplicate pair is found,#
> # alert user and get user input.  Give user these choices:                    #
> # 1. Erase left file                                                          #
> # 2. Erase right file                                                         #
> # 3. Ignore this pair of duplicate files and move to next                     #
> # 4. Quit                                                                     #
> # If user elects to delete a file, delete it, then move to next duplicate.    #

This is O(n²), which usually means wrong. You want to compare all with
all and expect that dupes appear in pairs. First, you must collect the
info that makes the file unique: the filename and the file content.
Then you have $filename from readdir and `sha3sum` of that file's content.
You can add 1 for each appearance with $hash{ $filename.'+'.$hash_sha3 }++;
treat that hash by value, and you are almost done.
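The content-based grouping described above can be sketched like this (a minimal sketch using Digest::SHA from core Perl rather than an external `sha3sum` binary; the sub name `group_by_digest` is made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA;

# Group the files in a directory by the SHA-256 digest of their content.
# Returns a hash mapping digest -> array ref of file names; any group
# with more than one member is a set of duplicates.
sub group_by_digest {
    my ($dir) = @_;
    my %files_by_digest;
    for my $filename (glob "$dir/*") {
        next unless -f $filename;
        my $digest = Digest::SHA->new(256)->addfile($filename)->hexdigest;
        push @{ $files_by_digest{$digest} }, $filename;
    }
    return %files_by_digest;
}

my %groups = group_by_digest('.');
for my $digest (keys %groups) {
    my @group = @{ $groups{$digest} };
    print "Duplicates: @group\n" if @group > 1;
}
```

Grouping by digest first and only reporting groups of two or more avoids the pairwise O(n²) comparison.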

-- 
http://www.telecable.es/personales/gamo/


------------------------------

Date: Sat, 22 Nov 2014 17:52:32 +0200
From: George Mpouras <gravitalsun@hotmail.foo>
Subject: Re: A hash of references to arrays of references to hashes... is there a better way?
Message-Id: <m4qbg4$pee$1@news.ntua.gr>

The following is more secure because it is based on the real file
content instead of file metadata (size, date, etc.).

It could be written to be much faster, but here is only a draft as a
proof of concept. The duplicate files are written to an external script
that you can review and execute manually if you want.




#!/usr/bin/perl
# Find duplicate files, and write them to a shell script
use strict;
use warnings;
use feature qw/say/;
use File::Find;
use Digest::MD5;

my $dir_to_search_for_duplicates = 'g:/temp';
my %DB;
my $md5 = Digest::MD5->new;
my %cfg;
@cfg{qw/ext header cmd/} = $^O =~ /(?i)MS/
    ? ('bat', '@echo off', 'del /q /f')
    : ('sh' , '#!/bin/sh', 'rm -f');
$cfg{script} = "delete_duplicate_file.$cfg{ext}";
open SCRIPT, '>', $cfg{script} or die "Oups \"$!\"\n";
say SCRIPT $cfg{header};

File::Find::find({
    wanted   => \&Find_dups,
    no_chdir => 1,
    bydepth  => 0,
    follow   => 0 },
    $dir_to_search_for_duplicates );
close SCRIPT;

sub Find_dups {
    return unless -f $File::Find::name;
    return unless open FILE, '<:raw', $File::Find::name;
    $md5->addfile(\*FILE);
    my $sig = $md5->digest;
    close FILE;
    $File::Find::name =~ s/\//\\/g if $cfg{ext} eq 'bat';
    $DB{$sig} = $File::Find::name, return if ! exists $DB{$sig};
    say "Duplicate \"$File::Find::name\" same as \"$DB{$sig}\"";
    say SCRIPT "$cfg{cmd} \"$File::Find::name\"";
}



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 4314
***************************************

