[33050] in Perl-Users-Digest

home help back first fref pref prev next nref lref last post

Perl-Users Digest, Issue: 4326 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Wed Dec 10 21:09:20 2014

Date: Wed, 10 Dec 2014 18:09:09 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Wed, 10 Dec 2014     Volume: 11 Number: 4326

Today's topics:
    Re: Date procedure? <justin.1410@purestblue.com>
    Re: Date procedure? <hhr-m@web.de>
    Re: Date procedure? <gamo@telecable.es>
    Re: Date procedure? <justin.1410@purestblue.com>
    Re: Date procedure? <tuxedo@mailinator.com>
    Re: Date procedure? <tuxedo@mailinator.com>
    Re: Date procedure? <rweikusat@mobileactivedefense.com>
    Re: Date procedure? <tuxedo@mailinator.com>
    Re: Date procedure? <rweikusat@mobileactivedefense.com>
    Re: Date procedure? <afmcc@btinternet.com>
    Re: Date procedure? <tuxedo@mailinator.com>
    Re: Date procedure? <rweikusat@mobileactivedefense.com>
    Re: Date procedure? <netnews@invalid.com>
    Re: Dedup script is finished and works. Critiques? <see.my.sig@for.my.address>
    Re: Dedup script is finished and works. Critiques? <see.my.sig@for.my.address>
    Re: Dedup script is finished and works. Critiques? <see.my.sig@for.my.address>
    Re: Dedup script is finished and works. Critiques? <kaz@kylheku.com>
        Latest version of "dedup", and questions on open files, <see.my.sig@for.my.address>
    Re: Latest version of "dedup", and questions on open fi <kaz@kylheku.com>
    Re: Need a Mac Script to search and print lines in all  <gamo@telecable.es>
    Re: Need a Mac Script to search and print lines in all  <justin.1410@purestblue.com>
    Re: Need a Mac Script to search and print lines in all  <kw@codebykevin.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Mon, 8 Dec 2014 16:44:59 +0000
From: Justin C <justin.1410@purestblue.com>
Subject: Re: Date procedure?
Message-Id: <b51hlb-qgh.ln1@zem.masonsmusic.co.uk>

On 2014-12-08, Tuxedo <tuxedo@mailinator.com> wrote:
> I have an $orig_date variable string contaning for example:
>
> 2014-12-08
>
> I'd like to keep the original format to more easily make use of the same 
> data whenever some other function may require a date for processing and 
> comparing something else.
>
> However, to keep everyone everywhere happy with what they see displayed, 
> can someone here advise me on a procedure to reorganise such formatted date 
> into a more humanly readable string, placing a reformatted date in for 
> example a new $date_en variable:
>
> 8 December 2014 
>
> Maybe there is even some convenient module that can return different local 
> style date conventions or languages based on the yyyy-mm-dd format to print 
> for example:
>
> 8 décembre 2014
> 8 diciembre 2014
> 12/08/2014
> 08/12/2014
>
> etc.
>
> Many thanks for any tips!

perldoc DateTime

would not be a bad place to start.


   Justin.

-- 
Justin C, by the sea.


------------------------------

Date: Tue, 09 Dec 2014 09:16:43 +0100
From: Helmut Richter <hhr-m@web.de>
Subject: Re: Date procedure?
Message-Id: <m66b5j$vl4$1@news.in.tum.de>

Am 09.12.2014 00:36, schrieb Tuxedo:

> People in different cultures may still see the 2014-12-08 a bit odd and
> even mean different dates in some minds, much like the British vs US
> conventions.

Is there anyone in any country who would give the date 2014-12-08 
another meaning than 8 DEC 2014? In comparison: the British vs US 
conventions allow 10/12/14 to mean any of 12 OCT 2014, 10 DEC 2014, or 
14 DEC 2010.

> As such, it may be a good idea to spell out the month.

In English, of course, in order to be compatible with all "different 
cultures" -- among which there are even some that use our calendar with 
month names that have no resemblance with ours, e.g. Ukrainian or 
Lithuanian.

-- 
Helmut Richter



------------------------------

Date: Tue, 09 Dec 2014 10:18:19 +0100
From: gamo <gamo@telecable.es>
Subject: Re: Date procedure?
Message-Id: <m66eoq$eoh$1@speranza.aioe.org>

El 09/12/14 a las 09:16, Helmut Richter escribió:
> Am 09.12.2014 00:36, schrieb Tuxedo:
>
>> People in different cultures may still see the 2014-12-08 a bit odd and
>> even mean different dates in some minds, much like the British vs US
>> conventions.
>
> Is there anyone in any country who would give the date 2014-12-08
> another meaning than 8 DEC 2014? In comparison: the British vs US
> conventions allow 10/12/14 to mean any of 12 OCT 2014, 10 DEC 2014, or
> 14 DEC 2010.
>

 From the wiki:

Date and time (current at page generation)
expressed according to ISO 8601:
Date: 	2014-12-09
Combined date and time in UTC: 	2014-12-09T07:39:22+00:00
2014-12-09T07:39:22Z

If there are differences and small switching costs, it is a
reason to adopt a standard.

-- 
http://www.telecable.es/personales/gamo/
The generation of random numbers is too important to be left to chance


------------------------------

Date: Tue, 9 Dec 2014 09:19:23 +0000
From: Justin C <justin.1410@purestblue.com>
Subject: Re: Date procedure?
Message-Id: <rdrilb-mk1.ln1@zem.masonsmusic.co.uk>

On 2014-12-08, Tuxedo <tuxedo@mailinator.com> wrote:
> use strict;
> use warnings;
> use POSIX qw(strftime);
>
> my $datestring = strftime "%d %B %Y", gmtime;

I suggest 'localtime' instead of 'gmtime' - unless you need
everyone around the world to understand exactly the precise
time you mean, in which case I'd suffix GMT and want hours,
minutes and seconds.


   Justin.

-- 
Justin C, by the sea.


------------------------------

Date: Tue, 9 Dec 2014 21:13:15 +0100
From: Tuxedo <tuxedo@mailinator.com>
Subject: Re: Date procedure?
Message-Id: <m67l4s$ku5$1@news.albasani.net>

George Mpouras wrote:

[...]


> use strict;
> use warnings;
> use feature qw/say/;;
> 
> my $orig_date = '2014-12-08';
> 
> say ReformatDay('$DD $MM $Y',               $orig_date);
> say ReformatDay('$D ($DDDD) $M ($MMMM) $Y', $orig_date);
> say ReformatDay('$MMM/$DDD $Y',             $orig_date);
> 
> 
> #   $M    = 2
> #   $MM   = 02
> #   $MMM  = Feb
> #   $MMMM = February
> #   $D    = 9
> #   $DD   = 09
> #   $DDD  = Sun
> #   $DDDD = Sunday
> #   $Y    = 2014
> #
> sub ReformatDay
> {
> my ($Y,$M,$D) = $_[1] =~/^(\d{4})-0?(\d{1,2})-0?(\d{1,2})$/ or return
> undef;
> my $MM   = sprintf "%02d", $M;
> my $MMMM = {qw/1 January 2 February 3 March 4 April 5 May 6 June 7 July
> 8 August 9 September 10 October 11 November 12 December/}->{$M};
> my $MMM  = substr $MMMM, 0, 3;
> my $DD   = sprintf "%02d", $D;
> my $DDDD = sub {
> my ($d,$m,$y)=@_;
> $m < 3 && $y--;
> 
{0,'Sunday',1,'Monday',2,'Tuesday',3,'Wednesday',4,'Thursday',5,'Friday',6,'Saturday'}->{($d
> + ((qw/0 3 2 5 0 3 5 1 4 6 2 4/)[$m-1])
> +$y+(int($y/4))-(int($y/100))+(int($y/400)))%7}
> }->($D,$M,$Y);
> my $DDD = substr $DDDD, 0, 3;
> eval "\"$_[0]\""
> }

Many thanks for sharing this example. It appears to work perfectly!

Tuxedo


------------------------------

Date: Tue, 9 Dec 2014 21:34:51 +0100
From: Tuxedo <tuxedo@mailinator.com>
Subject: Re: Date procedure?
Message-Id: <m67mdc$o4j$1@news.albasani.net>

gamo wrote:

[...]
 
> You can do:
> 
> A)  $date =~ s/(\d+)\-(\d+)\-(\d+)/$3-$2-$1/;
> 
> B) my @month = qw(NULL enero febrero marzo abril mayo junio julio
> agosto septiembre octubre noviembre diciembre);
> 
> and call $month[12] whatever you want.

Many thanks for posting this. I'm not quite sure how it works but will test.

Tuxedo


------------------------------

Date: Tue, 09 Dec 2014 20:35:03 +0000
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Date procedure?
Message-Id: <87a92wd0js.fsf@doppelsaurus.mobileactivedefense.com>

Tuxedo <tuxedo@mailinator.com> writes:
> George Mpouras wrote:
>
> [...]
>
>
>> use strict;
>> use warnings;
>> use feature qw/say/;;
>> 
>> my $orig_date = '2014-12-08';
>> 
>> say ReformatDay('$DD $MM $Y',               $orig_date);
>> say ReformatDay('$D ($DDDD) $M ($MMMM) $Y', $orig_date);
>> say ReformatDay('$MMM/$DDD $Y',             $orig_date);
>> 
>> 
>> #   $M    = 2
>> #   $MM   = 02
>> #   $MMM  = Feb
>> #   $MMMM = February
>> #   $D    = 9
>> #   $DD   = 09
>> #   $DDD  = Sun
>> #   $DDDD = Sunday
>> #   $Y    = 2014
>> #
>> sub ReformatDay
>> {
>> my ($Y,$M,$D) = $_[1] =~/^(\d{4})-0?(\d{1,2})-0?(\d{1,2})$/ or return
>> undef;
>> my $MM   = sprintf "%02d", $M;
>> my $MMMM = {qw/1 January 2 February 3 March 4 April 5 May 6 June 7 July
>> 8 August 9 September 10 October 11 November 12 December/}->{$M};

[more of this]

> Many thanks for sharing this example. It appears to work perfectly!

People who are less keen on baffling others with funny characters and
pointless byzantinisms might feel inclined to use something like this
instead:

-----------
use POSIX;

my $d = '2014-12-08';

sub reformat
{
    my ($y, $m, $d) = split(/-/, $_[0]);
    return (strftime("%e %B %Y", 0, 0, 0, $d, $m - 1, $y - 1900) =~ / ?(.*)/)[0];
}

print(reformat($d), "\n");
-----------

If your system has a strftime man page, you can get the full set of
available format specifiers from there.


------------------------------

Date: Tue, 9 Dec 2014 22:55:53 +0100
From: Tuxedo <tuxedo@mailinator.com>
Subject: Re: Date procedure?
Message-Id: <m67r5a$3ii$1@news.albasani.net>

Rainer Weikusat wrote:

[...]
> 
> People who are less keen on baffling others with funny characters and
> pointless byzantinisms might feel inclined to use something like this
> instead:
> 
> -----------
> use POSIX;
> 
> my $d = '2014-12-08';
> 
> sub reformat
> {
>     my ($y, $m, $d) = split(/-/, $_[0]);
>     return (strftime("%e %B %Y", 0, 0, 0, $d, $m - 1, $y - 1900) =~ /
>     ?(.*)/)[0];
> }
> 
> print(reformat($d), "\n");
> -----------

Perhaps a little less modular in returning localized month strings than the 
script posted previously by George Mpourasbut, but the above is much 
shorter and easier for me to see through.

> If your system has a strftime man page, you can get the full set of
> available format specifiers from there.

I do indeed have a strftime man page and will look into it.

Thanks for the pointer and for a most usable date reformatting procedure.

Tuxedo



------------------------------

Date: Tue, 09 Dec 2014 22:13:05 +0000
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Date procedure?
Message-Id: <87wq60bhfy.fsf@doppelsaurus.mobileactivedefense.com>

Tuxedo <tuxedo@mailinator.com> writes:
> Rainer Weikusat wrote:

[...]

>> -----------
>> use POSIX;
>> 
>> my $d = '2014-12-08';
>> 
>> sub reformat
>> {
>>     my ($y, $m, $d) = split(/-/, $_[0]);
>>     return (strftime("%e %B %Y", 0, 0, 0, $d, $m - 1, $y - 1900) =~ /
>>     ?(.*)/)[0];
>> }
>> 
>> print(reformat($d), "\n");
>> -----------
>
> Perhaps a little less modular in returning localized month strings than the 
> script posted previously by George Mpourasbut, but the above is much 
> shorter and easier for me to see through.

If you don't want localized names, you can just use an array of month
names and index that by the month number - 1 (since array indices start
with zero):

-----------
my $d = '2014-12-08';
my @months = qw(January February March April May June July August September October November December);

sub reformat
{
    my ($y, $m, $d) = split(/-/, $_[0]);
    return sprintf("%d %s %d", $d, $months[$m - 1], $y);
}

print(reformat($d), "\n");


------------------------------

Date: Tue, 09 Dec 2014 13:27:30 +0000
From: TonyMc <afmcc@btinternet.com>
Subject: Re: Date procedure?
Message-Id: <87tx15x8al.fsf@btinternet.com>

HASM <netnews@invalid.com> writes:

> the land of the 12-08-2014 format

Ah yes, the land where, every 8th December, grouse need to worry about
what date format the gunmen are using.

Tony


------------------------------

Date: Wed, 10 Dec 2014 09:01:53 +0100
From: Tuxedo <tuxedo@mailinator.com>
Subject: Re: Date procedure?
Message-Id: <m68ulh$2vc$1@news.albasani.net>

Rainer Weikusat wrote:

[...]

> 
> If you don't want localized names, you can just use an array of month
> names and index that by the month number - 1 (since array indices start
> with zero):
> 
> -----------
> my $d = '2014-12-08';
> my @months = qw(January February March April May June July August
> September October November December);
> 
> sub reformat
> {
>     my ($y, $m, $d) = split(/-/, $_[0]);
>     return sprintf("%d %s %d", $d, $months[$m - 1], $y);
> }
> 
> print(reformat($d), "\n");

The above procedure is now both short and modular, quite perfect!

Many thanks,
Tuxedo



------------------------------

Date: Wed, 10 Dec 2014 15:18:38 +0000
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Date procedure?
Message-Id: <87d27r1qk1.fsf@doppelsaurus.mobileactivedefense.com>

HASM <netnews@invalid.com> writes:
> Tuxedo <tuxedo@mailinator.com> writes:
>
>> So far I did not yet find a way to convert a 2014-12-08 type of date
>> string to a more universally human readable format such as 8 December
>> 2014.
>
> I would hope we do move towards using 2014-12-08 as the "universally human
> readable format".  After all, it is the ISO standard, and the only format
> that sorts easily.  I use it all the time in the land of the 12-08-2014
> format, and have yet to have anyone blink at it.

There's also a 08.12.2014/ 08-12-2014 format: Someone from Germany
without exposure to 'strange, anglo-saxon conventions' will likely
interpret both 2014-12-08 and 12-08-2014 as '12th of August 2014' (and
probably consider the first one a bit weird is it doesn't progress from
smaller to large units, considering that '2014th of December 08' (1908?
2008?) makes little sense).



------------------------------

Date: Wed, 10 Dec 2014 07:26:58 -0800
From: HASM <netnews@invalid.com>
Subject: Re: Date procedure?
Message-Id: <87tx13zfst.fsf@127.0.0.1>

Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:

>> I use it all the time in the land of the 12-08-2014
>> format, and have yet to have anyone blink at it.

> There's also a 08.12.2014/ 08-12-2014 format: Someone from Germany
> without exposure to 'strange, anglo-saxon conventions' will likely
> interpret both 2014-12-08 and 12-08-2014 as '12th of August 2014'

Except when there are little labeled boxes that force a certain standard,
I've been dating all my documents, official or otherwise, in both an
anglo-saxon country in the New World, a latin country in the Old World, and
have actually used it in a arabic/french country in some other world.  I
have yet to find someone that finds the format illegible.

-- HASM


------------------------------

Date: Wed, 10 Dec 2014 15:58:01 -0800
From: Robbie Hatley <see.my.sig@for.my.address>
Subject: Re: Dedup script is finished and works. Critiques?
Message-Id: <e5mdndNAZ5-bQxXJnZ2dnUVZ572dnZ2d@giganews.com>


On 12/7/2014 8:17 AM, George Mpouras wrote:

> your script is unacceptable.

I had several reasons for writing this program, and for
discussing it with folk here. But acceptance was not
one of them. As always, each person is free to accept
or reject whatever they want to.

> I have written at your  initial thread how you could implemented
> it to be 100 times faster but you replied that "you know what
> you want to do"

Disingenuous. I could write a perl script that's a *million*
times faster:

#! /usr/bin/perl
# FAST SCRIPT
exit 0;

But that wouldn't quite do what I want. I currently have my
dedup script doing what I want (after fixing one major bug I found),
at acceptable speeds. The interface can use some tweaking,
but the guts appear to be working well.


-- 
Cheers,
Robbie Hatley
Midway City, CA, USA
perl -le 'print "\154o\156e\167o\154f\100w\145ll\56c\157m"'
http://www.well.com/user/lonewolf/


------------------------------

Date: Wed, 10 Dec 2014 15:59:24 -0800
From: Robbie Hatley <see.my.sig@for.my.address>
Subject: Re: Dedup script is finished and works. Critiques?
Message-Id: <e5mdndJAZ5_EQxXJnZ2dnUVZ570AAAAA@giganews.com>


On 12/5/2014 4:11 AM, gamo wrote:

> El 05/12/14 a las 12:42, Robbie Hatley escribió:
>> An alternate way would be to present the set of files
>> serialized like so:
>>
>> The following files are duplicates of each other:
>> 1. Jack's Regular Expression Tutorial.pdf
>> 2. Jack's Regex Tutorial.pdf
>> 3. REGEXTUTORIAL.PDF
>> 4. Goleta-Summer.jpg
>> 5. jacksretut.pdf
>> 6. re.pdf
>> 7. jack_re_tutorial.pdf
>> Which (if any) files do you want to delete? Type a list
>> of serial numbers separated by spaces, then hit Enter.
>> To ignored these duplicates, just hit Enter.
>> To quit, type "q" and hit Enter.
>
> Better, but you have to rewrite the script.

No problem. I need the perl practice. :-)

> Good luck.

Thanks.


-- 
Cheers,
Robbie Hatley
Midway City, CA, USA
perl -le 'print "\154o\156e\167o\154f\100w\145ll\56c\157m"'
http://www.well.com/user/lonewolf/


------------------------------

Date: Wed, 10 Dec 2014 16:36:00 -0800
From: Robbie Hatley <see.my.sig@for.my.address>
Subject: Re: Dedup script is finished and works. Critiques?
Message-Id: <6d2dneGULulzexXJnZ2dnUVZ572dnZ2d@giganews.com>


On 12/5/2014 9:27 AM, Kaz Kylheku wrote:

> Your program wastes time comparing files that may be the same inode
> on disk: the same object. Take advantage of identical $ino fields.

Not every file system has "inodes", but I can look into that.
I'm not sure what would be in the "inode" field of "stat" on
a file system such as NTFS.

> Moreover, in that situation, there is often a good reason for
> the multiple entries for the same file. Someone or some program
> wanted the file in two or more places, without taking up space.
>
> The user should be informed that some foo/X and bar/Y are not just identical,
> but are in fact the same object.

I do now test with -f so that I'm only considering erasing
"regular files", not symbolic links or directories. I'm don't
know enough about Linux's file systems (EXT2, EXT3, EXT4) to
be 100% sure, but I don't personally know of any other way
besides symbolic links (what windows would call "shortcuts")
to have two files that appear to be different objects but
are actually the same object. Or are there things I still
need to learn about Linux file systems?


-- 
Cheers,
Robbie Hatley
Midway City, CA, USA
perl -le 'print "\154o\156e\167o\154f\100w\145ll\56c\157m"'
http://www.well.com/user/lonewolf/


------------------------------

Date: Thu, 11 Dec 2014 00:52:31 +0000 (UTC)
From: Kaz Kylheku <kaz@kylheku.com>
Subject: Re: Dedup script is finished and works. Critiques?
Message-Id: <20141210164100.531@kylheku.com>

On 2014-12-11, Robbie Hatley <see.my.sig@for.my.address> wrote:
>
> On 12/5/2014 9:27 AM, Kaz Kylheku wrote:
>
>> Your program wastes time comparing files that may be the same inode
>> on disk: the same object. Take advantage of identical $ino fields.
>
> Not every file system has "inodes", but I can look into that.

Whoa, big jump from "this does what I want" to "this has to work
on every fscking file system". :)

> I'm not sure what would be in the "inode" field of "stat" on
> a file system such as NTFS.

There should be an object field there. NTFS has the equivalent of inode
numbers. In Cygwin, the hard link system call link() works fine,
for instance. So does the ln shell command, and when you "ln foo bar",
you even see the link count 2 for both entries.

>> Moreover, in that situation, there is often a good reason for
>> the multiple entries for the same file. Someone or some program
>> wanted the file in two or more places, without taking up space.
>>
>> The user should be informed that some foo/X and bar/Y are not just identical,
>> but are in fact the same object.
>
> I do now test with -f so that I'm only considering erasing
> "regular files", not symbolic links or directories.

But hard links are regular files. The only clue that there are multiple links
to an object is that it shows a link count greater than one. (In a dedup
program, you wouldn't care about link count because you have hunted down all
the entries in a given tree that the user cares about, and you know
which are aliases for the same object.)

> I'm don't
> know enough about Linux's file systems (EXT2, EXT3, EXT4) to
> be 100% sure, but I don't personally know of any other way
> besides symbolic links (what windows would call "shortcuts")

Shortcuts are a feature of the Windows Shell (explorer.exe).

The operating system as such does not understand them.

You can resolve a .LNK with the Shell API (like ShellExecuteEx)
but not with, say, CreateProcess.

NTFS has actual symbolic links, too!

http://en.wikipedia.org/wiki/NTFS_symbolic_link

Funny quote:

  "The default security settings in Windows Vista/Windows 7 disallow
  non-elevated administrators and all non-administrators from creating symbolic
  links."

This is likely because the mountains of broken software on Windows will get
confused if it sees a symbolic link.

> to have two files that appear to be different objects but
> are actually the same object.

$ touch foo
$ ln foo bar  # foo bar are same file now


------------------------------

Date: Wed, 10 Dec 2014 16:24:43 -0800
From: Robbie Hatley <see.my.sig@for.my.address>
Subject: Latest version of "dedup", and questions on open files, unlink, and rm.
Message-Id: <s5qdnfHJ2dzUeRXJnZ2dnUVZ572dnZ2d@giganews.com>


I've pasted below the latest version of my "duplicate file
finder and destroyer" script. I found and fixed one MAJOR bug:
The script was flagging non-duplicate files as being
duplicates if 3 or more files of the exact same size but
different content exist in current directory. This was
because I was holding one file open while comparing it
to other files. The other file got their "read pointer"
reset to zero each time, but the first file didn't because
it was still open. So I moved the "open" statements to
just before the "BUFFER" loop (where I read chunks of
the files into buffers with "read"), and I moved the
"close" statements to just after the BUFFER loop.
I also put in "seek($fh, 0, 0)" statements just before
starting the loop, to make sure. (Redundant?)

SOME QUESTIONS ABOUT FILES:

1. When an "open" statement is set up to autovivify a
    file handle variable, and the open fails, is the
    file handle variable supposed to be set to "undef"?
    In my tests, sometimes it is, sometimes it isn't.
    Is that a Perl bug? Or is supposed to do that?

2. Since just testing to see if a $filehandle variable
    is "defined" is apparently NOT a good way to determine
    whether a file is open... what is? Is there some kind
    of "is_open()" function in Perl?

3. Is the "read pointer" always set to 0 when a file is
    opened for reading? Or can there be cases when it's not?

4. My dedup script is using "unlink" to "erase" files. On
    windows, this is doing what it should, because "shortcuts"
    or "links" point to the main directory entry for a file,
    not the file data. But from what I've read, on Linux,
    if symbolic links to a file exist anywhere, those links
    will still remain valid even if the main file entry is
    "unlinked" from its directory. Perhaps something like
    "system rm $filename" or even `rm $filename` might be
    a better way, if I want to actually get rid of a file
    completely (and invalidate any links to it)?



Latest version of my "dedup" script is below for reference:



#!/usr/bin/perl

################################################################################
# dedup.perl                                                                   #
# Duplicate file finding/erasing program.                                      #
# Written by Robbie Hatley, starting 2005-06-21, as a "learn Perl" exercise.   #
# Plan: Recursively descend directory tree starting from current working       #
# directory, and make a master list of all files encountered on this branch.   #
# Order the list by size.  Within each size group, compare each file, from     #
# left to right, to all the files to its right.  If a duplicate pair is found, #
# alert user and get user input.  Give user these choices:                     #
# 1. Erase first file                                                          #
# 2. Erase second file                                                         #
# 3. Ignore this pair of duplicate files and move to next                      #
# 4. Quit                                                                      #
# If user elects to delete a file, delete it, then move to next duplicate.     #
# Edit history:                                                                #
#    Tue Jun 21, 2005 - Started writing it.                                    #
#    Thu Nov 20, 2014 - Getting back to this exercise after 9-year hiatus.     #
#    Mon Nov 24, 2014 - Got it working up to the point of alerting user as to  #
#                       whether each pair of same-size files are identical or  #
#                       different.                                             #
#    Thu Dec 04, 2014 - Now fully functional.                                  #
################################################################################

use v5.14;
use strict;
use warnings;

use Cwd;

sub get_character {
    system "stty cbreak </dev/tty >/dev/tty 2>&1";
    my $char = getc;
    system "stty icanon </dev/tty >/dev/tty 2>&1";
    return $char;
}

my $CurDir;
my %CurDirFiles;

# Get and open current working directory:
$CurDir = getcwd();
opendir(my $dot, ".") or die "Can\'t open directory. $!";

# Iterate through current directory, collecting info on all "regular" files:
FILE: while (my $filename=readdir($dot))
{
    # We're only interested in "regular" files (not directories, symbolic links,
    # etc), so if current file isn't a regular file, move on to next file:
    next FILE if not -f $filename;

    # If we get to here, current file is a "regular" file, so collect its
    # size and modification time:
    my ($size, $mtime) = (stat($filename))[7,9];

    # Store file info record for this file in a hash of references to arrays of
    # references to hashs of file info, with the outer hash keyed by size.
    # Each new file size encountered autovivifies a new array in the hash for
    # that file size.
    push @{$CurDirFiles{$size}},
       {
          "Name" => $filename,
          "Size" => $size,
          "Time" => $mtime
       };

};
closedir($dot);

# Iterate through the keys of the hash, in inverse order of size.
# Compare any files of same size, byte for byte. If any duplicate
# files are found, alert user and ask what user wants to do:
SIZE: foreach my $size (sort {$b<=>$a} keys %CurDirFiles)
{
    my $count = scalar(@{$CurDirFiles{$size}});

    # If fewer than two files exist of this size, go to next size group:
    if ($count < 2) next SIZE;

    # "FIRST" LOOP: Iterate through all files except the last,
    # opening each file at top of loop and closing it at bottom:
    FIRST: for (my $i = 0 ; $i < ($count - 1) ; ++$i)
    {
       my $filename1 = ${$CurDirFiles{$size}}[$i]->{Name};

       # "SECOND" LOOP: For each "first" file, compare it to each file which
       # is to its right in the array. (This is why $j starts at $i + 1
       # instead of 0 or 1 as you might expect.)
       SECOND: for (my $j = $i + 1 ; $j < $count ; ++$j)
       {
          my $filename2 = ${$CurDirFiles{$size}}[$j]->{Name};

          # Open both files here, shortly before entering buffer loop.
          # They will both be closed immediately after exiting buffer loop.

          open (my $filehandle1, "< :raw", $filename1)
          or say "Couldn't open first file $filename1"
          and next FIRST;

          open (my $filehandle2, "< :raw", $filename2)
          or say "Couldn't open second file $filename2"
          and close($filehandle1)
          and next SECOND;

          # Before entering buffer loop, seek both files to position 0:
          seek($filehandle1, 0, 0);
          seek($filehandle2, 0, 0);

          my $bufferfulls = 0;  # Create a buffer counter and set it to 0.
          my $duplicates  = 1;  # Start by assuming current files are duplicates.

          # BUFFER LOOP: Read data from first and second files, in buffers of
          # 1 MiB, and compare first buffer to second buffer. If a difference is
          # found, mark files as "different" and exit loop; else continue
          # reading the files until out of data:
          BUFFER: while (1)
          {
             my $read_result1 = read($filehandle1, my $buffer1, 1048576);
             my $read_result2 = read($filehandle2, my $buffer2, 1048576);

             # Both read results should at least be DEFINED, even if we
             # attempted to read past end of files. If not, die:
             my $fail_flag1 = not defined $read_result1;
             my $fail_flag2 = not defined $read_result2;
             if $fail_flag1 say("ERROR: Can't read first file $filename1.");
             if $fail_flag2 say("ERROR: Can't read first file $filename2.");
             die if $fail_flag1 || $fail_flag2;

             # If the two read results are defined but not equal, that too is an
             # error condition, so print error messages and die:
             if ($read_result1 != $read_result2)
             {
                say "Sync failure between same-size files:";
                say "$filename1";
                say "$filename2";
                die;
             }

             # If $read_result1 is less than 1, we're trying to read past
             # end of file, so exit buffer loop. (No sense testing BOTH
             # read results, because we just ascertained they're equal.)
             last BUFFER if ($read_result1 < 1);

             # If we get to here, we just read at least 1 byte (and at most
             # 1 MiB) into both buffers, so increment our "bufferfulls" counter:
             ++$bufferfulls;

             # If buffers are equal, read next chunks of files:
             if ( $buffer1 eq $buffer2 )
             {
                next BUFFER;
             }

             # Otherwise, these files are not duplicates, so exit buffer loop:
             else
             {
                $duplicates = 0;
                last BUFFER;
             }
          } # End BUFFER loop.

          # Close both files immediately after exiting buffer loop:
          close($filehandle1);
          close($filehandle2);

          # If we didn't actually read anything, an error has occurred,
          # so print error message and die, unless file size is zero,
          # in which case the file of course contains zero bytes:
          if ( ( $size > 0 ) and ( $bufferfulls < 1 ) )
          {
             say "ERROR: Failed to actually read any data,";
             say "even though these files are of non-zero size:";
             say "$filename1";
             say "$filename2";
             die;
          }

          # If the two files are duplicates, give user choices on what to do:
          if ($duplicates)
          {
             say("\n\n${filename1}\nis identical to\n${filename2}");
             say("Type 1 to erase ${filename1}");
             say("Type 2 to erase ${filename2}");
             say("Type 3 to ignore these duplicates");
             say("Type 4 to abort program and return to bash");
             my $invalid = 1;
             while ($invalid)
             {
                my $char = get_character;
                given ($char)
                {
                   when ("1")
                   {
                      $invalid = 0;        # Clear invalid flag.
                      unlink($filename1);  # Erase first file.
                      next FIRST;          # Move on to next first file.
                   }
                   when ("2")
                   {
                      $invalid = 0;        # Clear invalid flag.
                      unlink($filename2);  # Erase second file.
                      next SECOND;         # Move on  to next second file.
                   }
                   when ("3")
                   {
                      $invalid = 0;        # Clear invalid flag.
                      next SECOND;         # Just move on to next second file.
                   }
                   when ("4")
                   {
                      exit 0;              # Abort operations and exit program.
                   }
                   default
                   {
                      say("\nInvalid character.");
                   }
                }#end given ($char)
             }#end while ($invalid)
          }#end if duplicates
       }#end SECOND
    }#end FIRST
}#end foreach size group
exit 0;




-- 
Cheers,
Robbie Hatley
Midway City, CA, USA
perl -le 'print "\154o\156e\167o\154f\100w\145ll\56c\157m"'
http://www.well.com/user/lonewolf/


------------------------------

Date: Thu, 11 Dec 2014 00:59:58 +0000 (UTC)
From: Kaz Kylheku <kaz@kylheku.com>
Subject: Re: Latest version of "dedup", and questions on open files, unlink, and rm.
Message-Id: <20141210165311.786@kylheku.com>

On 2014-12-11, Robbie Hatley <see.my.sig@for.my.address> wrote:
>
> I've pasted below the latest version of my "duplicate file
> finder and destroyer" script. I found and fixed one MAJOR bug:
> The script was flagging non-duplicate files as being
> duplicates if 3 or more files of the exact same size but
> different content exist in current directory. This was
> because I was holding one file open while comparing it
> to other files. The other file got their "read pointer"
> reset to zero each time, but the first file didn't because
> it was still open. So I moved the "open" statements to
> just before the "BUFFER" loop (where I read chunks of
> the files into buffers with "read"), and I moved the
> "close" statements to just after the BUFFER loop.
> I also put in "seek($fh, 0, 0)" statements just before
> starting the loop, to make sure. (Redundant?)
>
> SOME QUESTIONS ABOUT FILES:
>
> 1. When an "open" statement is set up to autovivify a
>     file handle variable, and the open fails, is the
>     file handle variable supposed to be set to "undef"?
>     In my tests, sometimes it is, sometimes it isn't.
>     Is that a Perl bug? Or is supposed to do that?
>
> 2. Since just testing to see if a $filehandle variable
>     is "defined" is apparently NOT a good way to determine
>     whether a file is open... what is? Is there some kind
>     of "is_open()" function in Perl?
>
> 3. Is the "read pointer" always set to 0 when a file is
>     opened for reading? Or can there be cases when it's not?
>
> 4. My dedup script is using "unlink" to "erase" files. On
>     windows, this is doing what it should, because "shortcuts"
>     or "links" point to the main directory entry for a file,
>     not the file data. But from what I've read, on Linux,
>     if symbolic links to a file exist anywhere, those links
>     will still remain valid even if the main file entry is
>     "unlinked" from its directory. Perhaps something like
>     "system rm $filename" or even `rm $filename` might be
>     a better way, if I want to actually get rid of a file
>     completely (and invalidate any links to it)?

Firstly, if your dedup program operates on a given directory and its
descendants, in other words a tree, it may end up invaliding symbolic links
which are outside of that tree, which point into the tree.

You have to ask yourself whether you care about those. If so, then you have
to traverse the entire filesystem to find the freaking things.

If you want to identify dangling symlinks, then you have to resolve them to
paths. It probably behooves you to resolve them to canonical, absolute paths.
Canonical means that a path doesn't contain any components that are symlinks;
it is fully resolved.

Whenever you script deletes a file, it should also  form its canonical path, and look
that up in its database of symlinks to answer the question, "am I
deleting the target of one or more symlink?"

If so, that symlink can be erased also.

Keeping in mind that a symlink is an object that can vbe the target of
a symlink: so you have to recursively ask that question about the symlink
you are deleting. (The alternative is to, upfront, resolve all symlinks that
point to symlinks down to the canonical paths of what they ultimately refer to.
Then if A is a symlink to B, and B is a symlink to C, then you simply record
that as A and B both being symlinks to C. If C is deleted, then A and B are
identified as dangling (without any recursion) and blown away. Nothing points
to A and B.)


Basically, cleaning up symlinks can be regarded as a specialized job that
you might want to implement in a separate utility.

There are existing utilities for hunting down broken symlinks, so there is no
need to write such a thing.

Then again, there are existing utilities for hunting down duplicate files.


------------------------------

Date: Tue, 09 Dec 2014 07:51:50 +0100
From: gamo <gamo@telecable.es>
Subject: Re: Need a Mac Script to search and print lines in all text files
Message-Id: <m66665$qit$1@speranza.aioe.org>

El 09/12/14 a las 02:35, Keith Keller escribió:
>> No one has a ready made method? Really?
>> >I thought this would be a commonly used function.

> The File::Find Perl module is the commonly used library a Perl
> programmer would use.  But it seems like you're not using Perl.
>
> There are newsgroups for OS X where your question would be less
> inappropriate.

Yeah. Ask Siri.

-- 
http://www.telecable.es/personales/gamo/
The generation of random numbers is too important to be left to chance


------------------------------

Date: Tue, 9 Dec 2014 09:32:50 +0000
From: Justin C <justin.1410@purestblue.com>
Subject: Re: Need a Mac Script to search and print lines in all text files
Message-Id: <27silb-mk1.ln1@zem.masonsmusic.co.uk>

On 2014-12-08, Chris Roberts <thecjguy1@gmail.com> wrote:
>
> Jue, Mac defaults to using bash.
> The Google search you gave just points basically to a perl-find MAN page, so not helpful and I don't see anything in there either Perl or Glob that can search both recursively and *.txt files for a string.
>
> This is so much easier in Windows...
> It is unbelievable that no one has an simple explanation for something that is so easy to do within DOS.
> If I do egrep -i "jun.*5 .*info" /syslog/syslog.log

That is about the funniest thing I've seen this week.

You're on a Mac. It's running, mostly, a unix operating system,
it has all of unix's fun and cool toys. Do you know where grep
and it's siblings originated? I'll give you a hint, it is not
a Windows/DOS program, it had to be *ported* to run there.

Have a look at what tools are in your toolbox, and don't jump
for the CNC machine when all you need is a screwdriver, and, oh
look, it's right there.


   Justin.

-- 
Justin C, by the sea.


------------------------------

Date: Tue, 09 Dec 2014 21:25:21 -0500
From: Kevin Walzer <kw@codebykevin.com>
Subject: Re: Need a Mac Script to search and print lines in all text files
Message-Id: <m68au0$igo$1@dont-email.me>

On 12/7/14, 6:39 PM, Chris Roberts wrote:
> egrep -i -r 'My string /Users/crzzy1/*.txt' ;

On my Mac, running this command:

which egrep

returns this output:

/usr/bin/egrep

So, just run this command:

egrep -i -r 'My string /Users/crzzy1/*.txt

and that should work for you. Under the hood OS X is a certified Unix 
and has all the customary Unix command-line tools.

I must concur with those who question what this thread has to do with 
Perl, however.

--Kevin
-- 
Kevin Walzer
Code by Kevin/Mobile Code by Kevin
http://www.codebykevin.com
http://www.wtmobilesoftware.com


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 4326
***************************************


home help back first fref pref prev next nref lref last post