[32152] in Perl-Users-Digest


Perl-Users Digest, Issue: 3417 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Fri Jun 17 00:09:25 2011

Date: Thu, 16 Jun 2011 21:09:06 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 16 Jun 2011     Volume: 11 Number: 3417

Today's topics:
    Re: Catastrophic regexp performance with grouping. <xhoster@gmail.com>
    Re: Catastrophic regexp performance with grouping. <rvtol+usenet@xs4all.nl>
    Re: Catastrophic regexp performance with grouping. <arjenbax@googlemail.com>
    Re: Catastrophic regexp performance with grouping. <rvtol+usenet@xs4all.nl>
    Re: Catastrophic regexp performance with grouping. sln@netherlands.com
    Re: edit_file and edit_file_lines <Uno@example.invalid>
    Re: edit_file and edit_file_lines <tzz@lifelogs.com>
        Howto pass the arguments to the procedure in the easier <scottie383@gmail.com>
    Re: Howto pass the arguments to the procedure in the ea <jimsgibson@gmail.com>
    Re: Howto pass the arguments to the procedure in the ea <uri@StemSystems.com>
    Re: Howto pass the arguments to the procedure in the ea <tzz@lifelogs.com>
    Re: Regex Matching <uri@StemSystems.com>
    Re: Regex Matching sln@netherlands.com
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Wed, 15 Jun 2011 18:12:04 -0700
From: Xho Jingleheimerschmidt <xhoster@gmail.com>
Subject: Re: Catastrophic regexp performance with grouping.
Message-Id: <4df955fb$0$20836$ed362ca5@nr5-q3a.newsreader.com>

alf wrote:
> Greetings.
> 
> Consider the following code (5.8.8, perl -V follows):
> my $strf;
> {
> local $/=undef;
> $strf=<$ARGV[0]>; 
> }
> $strf =~ s/(?:\s*\r?\n?)*\cL/\cL/gx;

Since \r and \n are included in \s, the first part of the regex
is equivalent to (?:\s*)*, which in turn is equivalent to (?:\s*)

> # $strf =~ s/(\s*\r?\n?)*\cL/\cL/g;
> 
> When the second (commented) form of the regexp is used, it becomes about 300 times slower (less than 1 sec vs. 4 minutes on a 36894-line file). Now I know that grouping (which is not even needed in the above) comes at a price, but this seems pathological...ridiculous even.

Sure.  Don't use pathological regular expressions.

Xho


------------------------------

Date: Thu, 16 Jun 2011 10:12:49 +0200
From: "Dr.Ruud" <rvtol+usenet@xs4all.nl>
Subject: Re: Catastrophic regexp performance with grouping.
Message-Id: <4df9bb01$0$49038$e4fe514c@news.xs4all.nl>

On 2011-06-16 03:12, Xho Jingleheimerschmidt wrote:
> alf wrote:

>> $strf =~ s/(?:\s*\r?\n?)*\cL/\cL/gx;
>
> Since \r and \n are included in \s, the first part of the regex
> is equivalent to (?:\s*)*, which in turn is equivalent to (?:\s*)

Which can also be written as \s*.
:)

-- 
Ruud
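
A small check of the simplification discussed above, offered as a sketch only. The sample string is made up, and the script compares results on that one input; it does not measure speed.

```perl
use strict;
use warnings;

# Made-up sample: trailing whitespace and line endings before form feeds (\cL).
my $sample = "invoice 1 \r\n\n\cLinvoice 2\t\n\cLinvoice 3";

# Apply one substitution to a copy of the sample and return the result.
sub squeeze {
    my ($regex) = @_;
    ( my $copy = $sample ) =~ s/$regex/\cL/g;
    return $copy;
}

my ( $original, $simplified );
{
    # Perl may warn that the group can match the empty string many times,
    # which is itself a hint that the pattern is redundant.
    no warnings 'regexp';
    $original = squeeze(qr/(?:\s*\r?\n?)*\cL/);    # pattern from the first post
}
$simplified = squeeze(qr/\s*\cL/);                 # the plain \s* form

print $original eq $simplified ? "same result\n" : "different result\n";
```

On this input both forms strip the whitespace in front of each form feed and leave the rest alone.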


------------------------------

Date: Thu, 16 Jun 2011 05:08:34 -0700 (PDT)
From: ilovelinux <arjenbax@googlemail.com>
Subject: Re: Catastrophic regexp performance with grouping.
Message-Id: <076b94e0-6801-42bc-bd19-dcb663077fe8@q14g2000prh.googlegroups.com>

On 15 jun, 18:30, alf <alessandro.forghi...@gmail.com> wrote:

> I see what you are saying. The data itself are \cL (<FF>) separated invoices

It seems that you want this:

$strf =~ s/\s+\f/\f/g;

(remove all whitespace in front of form feed)


------------------------------

Date: Thu, 16 Jun 2011 14:18:47 +0200
From: "Dr.Ruud" <rvtol+usenet@xs4all.nl>
Subject: Re: Catastrophic regexp performance with grouping.
Message-Id: <4df9f4a7$0$49180$e4fe514c@news.xs4all.nl>

On 2011-06-16 14:08, ilovelinux wrote:
> On 15 jun, 18:30, alf<alessandro.forghi...@gmail.com>  wrote:

>> I see what you are saying. The data itself are \cL (<FF>) separated invoices
>
> It seems that you want this:
>
> $strf =~ s/\s+\f/\f/g;
>
> (remove all whitespace in front of form feed)

That would also remove empty pages.

Non-eager version: s/\s+?\f/\f/g

-- 
Ruud


------------------------------

Date: Thu, 16 Jun 2011 07:57:15 -0700
From: sln@netherlands.com
Subject: Re: Catastrophic regexp performance with grouping.
Message-Id: <k86kv6licntsg1j37qgc38ui7v6g9vlsch@4ax.com>

On Thu, 16 Jun 2011 14:18:47 +0200, "Dr.Ruud" <rvtol+usenet@xs4all.nl> wrote:

>On 2011-06-16 14:08, ilovelinux wrote:
>> On 15 jun, 18:30, alf<alessandro.forghi...@gmail.com>  wrote:
>
>>> I see what you are saying. The data itself are \cL (<FF>) separated invoices
>>
>> It seems that you want this:
>>
>> $strf =~ s/\s+\f/\f/g;
>>
>> (remove all whitespace in front of form feed)
>
>That would also remove empty pages.
>
>Non-eager version: s/\s+?\f/\f/g

Eager version that won't remove empty pages: 

  s/[^\S\f]+\f/\f/g

-sln
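
A quick way to compare the three substitutions from this subthread, as a sketch on a made-up input that contains one empty page (two adjacent form feeds). Keep in mind that \s itself matches \f.

```perl
use strict;
use warnings;

# Made-up input: page A, an empty page, page B with trailing blanks, page C.
my $pages = "A\f\fB \n\fC";

( my $greedy = $pages ) =~ s/\s+\f/\f/g;         # ilovelinux
( my $lazy   = $pages ) =~ s/\s+?\f/\f/g;        # Dr.Ruud
( my $class  = $pages ) =~ s/[^\S\f]+\f/\f/g;    # sln

# Count the form feeds left in each result; the input has three.
for ( [ greedy => $greedy ], [ lazy => $lazy ], [ class => $class ] ) {
    my ( $name, $text ) = @$_;
    my $count = () = $text =~ /\f/g;
    print "$name: $count form feeds\n";
}
```

On this particular input the two \s versions each leave two form feeds, because \s+ (greedy or not) can consume the first of two adjacent form feeds. The character-class version leaves all three, since [^\S\f] is "whitespace other than form feed".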


------------------------------

Date: Thu, 16 Jun 2011 04:19:09 -0600
From: Uno <Uno@example.invalid>
Subject: Re: edit_file and edit_file_lines
Message-Id: <95u3ksFm6hU1@mid.individual.net>

On 05/14/2011 11:32 PM, Uri Guttman wrote:
>
> Have you ever wanted to use perl -pi inside perl? did you have the guts
> to localize $^I and @ARGV to do that? now you can do that with a simple
> call to edit_file or edit_file_lines in the new .018 release of
> File::Slurp. Now you can modify a file in place with a simple call.
>
> edit_file reads a whole file into $_, calls its code block argument and
> writes $_ back out the file. These groups are equivalent operations:
>
> 	perl -0777 -pi -e 's/foo/bar/g' filename
>
> 	use File::Slurp qw( edit_file ) ;
>
> 	edit_file { s/foo/bar/g } 'filename' ;
>
> 	edit_file sub { s/foo/bar/g }, 'filename' ;
>
> 	edit_file \&replace_foo, 'filename' ;
> 	sub replace_foo { s/foo/bar/g }
>
> edit_file_lines reads a whole file and puts each line into $_, calls its
> code block argument and writes each $_ back out the file. These groups are
> equivalent operations:
>
> 	perl -pi -e '$_ = "" if /foo/' filename
>
> 	use File::Slurp qw( edit_file_lines ) ;
>
> 	edit_file_lines { $_ = '' if /foo/ } 'filename' ;
>
> 	edit_file_lines sub { $_ = '' if /foo/ }, 'filename' ;
>
> 	edit_file_lines \&delete_foo, 'filename' ;
> 	sub delete_foo { $_ = '' if /foo/ }
>
> So now when someone asks for a simple way to modify a file from inside
> Perl, you have an easy answer to give them.
>
>

Is this a commercial?
-- 
Uno
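
For comparison, the "localize $^I and @ARGV" approach mentioned in the quoted announcement might look roughly like the sketch below. The helper name edit_lines_in_place is hypothetical, and this is not how File::Slurp implements edit_file_lines; it only uses core Perl. An empty $^I means "no backup copy", which may not be supported on every platform.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code on each line of $file with the line in $_,
# then write whatever is left in $_ back to the same file.
sub edit_lines_in_place {
    my ( $file, $code ) = @_;
    local $_;                                # keep the caller's $_ intact
    local ( $^I, @ARGV ) = ( '', $file );    # '' = edit in place, no backup
    while (<>) {
        $code->();
        print;                               # goes to the file, not STDOUT
    }
    return;
}
```

A call such as edit_lines_in_place( 'filename', sub { $_ = '' if /foo/ } ) would then correspond to the perl -pi one-liner in the announcement.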


------------------------------

Date: Thu, 16 Jun 2011 09:09:41 -0500
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: edit_file and edit_file_lines
Message-Id: <87y611vpei.fsf@lifelogs.com>

On Thu, 16 Jun 2011 04:19:09 -0600 Uno <Uno@example.invalid> wrote: 

U> On 05/14/2011 11:32 PM, Uri Guttman wrote:
>> 
>> Have you ever wanted to use perl -pi inside perl? did you have the guts
>> to localize $^I and @ARGV to do that? now you can do that with a simple
>> call to edit_file or edit_file_lines in the new .018 release of
>> File::Slurp. Now you can modify a file in place with a simple call.

U> Is this a commercial?
U> -- 
U> Uno

Hey, *you* are the one advertising for a certain pizza chain!

Ted


------------------------------

Date: Thu, 16 Jun 2011 12:45:32 -0700 (PDT)
From: Scottie <scottie383@gmail.com>
Subject: Howto pass the arguments to the procedure in the easier way
Message-Id: <06561689-3c9d-41d5-8e47-70bcce696e7d@e35g2000yqc.googlegroups.com>

Hi All!

I have a procedure whose task is to count checksums for files that are
an input argument to the subroutine. The argument can be:
- All files ('*') in a directory ($DIR).
- Files with the extensions, for example: '*.foo *.bar *.pl *.pm'
- Files with names beginning, for example: 'FOO* bar*'

--------------8<--------------
sub md5files {
   use File::Basename;

   my $dir      = $_[0];           # where the files are (must be a direct path from root /)
   my $what     = $_[1] || '*';    # what files, default = '*' (all files)
   my $md5file  = "$dir/checksums.md5";
   my %hash_md5;

   my @t = split(/ /, $what);

   open (MD5OUT, '>', $md5file ) or die "Err: $!";

   foreach my $f (@t) {
      foreach my $file2md5 (<$dir/$f>) {
         next if "$file2md5" eq "$md5file";
         my $md5 = &md5calc($file2md5);
         if($md5 ne "") {
            $file2md5 =~ s/$dir\///;
            print MD5OUT $md5 . "  " . $file2md5 . "\n";
            $hash_md5{$md5} = $file2md5;
         } else {
            return "";
         }
      }
   }
   close MD5OUT;
   return %hash_md5;
}
--------------8<--------------
Above sub can be called as follows:

my $DIR = '/tmp';
my %md5hash = &md5files($DIR);
my %md5hash = &md5files($DIR,'FOO* bar*');
my %md5hash = &md5files($DIR,'*.foo *.bar *.pl *.pm');


Is it possible to simplify my procedure? Could it be more perlish?


Thanks in advance.

Best regards,
--
Scottie


------------------------------

Date: Thu, 16 Jun 2011 16:26:53 -0700
From: Jim Gibson <jimsgibson@gmail.com>
Subject: Re: Howto pass the arguments to the procedure in the easier way
Message-Id: <160620111626536558%jimsgibson@gmail.com>

In article
<06561689-3c9d-41d5-8e47-70bcce696e7d@e35g2000yqc.googlegroups.com>,
Scottie <scottie383@gmail.com> wrote:

> Hi All!
> 
> I have a procedure whose task is to count checksums for files that are
> an input argument to the subroutine. The argument can be:
> - All files ('*') in a directory ($DIR).
> - Files with the extensions, for example: '*.foo *.bar *.pl *.pm'
> - Files with names beginning, for example: 'FOO* bar*'
> 
> --------------8<--------------
> sub md5files {
>    use File::Basename;

I don't see where you use this module, although you certainly could.

> 
>    my $dir      = $_[0];           # where the files are (must be a direct path from root /)
>    my $what     = $_[1] || '*';    # what files, default = '*' (all files)

You can also do this:
     my( $dir, $what ) = @_;

but then you will have to do something like one of these:

     $what ||= '*';
     $what //= '*';
     $what = ( defined $what ? $what : '*' );

>    my $md5file  = "$dir/checksums.md5";
>    my %hash_md5;
> 
>    my @t = split(/ /, $what);
> 
>    open (MD5OUT, '>', $md5file ) or die "Err: $!";

Lexical file handles are better these days:

     open( my $md5out, '>', $md5file ) ...

> 
>    foreach my $f (@t) {
>       foreach my $file2md5 (<$dir/$f>) {
>          next if "$file2md5" eq "$md5file";

No need to double-quote variables:

           next if $file2md5 eq $md5file;

>          my $md5 = &md5calc($file2md5);

No need to use ampersands to call subroutines:

           my $md5 = md5calc($file2md5);

>          if($md5 ne "") {
>             $file2md5 =~ s/$dir\///;

You could use File::Basename methods here.

>             print MD5OUT $md5 . "  " . $file2md5 . "\n";
>             $hash_md5{$md5} = $file2md5;

Is that what you want? Or should it be '$hash_md5{$file2md5} = $md5'?
If you are checksumming the contents of the files, then two files with
identical contents will have identical checksums, and you will
overwrite the first file path with the second one. There is also some
very small probability that the checksums for two different files will be the same.
File paths, on the other hand, must be unique within a file system.

>          } else {
>             return "";

Can this ever really happen? Do you really want to return a single
scalar if it does? Under 'use warnings', you will get a warning message
about 'Odd number of elements in hash assignment'. Maybe you should
just die if it does, or just do 'return', or print a warning and keep
going, or just ignore this case and write a blank checksum to the file.

>          }
>       }
>    }
>    close MD5OUT;

You should check errors when you close an output file.

     close($md5out) or die("Error writing $md5file: $!");

>    return %hash_md5;

You might want to return a reference to the hash here to avoid copying
a large hash, but that is up to you.

      return \%hash_md5;

> }
> --------------8<--------------
> Above sub can be called as follows:
> 
> my $DIR = '/tmp';
> my %md5hash = &md5files($DIR);
> my %md5hash = &md5files($DIR,'FOO* bar*');
> my %md5hash = &md5files($DIR,'*.foo *.bar *.pl *.pm');

Don't need those ampersands.

You might want to avoid checksumming a file twice if the caller passes
redundant patterns, such as 'FO* FOO*'. You can add this statement:

        next if $hash_md5{$file2md5};

if you have inverted your hash as suggested above. Otherwise, you can
use a separate hash to keep track of files already processed:

        my %seen;
        ...
        next if $seen{$file2md5};
        ...
        $seen{$file2md5} = 1;

> Is it possible to simplify my procedure? Could it be more perlish?

Simple is good. "Perlish"? Who cares.

-- 
Jim Gibson
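
Pulling the suggestions above together, one possible shape for the subroutine is sketched below. The original md5calc is not shown in the thread, so the version here is a hypothetical stand-in built on the core Digest::MD5 module. Keying the hash by file name, dying on a failed checksum, and returning a reference are just one choice among the options listed above.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Basename qw(basename);

# Hypothetical stand-in for the poster's md5calc(): hex MD5 of a file,
# or '' if the file cannot be opened.
sub md5calc {
    my ($path) = @_;
    open( my $fh, '<', $path ) or return '';
    binmode $fh;
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $md5;
}

sub md5files {
    my ( $dir, $what ) = @_;
    $what = '*' unless defined $what && length $what;

    my $md5file = "$dir/checksums.md5";
    my %md5_for;    # file name => checksum (the "inverted" hash)

    open( my $md5out, '>', $md5file ) or die "Cannot open $md5file: $!";

    for my $pattern ( split ' ', $what ) {
        # glob() splits on whitespace, so $dir is assumed to contain none.
        for my $path ( glob("$dir/$pattern") ) {
            next if $path eq $md5file;
            next unless -f $path;
            my $name = basename($path);
            next if exists $md5_for{$name};    # redundant patterns, e.g. 'FO* FOO*'
            my $md5 = md5calc($path);
            die "Cannot checksum $path\n" if $md5 eq '';
            print {$md5out} "$md5  $name\n";
            $md5_for{$name} = $md5;
        }
    }

    close($md5out) or die "Error writing $md5file: $!";
    return \%md5_for;
}
```

A caller would then write something like my $md5_for = md5files( '/tmp', '*.pl *.pm' ); and look up $md5_for->{'foo.pl'}.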


------------------------------

Date: Thu, 16 Jun 2011 20:46:11 -0400
From: "Uri Guttman" <uri@StemSystems.com>
Subject: Re: Howto pass the arguments to the procedure in the easier way
Message-Id: <87lix1uvxo.fsf@quad.sysarch.com>

>>>>> "JG" == Jim Gibson <jimsgibson@gmail.com> writes:

  JG>      $what = ( defined $what ? $what : '*' );

that is the same as this which is cleaner:

	$what = '*' unless defined $what ;

  >> Is it possible to simplify my procedure? Could it be more perlish?

  JG> Simple is good. "Perlish"? Who cares.

simpler perl is even better! :)

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------
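
The defaulting styles in this subthread differ only for false-but-defined values such as '' and '0'. A small sketch (the helper name defaults_for is made up) that makes the difference visible; the //= operator behaves like the defined check but needs Perl 5.10 or later, so it is left out to keep the script runnable on 5.8.

```perl
use strict;
use warnings;

# Return what each defaulting style leaves behind for a given input.
sub defaults_for {
    my ($input) = @_;
    my ( $or, $unless ) = ( $input, $input );
    $or ||= '*';                             # replaces any false value
    $unless = '*' unless defined $unless;    # replaces only undef
    return ( $or, $unless );
}

for my $input ( undef, '', '0', '*.pl' ) {
    my ( $or, $unless ) = defaults_for($input);
    printf "%-7s ||= gives %-6s defined check gives %s\n",
        defined $input ? "'$input'" : 'undef', "'$or'", "'$unless'";
}
```

For file patterns the distinction rarely matters, but a file literally named 0 would be replaced by '*' under ||=.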


------------------------------

Date: Thu, 16 Jun 2011 20:21:52 -0500
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: Howto pass the arguments to the procedure in the easier way
Message-Id: <87hb7p45hr.fsf@lifelogs.com>

On Thu, 16 Jun 2011 12:45:32 -0700 (PDT) Scottie <scottie383@gmail.com> wrote: 

S> I have a procedure whose task is to count checksums for files that are
S> an input argument to the subroutine. The argument can be:
S> - All files ('*') in a directory ($DIR).
S> - Files with the extensions, for example: '*.foo *.bar *.pl *.pm'
S> - Files with names beginning, for example: 'FOO* bar*'
 ...
S> my $DIR = '/tmp';
S> my %md5hash = &md5files($DIR);
S> my %md5hash = &md5files($DIR,'FOO* bar*');
S> my %md5hash = &md5files($DIR,'*.foo *.bar *.pl *.pm');

S> Is it possible to simplify my procedure? Could it be more perlish?

I would use glob() on the passed filenames.  There's no need to make
$DIR the first parameter.  You should allow more than one directory.
So, either accept parameters like this:

md5files({ "dir1" => "pattern1", "dir2" => [ "pattern2", "pattern3" ] })

or simply (asking the caller to call glob() appropriately)

md5files("/dir1/file1", "/dir2/file2", "/dir3/file3")

See `perldoc -f glob' for details on using glob().  The first usage
seems closer to what you need, since you're checking checksums per
directory.  To accept parameters like that, you'd do

# untested!
sub mine
{
 my $params = shift;

 die "Unexpected parameters: " . ref($params)
  unless ref $params eq 'HASH';

 foreach my $dir (sort keys %$params)
 {
  my $patterns = $params->{$dir};
  $patterns = [ $patterns ] unless ref $patterns eq 'ARRAY';
  foreach my $pattern (@$patterns)
  {
   foreach my $file (glob("$dir/$pattern"))
   {
    # do something in $dir for each $file
   }
  }
 }
}

I would not separate the file patterns with spaces.  If you have a list,
store it as a list.  The built-in Perl glob() actually accepts spaces as
you have it, but I would still use a list.

Ted
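
A runnable variant of the untested skeleton above, as a sketch: the name files_by_dir and the returned structure are made up for illustration, and the "do something" step is reduced to collecting the matching plain files.

```perl
use strict;
use warnings;

# Hypothetical helper: takes { dir => pattern } or { dir => [ patterns ] }
# and returns { dir => [ matching plain files, each listed once ] }.
sub files_by_dir {
    my ($params) = @_;
    die "Expected a hash reference\n" unless ref $params eq 'HASH';

    my %found;
    for my $dir ( sort keys %$params ) {
        my $patterns = $params->{$dir};
        $patterns = [$patterns] unless ref $patterns eq 'ARRAY';

        $found{$dir} = [];
        my %seen;
        for my $pattern (@$patterns) {
            # glob() splits its argument on whitespace, so this assumes
            # neither $dir nor $pattern contains spaces.
            push @{ $found{$dir} },
                grep { -f $_ && !$seen{$_}++ } glob("$dir/$pattern");
        }
    }
    return \%found;
}
```

Usage might look like my %params = ( '/tmp' => [ '*.pl', '*.pm' ] ); my $found = files_by_dir( \%params ); after which $found->{'/tmp'} holds the list of files.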


------------------------------

Date: Wed, 15 Jun 2011 22:11:04 -0400
From: "Uri Guttman" <uri@StemSystems.com>
Subject: Re: Regex Matching
Message-Id: <87aadi7cg7.fsf@quad.sysarch.com>

>>>>> "D" == DanielC  <dnlchen@gmail.com> writes:

  D> On Jun 15, 1:24 pm, "Uri Guttman" <u...@StemSystems.com> wrote:
  >> 
  >> it helps if you are explicit about 'this'. do you mean ternary operator?
  >> do you mean (?:) in regexes? or was is @array in scalar context? what do
  >> you mean by "don't use reference very much"? the word "reference" wasn't
  >> mentioned in the quoted post (backreferences was).
  >> 
  >> and if they are perl devs and don't use any of those features, they
  >> aren't very good perl devs. those are basic useful things that should be
  >> in the toolbox of any decent perl developer.

  D> Sorry, I'm not a Perl Dev and I just use Perl like C. I meant "(?:) in
  D> regexes" and "reference".

it still makes no sense as reference. ?: has nothing to do with
references. it has something to do with backreferences in that it doesn't
do them like regular grabbing does.

and you should learn to use perl like perl. c in perl is almost always
very bad coding. in particular indexing arrays is not used nearly as
much in perl as in c and c style index loops are also rarely needed in
perl.

finally, please learn to edit quoted email. i don't need to see my
signature and all the previously quoted stuff again.

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------


------------------------------

Date: Thu, 16 Jun 2011 09:18:52 -0700
From: sln@netherlands.com
Subject: Re: Regex Matching
Message-Id: <tl9kv613m5m4jnmpaptsrpo5tjli2qoj9g@4ax.com>

On Wed, 15 Jun 2011 13:56:50 -0700 (PDT), DanielC <dnlchen@gmail.com> wrote:

>Sorry, I'm not a Perl Dev and I just use Perl like C. I meant "(?:) in
>regexes" and "reference".
>

The question mark is tightly bound to a preceding opening parenthesis
when parsing a regular expression.

In modern day regex engines, this is the syntax used to define language
extensions. It's fairly standardized in a variety of programming languages using
external libraries like PCRE (Perl Compatible Regular Expressions).

The extensions below are becoming more of the standard where now there are
extensions to the extensions enabling things like recursion.
Microsoft has taken extensions to a new level with recursion, and grouping
that allows variable capture buffer context like /START (a\d+b)+ END/
creating a collection output.

Here is a little cheat sheet of extension syntax.
(From perlre docs)
-

Extended Patterns

Extension syntax - 
   The syntax is a pair of parentheses with a question mark
   as the first thing within the parentheses.
   The character after the question mark indicates the extension.

Some Perl/PCRE extensions:
-----------------------------
   (?:pattern)
     ^ 
       This is for clustering, not capturing; it groups subexpressions
       like "()", but doesn't make backreferences as "()" does.

   (?|pattern) 
     ^
       This is the "branch reset" pattern.

   (?=pattern) 
     ^
       A zero-width positive look-ahead assertion. 

   (?!pattern)
     ^
       A zero-width negative look-ahead assertion.

   (?<=pattern)
     ^^
       A zero-width positive look-behind assertion.

   (?'NAME'pattern)
   (?<NAME>pattern)
     ^ .. ^
       A named capture buffer.

   \k<NAME> 
   \k'NAME' 
       Named backreference.

   (?>pattern) 
     ^
       An "independent" subexpression.

   (?(condition)yes-pattern|no-pattern) 
     ^  ..     ^           ^?
       Conditional expression.

Some Perl specific extensions:
--------------------------------
   (?#text) 
   (?pimsx-imsx) 
   (?imsx-imsx:pattern) 
   (??{ code })
   (?{ code })
   (?PARNO)
   (?-PARNO)
   (?+PARNO)
   (?R)
   (?0) 
   (?&NAME) 
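
A few of the extensions from the cheat sheet in action, as a small sketch. The input string and variable names are made up, and the named captures assume Perl 5.10 or later.

```perl
use strict;
use warnings;

my $line = 'width=42';

# (?:...) groups without capturing, so the number lands in $1.
my ($value) = $line =~ /(?:width|height)=(\d+)/;

# (?<NAME>...) captures by name; results are read from the %+ hash.
my %pair;
if ( $line =~ /(?<key>\w+)=(?<val>\d+)/ ) {
    %pair = ( key => $+{key}, val => $+{val} );
}

# (?=...) and (?!...) test what follows without consuming it.
my ($before_eq) = $line =~ /(\w+)(?==)/;
my $not_height = $line =~ /^(?!height)/ ? 1 : 0;

# (?<=...) tests what precedes the current position.
my ($after_eq) = $line =~ /(?<==)(\d+)/;

print "value=$value key=$pair{key} val=$pair{val} ",
    "before=$before_eq after=$after_eq not_height=$not_height\n";
```

Because the first group is non-capturing, the list assignment to $value receives only the digits; with plain parentheses it would receive 'width' instead.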




------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3417
***************************************

