[25422] in Perl-Users-Digest

home help back first fref pref prev next nref lref last post

Perl-Users Digest, Issue: 7667 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Jan 18 11:05:52 2005

Date: Tue, 18 Jan 2005 08:05:23 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Tue, 18 Jan 2005     Volume: 10 Number: 7667

Today's topics:
    Re: "Found Perl" <news@chaos-net.de>
    Re: "Found Perl" <wyzelli@yahoo.com>
    Re: "Found Perl" <flavell@ph.gla.ac.uk>
    Re: [newbie] chomp acting weird (or me not understandin <hendrik_maryns@despammed.com>
    Re: [perl-python] 20050118 keyed list (Jay Tilton)
    Re: [perl-python] 20050118 keyed list <jurgenex@hotmail.com>
    Re: CGI: Execute a perl script inside another perl scri <xdarcos@hotmail.com>
    Re: CGI: Execute a perl script inside another perl scri <noreply@gunnar.cc>
    Re: CGI: Execute a perl script inside another perl scri <xdarcos@hotmail.com>
    Re: CGI: Execute a perl script inside another perl scri <noreply@gunnar.cc>
        How bad is $'? (Was: "Get substring of line") (J. Romano)
    Re: How to compare files? <do-not-use@invalid.net>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: 18 Jan 2005 11:16:01 GMT
From: Martin Kissner <news@chaos-net.de>
Subject: Re: "Found Perl"
Message-Id: <slrncuprvh.nss.news@maki.homeunix.net>

brian d foy wrote :
>
> There is a lot of stuff I'd like to find, and even more stuff I probably
> don't know about. If you have something, please send me an image or scan
> at found@theperlreview.com[3].
>
>    * a picture of Amelia, the camel mascot of London.pm
>    * Nat's "Perl is my bitch" sticker set (What were the other slogans?
>    * An original blow-in card for The Perl Journal
>    * Pictures of other Perl t-shirts, especially the one given out at
>        the first couple of Perl conferences
>    * The O'Reilly beret given out at one of the Perl conferences
>    * Scans of signatures from various Perl people (I don't have
>        any myself)
>    * Tim Bunce's handwriting on a bar napkin saying "Generic database
>        interface: use GDI or something" :)
>    * Images of swag (keychains, water bottles, pens) that Perl
>        vendors gave out.
>    * Instances of the string "Perl" in everyday life. I remember seeing
>        an image of a European road sign pointing to "Perl". I think it
>        was German, but I can't find it.

Maybe you would be interested in a map with Perl on it.

	http://www.perl-mosel.de/rathaus/bilder/karte.gif
and
	http://www.stadtplan.net/index.asp?direct=brd/saarland/perl/home.html

Or the coat of arms of Perl.

	http://www.perl-mosel.de/bilder/perl_index2.jpg

HTH

-- 
Epur Si Muove (Gallileo Gallilei)


------------------------------

Date: Tue, 18 Jan 2005 11:33:52 GMT
From: "Peter Wyzl" <wyzelli@yahoo.com>
Subject: Re: "Found Perl"
Message-Id: <Au6Hd.123275$K7.8292@news-server.bigpond.net.au>

"Martin Kissner" <news@chaos-net.de> wrote in message 
news:slrncuprvh.nss.news@maki.homeunix.net...
:
: Maybe you would be interested in a map with Perl on it.
:
: http://www.perl-mosel.de/rathaus/bilder/karte.gif

Is there some significance to the fact that perl is right next to Apach?

:)

P
-- 
print "Just another Perl Hacker";




------------------------------

Date: Tue, 18 Jan 2005 11:37:17 +0000
From: "Alan J. Flavell" <flavell@ph.gla.ac.uk>
Subject: Re: "Found Perl"
Message-Id: <Pine.LNX.4.61.0501181131590.1547@ppepc56.ph.gla.ac.uk>

On Tue, 18 Jan 2005, Martin Kissner wrote:

> Maybe you would be interested in a map with Perl on it.
> 
> 	http://www.perl-mosel.de/rathaus/bilder/karte.gif

Hey, I hadn't realised that it was just across the border from Apach !

They go well together  ;-)

Google even does a triple whammy:

http://www.levkowetz.com/lbw/2001/pic.html?20010829-wed--perl-apach-borg/img_0090.jpg

Even better:
http://www.levkowetz.com/lbw/2001/pic.html?20010829-wed--perl-apach-borg/img_0124.jpg



------------------------------

Date: Tue, 18 Jan 2005 13:02:26 +0100
From: Hendrik Maryns <hendrik_maryns@despammed.com>
Subject: Re: [newbie] chomp acting weird (or me not understanding how it works??)
Message-Id: <46idnc6QQdRMZ3HcRVnyjg@scarlet.biz>

Michele Dondi schreef:
> On Mon, 17 Jan 2005 16:53:03 +0100, Hendrik Maryns
> <hendrik_maryns@despammed.com> wrote:
> 
> 
>>>Please note that just as much as the similar code pasted in my
>>>previous post, this is meant to be a minimal working example. Unless
>>>you need only a quick hack in a realistic app you'd probably
>>>explicitly write the loop and add bells and whistles...
>>
>>I think I will, as I don't think my professor will believe I wrote this 
>>myself...
> 
> 
> So this was homework... 

Even more: part of the exam :-(

while generally hackers and experienced users
> spot it at a glance, I don't think it's so bad to post questions about
> it, provided that you point out so in order to allow people to reply
> taking it into account.

Yes, you're right I should've mentioned, but I did try myself first and 
was actually asking about something that didn't work.  I really didn't 
count on someone making it for me here.  This was only a tiny bit of it.

> Also, this depends on your professor's point of view, but I see
> nothing wrong a priori in asking for help say here. Just make sure you
> understand all of the suggestions you've been given and use this
> knowledge, and anything else you may learn in the meantime to cook up
> your own solution.

That's exactly what I'm doing!  (Not that I understand all the 
suggestions yet...)

Cheers, H.


------------------------------

Date: Tue, 18 Jan 2005 13:36:59 GMT
From: tiltonj@erols.com (Jay Tilton)
Subject: Re: [perl-python] 20050118 keyed list
Message-Id: <41ed0f4d.39606090@news.erols.com>

"Xah Lee" <xah@xahlee.org> wrote:

: # the syntax of keyed list in Perl is too complex
: # to be covered in a short message.

You've got to be joking.  You couldn't even muster enough skill to provide
Perl equivalents for the four simple Python statements you showed?

: # see "perldoc perldata" for an unix-styled course.

It's good to see your opinion on the _fantastic incompetent writting_[1] of
Perl's documentation has changed enough that you can recommend it to
novices.

[1]Message-ID: <7fe97cc4.0401242131.22acf485@posting.google.com>



------------------------------

Date: Tue, 18 Jan 2005 14:35:57 GMT
From: "Jürgen Exner" <jurgenex@hotmail.com>
Subject: Re: [perl-python] 20050118 keyed list
Message-Id: <h99Hd.14468$c%6.11893@trnddc03>

Xah Lee wrote:
> © # in perl, keyed-list is done like this:

Just FYI: those thingies are called hashes. The legacy name would be 
associative array.

> © %a = ('john',3, 'mary', 4, 'jane', 5, 'vicky',7);
> © use Data::Dumper qw(Dumper);
> © print Dumper \%a;

Wow, my compliments. The very first time that using Data::Dumper actually 
may do something useful (formats the data more nicely). Still, why you are 
passing a reference is beyond me.

> © # the syntax of keyed list in Perl is too complex
> © # to be covered in a short message.

Excuse me? If (using the same examples as for your Python section)
    print "Mary is $a{mary}";
    delete $a{vicky}; print %a;
    print keys(%a);
    exists $a{mary};
is too complex for your to cover, then I strongly suggest you stop posting 
your "explanations".

> © # see "perldoc perldata" for an unix-styled course.

Excuse me? Do you mind explaining where exactly perldata is "Unix-styled"?

jue 




------------------------------

Date: 18 Jan 2005 03:10:11 -0800
From: "xad" <xdarcos@hotmail.com>
Subject: Re: CGI: Execute a perl script inside another perl script
Message-Id: <1106046611.552159.26070@f14g2000cwb.googlegroups.com>

Hello,

>Suppose you mean the full path. That always works, of course, but you
>can also do:

>    use lib '.';
>    do 'y.pl';

Actually it does not work yet.

x.pl:
-----------------------------------------------------
#!C:/Perl/bin/perl.exe
use lib '.';
do 'y.pl';

print "Content-type:text/html\n\n";

print @INC;
print "|FOO=".$ENV{'FOO'};
-------------------------------------------------
With that code, I can see that @INC is well updated. Indeed, the
following output is printed out in my browser
 .C:/Perl/libC:/Perl/site/lib|FOO=
but the script y.pl is not executed.

If I replace
use lib '.';
by
use lib 'D:/cc_views/v3_4/Dev/bin'; (that corresponds to the location
of the 2 scripts),
@INC is updated with the new location:
D:/cc_views/v3_4/Dev/binC:/Perl/libC:/Perl/site/lib.|FOO=BAR
and the script y.pl is well executed (FOO has been set to BAR)

It seems that SunOne Web Server does not interpret the '.' location  ?
Xavier

PS: "require" command has the same behavior



------------------------------

Date: Tue, 18 Jan 2005 12:44:51 +0100
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: CGI: Execute a perl script inside another perl script
Message-Id: <354blgF4ievp6U1@individual.net>

xad wrote:
>> 
>>   use lib '.';
>>   do 'y.pl';
> 
> Actually it does not work yet.
> 
> x.pl:
> -----------------------------------------------------
> #!C:/Perl/bin/perl.exe
> use lib '.';
> do 'y.pl';
> 
> print "Content-type:text/html\n\n";
> 
> print @INC;
> print "|FOO=".$ENV{'FOO'};
> -------------------------------------------------
> With that code, I can see that @INC is well updated. Indeed, the
> following output is printed out in my browser
> .C:/Perl/libC:/Perl/site/lib|FOO=
> but the script y.pl is not executed.
> 
> If I replace
> use lib '.';
> by
> use lib 'D:/cc_views/v3_4/Dev/bin'; (that corresponds to the location
> of the 2 scripts),
> @INC is updated with the new location:
> D:/cc_views/v3_4/Dev/binC:/Perl/libC:/Perl/site/lib.|FOO=BAR
> and the script y.pl is well executed (FOO has been set to BAR)
> 
> It seems that SunOne Web Server does not interpret the '.' location  ?

Maybe it doesn't acknowledge relative paths in @INC. Then, to prevent 
that the users need to edit the main script, you can try to manipulate 
@INC like this instead:

     use lib $0 =~ /(.+)[\\\/]/;

(Or you can use the FindBin module, even if it isn't mod_perl or taint 
safe.)

-- 
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl


------------------------------

Date: 18 Jan 2005 04:31:52 -0800
From: "xad" <xdarcos@hotmail.com>
Subject: Re: CGI: Execute a perl script inside another perl script
Message-Id: <1106051512.292199.208130@f14g2000cwb.googlegroups.com>

Thanks a lot Gunnar, both solutions work well.
Have a good day.

Xavier



------------------------------

Date: Tue, 18 Jan 2005 13:33:43 +0100
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: CGI: Execute a perl script inside another perl script
Message-Id: <354ehlF4is2njU1@individual.net>

Gunnar Hjalmarsson wrote:
>> It seems that SunOne Web Server does not interpret the '.' location
>> ?
> 
> Maybe it doesn't acknowledge relative paths in @INC.

Or more likely, since '.' represents the current working directory, the
working directory is some other directory but the one where the scripts
are located.

> Then, to prevent that the users need to edit the main script, you can
> try to manipulate @INC like this instead:
> 
>     use lib $0 =~ /(.+)[\\\/]/;
> 
> (Or you can use the FindBin module, even if it isn't mod_perl or
> taint safe.)

Optionally you can try:

     chdir '.';
     do 'y.pl' or die $!;

-- 
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl


------------------------------

Date: 18 Jan 2005 07:29:11 -0800
From: jl_post@hotmail.com (J. Romano)
Subject: How bad is $'? (Was: "Get substring of line")
Message-Id: <b893f5d4.0501180729.660f1cac@posting.google.com>

On Wednesday, January 5, 2005, Uri Guttman said:
>
> ... try to actually isolate the issue in a proper
> benchmark. spend some deep time in thought in how
> to do it. post your benchmark for review. then talk
> about $' with some confidence.


   Okay, I'll take you up on that offer.

   (But be warned!  Expect a "tome".  :)


First, allow me to define some terminology:

 * A "candidate line" is a pattern match that can be easily
   written to use $`, $&, or $'.  They don't have to use
   them, as they can be written with explicit grabs.  But
   nevertheless, they are pattern matches that could benefit
   from the use of the $MATCH variables.

 * A "taint line" is a line of code that uses a $MATCH
   variable (such as $`, $&, or $').  The reason that it is
   called a "taint line" is because it "taints" all regular
   expressions with a performance penalty, whether they use
   the match variables or not.  Note that in this post,
   the usage of the word "taint" has nothing to do with
   Perl's "taint mode", a mode that is turned on with the
   "-T" switch.


Also, allow me to clear up a common mistake.  The following
two lines are NOT functionally identical:

      $a = $1  if $string =~ m/=(.*)$/;
      $a = $'  if $string =~ m/=/;

This is because the '.' matches all characters EXCEPT a
newline.  To make the two lines identical, either put an "s"
modifier on the first line or add a chomp($a) call to the
second.  In other words, these two lines are functionally
equivalent:

      $a = $1  if $string =~ m/=(.*)$/s;
      $a = $'  if $string =~ m/=/;

as are these:

      $a = $1  if $string =~ m/=(.*)$/;
      $a = $'  if $string =~ m/=/;  chomp($a);


Now to the benchmarking.

The first benchmark program I came up with was this one:


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $count = 1e7;
my $prefix = q!$a = $'  if "a" =~ m/a/;!;
my $line = '$a = 1  if "abc=xyz" =~ m/=/;';
my $badProgram = "$prefix $line";

timethese($count, {good => $line});
timethese($count, {bad => $badProgram, prefix => $prefix});
__END__


The main program being tested is stored in the $line
variable:

   '$a = 1  if "abc=xyz" =~ m/=/;

All it does is check to see if the string "abc=xyz" has an
equal sign (which of course it does).  If it finds one, it
sets $a to 1.

The first call to timethese() runs this line ten million
times.  Note that this line is not tainted, since the $'
variable is never used until it is eval'ed in the second
call to timethese().

The second call to timethese() runs that line, together with
the $prefix code, which is just a "taint line".  I could
just run that "bad" program and compare its time to the "good"
program, but then the "bad" program would have twice as many
regular expressions to run, unfairly making it look like it
is much worse than it really is.

That is why I decided to run the $prefix line as well.
Therefore, its time can be subtracted from the "bad" code's
time to see how code identical to the "good" code (but
tainted) suffers from the performance penalty.

By running the code, I get the following results (note that
I formatted the lines (but did not change the results) in
the interest of making the results fit nicely in limited
space):

   good:   ( 3.50 usr +  0.00 sys =  3.50 CPU)
   bad:    (31.89 usr +  0.00 sys = 31.89 CPU)
   prefix: (18.25 usr +  0.00 sys = 18.25 CPU)


Subtracting the times, I get:

      31.89 - 18.25 - 3.50 = 10.14 seconds

It appears that the performance penalty for pattern
matches is around 10.14 seconds per 10 million pattern
matches.  That's about one microsecond for every pattern
match.


Looking back at my program, I realized that someone might
accuse the program of being invalid, claiming that whatever
overhead timethese() uses to run code is subtracted out of
the "bad" code (when we subtract out $prefix's time) but
never subtracted from the "good" code, making the
performance penalty look smaller than it really is.

I'm not sure if there is an overhead, but just in case, I
decided to make a new variation of the above code to account
for this:


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $count = 1e7;
my $goodPrefix = q!$a = ""  if "a" =~ m/a/;!;
my $badPrefix = q!$a = $'  if "a" =~ m/a/;!;
my $line = '$a = 1  if "abc=xyz" =~ m/=/;';
my $goodProgram = "$goodPrefix $line";
my $badProgram = "$badPrefix $line";

timethese($count, {good => $goodProgram, goodPrefix => $goodPrefix});
timethese($count, {bad => $badProgram, badPrefix => $badPrefix});
__END__


This code now has a $goodPrefix (which is untainted) and a
$badPrefix (which is the taint line).  The "good" and the
"bad" code both will have their prefixes' time subtracted
out, putting them on more equal footing.  These were my
results:

good:        ( 7.72 usr +  0.02 sys =  7.73 CPU)
goodPrefix:  ( 3.83 usr +  0.00 sys =  3.83 CPU)
bad:         (30.56 usr +  0.01 sys = 30.58 CPU)
badPrefix:   (16.09 usr +  0.02 sys = 16.11 CPU)


The time difference between the tainted and untainted
pattern matches is:

      (30.56 - 16.09) - (7.72 - 3.83) = 10.58

That's a penalty of 10.58 seconds for ten million pattern
matches (or about one microsecond for every pattern match).

I thought about it a bit more, and I realized that the
string "abc=xyz" is a rather small string, quite possibly
much smaller than the average string used in a pattern
match.  It stands to reason that if the string-to-be-matched
is much larger than seven characters, the penalty incurred
from copying the string to $`, $&, and $' would be greater.

So I composed the following program:


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $count = 1e6;
my $goodPrefix = q!$a = ""  if "a" =~ m/a/;!;
my $badPrefix = q!$a = $'  if "a" =~ m/a/;!;
my $line = '$a = 1  if "abc=xyz" =~ m/=/;';

# Replace "abc=xyz" with longer string:
my $string = join('', 'A'..'Z','a'..'z','0'..'9');
$string = "$string=$string";
$string x= 1;  # change multiplier to make longer string
$line =~ s/abc=xyz/$string/;

my $goodProgram = "$goodPrefix $line";
my $badProgram = "$badPrefix $line";

timethese($count, {good => $goodProgram, goodPrefix => $goodPrefix});
timethese($count, {bad => $badProgram, badPrefix => $badPrefix});
__END__


The key line is this one:

   $string x= 1;  # change multiplier to make longer string

With the multiplier set to one, the main code line compares
the string:

   ABC...XYZabc...xyz012...789=ABC...XYZabc...xyz012...789

which is 125 characters long.  That's 56% larger than the
standard screen width of 80 characters.

I discovered something interesting:  as I doubled the
multiplier from 1 to 2, then to 4, then 8, and so on, I
discovered that the penalty did indeed increase, but not
linearly.  I would have expected the penalty to at least
double each time, but instead it increased very slowly at
first before it started to double.  I expect that its
tardiness in doubling was due to some overhead factor.

Anyway, for each multiplier, here is the measured
penalty (in microseconds) for a pattern match with that
long string:

  1:  1.08
  2:  1.14
  4:  1.19
  8:  1.67
 16:  1.84
 32:  2.73
 64:  3.78
128:  5.72
256:  9.89
512: 19.13

(Note that in the 512 case, 64 thousand characters were used
in every pattern match, with a penalty of about 20
microseconds (or 0.02 milliseconds) for every pattern
match.)

Although the doubling behavior of the penalty is latent, it
does appear that it eventually "doubles-out."  Of course, as
the multiplier increased to ever-greater numbers, the Perl
interpreter may run out of its allotted RAM, resorting to
other resources such as virtual memory & disk space to
finish its task.  Once this happens, the processes used to
supply "substitute RAM" will likely create a large
performance drop (significantly greater than just doubling).
However, as we see from my results, a pattern match on a
string of 64,000 characters still only has a penalty of
about 20 microseconds.


Once I did this I got to thinking that all of my programs
thus far use no candidate lines whatsoever.  Obviously, the
$MATCH variables are used in programs where some candidate
lines are present (otherwise, they're never used (or shouldn't
be, at any rate)).  So assuming that none of the regular
expressions are in candidate lines wouldn't necessarily be
fair to the code that uses the $MATCH variables.

So I decided to write this program that tests the
performance of a program where ALL its regular expressions
are in candidate lines:


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $count = 1e7;
my $bad = q!$string = $'  if "abc=xyz" =~ m/=/; chomp($string)!;
my $good = q!$string = $1  if "abc=xyz" =~ m/=(.*)$/!;

timethese($count, {good => $good});
timethese($count, {bad => $bad});
__END__


The results:

good: (19.92 usr +  0.00 sys = 19.92 CPU)
 bad: (17.38 usr +  0.00 sys = 17.38 CPU)

It looks like for once that the "bad" code fares better than
the "good" code (by an average of a quarter of a microsecond
per regular expression)!

I also tried replacing the "abc=xyz" string to the much
larger string of:

   ABC...XYZabc...xyz012...789=ABC...XYZabc...xyz012...789

repeated 256 times (for a total of 32,000 characters).
My results were:

good: (84.55 usr +  0.00 sys = 84.55 CPU)
 bad: (39.06 usr +  0.00 sys = 39.06 CPU)

Surprisingly, the "bad" code ran more than twice as fast as
the "good" code!  Apparently, the larger the string that has
to get captured, the better the "bad" code performs (that
is, when compared to the non-tainted code).

Therefore, I have to make the case that, in the event that
someone creates a Perl program that has candidate lines
for most of its regular expressions, it may actually be
faster to use the $MATCH variables.  (Of course, it may or
may not be faster to use the $MATCH variables, but if no
super-long strings are used, the difference will be tiny).

At any rate, unless super-long strings are used in
pattern-matches, the difference (for better or for worse) of
using the $MATCH variables on each regular expression is on
the order of microseconds per match.


Okay, so I went through all this trouble to measure the
speed of tainted lines vs. non-tainted lines.  However, all
the programs I've tested so far have no input or output,
which is pretty rare for useful Perl scripts.  Therefore,
unless you plan to write Perl programs that consist of only
pattern-matching and have absolutely no I/O involved, these
tests really aren't that useful in determining just how bad
a performance penalty from using the $MATCH variables can
be.

So I decided that my next logical step would be to benchmark
a program that did a task that Perl would conceivably be
used for:  Searching every file in the current directory
and printing out the number of lines that contain a certain
string ("Perl", in this case):


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $count = 1e2;
my $prefix = q!$a = $'  if "a" =~ m/a/;!;
my $program =<<'END_OF_PROGRAM';
   @ARGV = grep -f, <*>;
   $b = 0;
   m/\bPerl\b/ and $b++  while <>;
END_OF_PROGRAM

timethese($count, {good => $program});
print "Count: $b\n";
timethese($count, {bad => "$prefix $program", prefix => $prefix});
print "Count: $b\n";
__END__


The results:

   good: (11.66 usr + 12.08 sys = 23.73 CPU)
Count: 342
    bad: (11.88 usr + 11.72 sys = 23.59 CPU)
 prefix: ( 0.00 usr +  0.00 sys =  0.00 CPU)
Count: 342

We see that the "good" code uses less usr time (at about 2.2
milliseconds per run), but for some reason it managed to use
more sys time (about 3.6 milliseconds per run).  I'm not
sure why this is so, but I'm inclined to think that because
the $MATCH penalty is so small compared to disk access, the
penalty doesn't really show up when run with code that uses
disk access.  In other words, the variation in time for
code that uses disk I/O is so great that it more than makes
up for just about any penalty introduced by the $MATCH
variables.


Okay, so at this point I realize that almost all performance
penalties incurred by the $MATCH variables are quite small,
and that when used in programs that use I/O processes, they
seem to become negligible.  But I wanted to perform one more
benchmark test:  How would the $' variable affect REAL Perl
programs, the heavy-duty scripts that we've written and are
proud of?

I have one such Perl program I wrote at work.  We would get
a special type of binary data files and we would need to
peer into them to see what data they held.  This is quite a
complicated process, as it would require a full-fledged
decoder to handle the special data types and
representations.  In order to peek into a new file that was
new to us, we would have to take our normal decoder (written
in C) and use the debugger to change the flow of execution
so that the contents of the file would be written out to the
screen for human eyes to read.

This technique was cumbersome and slow, so I wrote a Perl
program to parse through a binary file of that format and
spit out its contents in human-readable form.  It worked
great.  (And to think that the very first time I decoded a
file in such a format I was using a paper, pencil, and hex
dump output.)

Because that Perl script I wrote belongs to the company I
work for, I won't post it here, but I will give some
statistics about it:  In order to parse any binary file it
must first parse through three ASCII configuration files
(with 173, 558, and 2060 lines with each line going through
at least one regular expression).  For about every 1K of
input, the program spits out about 4K of output.  There are
quite a few regular expressions sprinkled throughout the
program; several of which are operating on the binary data
(one of which operates on an entire binary file).  For the
record, this program does NOT use the $`, $&, or $'
variables, nor the English module (or any other module, for
that matter).

I made two copies of this program:  one named "good.pl" and
one named "bad.pl".  I inserted the following line at the
top of both files:

      @ARGV = <*.bin>;

and inserted the following taint line at the top of
"bad.pl":

      $a = $'  if "a" =~ m/a/;

Then I ran the following script:


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $count = 1e3;
timethese($count, {good => "do 'good.pl'"});
timethese($count, {good => "do 'good.pl'"});
timethese($count, {bad => "do 'bad.pl'"});
__END__


Let me explain why I time "good.pl" twice:
Because I have several functions defined in the script,
timethese() keeps complaining that the functions are kept
being redefined over and over.  Obviously, it won't
complain about this the very first time the program is run,
making the first timethese() have an advantage over any
subsequent calls that test the same code.  Therefore, by
calling timethese() three times, I ensure that the second
and third call to timethese() are on equal footing (as far
as redefining functions goes).

I ran this program so that the code it called would have to
decode a somewhat large load of 55 files (around 5K each).

Here are the results:

Benchmark: timing 1000 iterations of good...
 good: (45.09 usr +  1.78 sys = 46.88 CPU)
Benchmark: timing 1000 iterations of good...
 good: (45.61 usr +  1.47 sys = 47.08 CPU)
Benchmark: timing 1000 iterations of bad...
  bad: (45.64 usr +  1.55 sys = 47.19 CPU)


It appears that, when comparing the CPU time, 1000 runs of
the program takes 0.11 seconds longer when I use the $MATCH
variables.  That means that I'd save only around 0.11
milliseconds (per single run) by avoiding them.

Just for fun, I decided to concatenate all those files (and a few
more) into one huge binary file (my program was made to handle
this), and then to "cat" that file several times into an even
bigger binary file.  Then the "good.pl" and "bad.pl" files were
changed to read that file as input.  For the record, that file
was 5,938,445 bytes.  I wrote my program so that it would do a
(successful) pattern match over the entire binary file contents
one time for each sub-file inside it.  That means that for every
time this program is run with this >5MB file as input (which is,
for this particular data format, unrealistically large), almost
six megabytes of data are copied into $`, $&, and $' three
hundred times.

And here are the results of running this program with a
abnormally large data set:

Benchmark: timing 1000 iterations of good...
 good: (129.78 usr +  2.56 sys = 132.34 CPU)
Benchmark: timing 1000 iterations of good...
 good: (129.39 usr +  2.39 sys = 131.78 CPU)
Benchmark: timing 1000 iterations of bad...
  bad: (130.36 usr +  2.94 sys = 133.30 CPU)


It looks like the performance penalty of using the $MATCH
variables is 133.30 - 131.78 = 1.52 seconds (for 1000
iterations).  For each run, less than 2 milliseconds is the
penalty for needlessly using the $MATCH variables.  Keep in
mind, for each run almost 6 megabytes of data were being
copied into $`, $&, and $' 300 times.  And the penalty was
not even two milliseconds.

And here is where others can help me out.  I'm curious to
see how your own favorite Perl programs fare.  If you'd like
to post your own results, take a good, robust Perl script
(that doesn't use the $MATCH variables), copy it to
"good.pl" and "bad.pl", add an @ARGV line to both files
(if necessary), and then add a "taint" line to the "bad.pl"
script.  Then run the above Perl code and compare the last
two benchmark readings (you may want to pipe the output to
"grep wallclock" to find the benchmark output if you normally
have a lot of output).


At this point, I have to say that these Perl programs I
wrote seem to show that the performance penalty of using $`,
$&, and $' are usually small and often negligible.



CONCLUSION


I happen to subscribe to Paul Graham's ideas on
optimization.  On page 213 of his book "ANSI Common Lisp"
(yes, Lisp), he writes:


   Three points can be made about optimization,
   regardless of the implementation:  it should be
   focused on bottlenecks, it should not begin too
   early, and it should begin with algorithms.

   Probably the most important thing to understand
   about optimization is that programs tend to have a
   few bottlenecks that account for a great part of
   the execution time. ... Optimizing these parts of
   the program will make it run noticeably faster;
   optimizing the rest of the program will be a
   waste of time in comparison.

   ...

   A corollary of the bottleneck rule is that one
   should not put too much effort into optimizing
   early in a program's life.  Knuth puts the point
   even more strongly: "Premature optimization is the
   root of all evil (or at least most of it) in
   programming."


In other words, optimizing code that is not a bottleneck may save
some time in the end, but the time it saves becomes negligible
when run together with a bottleneck, or at least with code that
has worse run-time behavior (measured in Big-O notation).  The
time spent in optimizing non-bottleneck code is usually much
greater than the few milliseconds it saves.

And since the penalty to using $`, $&, and $' seems to behave
at O(N) or possibly O(N*log(N)), the penalty will become
insignificant when used in programs that employ algorithms of
worse Big-O behavior (or that even use disk access).

So why does the penalty disclaimer seem to follow $'
wherever it is taught?  Personally, I think it's because the
penalty (however small) does exist, and programmers think
they have a responsibility to inform other programmers
(since one piece of code can affect another piece of
seemingly unrelated code).

But I still wondered about that, so I looked it up in the book
where I find learned about it:  "Learning Perl" by Randal L.
Schwartz & Tom Phoenix.  On page 122, it says,


   Now, we said earlier that these three are "free."
   Well, freedom has its price.  In this case, the
   price is that once you use any one of these
   automatic match variables anywhere in your entire
   program, other regular expressions will run a
   little more slowly.  Now this isn't a giant
   slowdown, but it's enough of a worry that many
   Perl programmers will simply never use these
   automatic match variables.[*]  Instead, they'll
   use a workaround.


And there's more:  A footnote is included:


   [*] Most of these folks haven't actually
       benchmarked their programs to see whether their
       workarounds actually save time, though; it's as
       though these variables were poisonous or
       something.  But we can't blame them for not
       benchmarking -- many programs that could benefit
       from these three variables take up only a few
       minutes of CPU time in a week, so benchmarking and
       optimizing would be a waste of time.  But in that
       case, why fear a possible extra millisecond?  By
       the way, Perl developers are working on this
       problem, but there will probably be no solution
       before Perl 6.


Apparently Randal L. Schwartz and Tom Phoenix don't seem to think
it would be such a big deal if some programs used $'.  I agree
with them, of course, but I do acknowledge that there are times
that it's better to use $' than others.

So I made a list of times it is acceptable to use $`, $&, and $':

1.  when most of the pattern matches in your program are
    candidate lines
2.  when none of the text strings being pattern-matched is longer
    than a thousand characters
3.  when there are few pattern matches when compared with the
    rest of the program
4.  when used along-side algorithms that have worse run-time
    behaviors

But it's also important to point out some times when it is not
acceptable to use those variables:

1.  when a pattern-matched string has a length greater than
    millions of characters
2.  when you are writing a Perl module for others to use

Point 1 is important for programmers who have to handle large
amount of binary data, or even long text strings, like DNA
sequences.  I tried a search engine, but couldn't find just how
much memory you would need to hold a complete human DNA sequence,
but I think I remember reading that it was something on the order
of fifteen gigabytes.  Reading something of that size takes up
enough time as it is; any needless copying will be definitely
noticeable and unwelcome.

Point 2 is important because your module might be used by people
who use super-long strings (such as DNA sequences).  So if you
feel the need to violate this point, at least have the decency to
document their usage, so that others are aware of it (and be
prepared to watch the usage of your module plummet).

And here is another list, enumerating reasons why it's okay to
use those variables in a script for your own use:

1.  If most of your pattern matches are candidate lines, there's
    a good chance that using $' (instead of m/=(.*)$/) will
    actually speed up your program.
2.  Even if it doesn't speed up your program, the time wasted
    will probably be negligible and won't ever be noticed.
3.  Even in the rare case where the time wasted isn't
    negligible, it should be a trivial matter to eliminate
    the penalty by editing the code and removing the match
    variables.

(I am reminded of the story I have heard about an enthusiastic
programmer who was determined to make his program run faster by
making it more "streamlined."  He replaced various parts of code
with presumably faster code, only to find out that the resulting
speed-up was negligible (or worse -- his new program ran slower).
The moral of this story is that it's not always obvious in what
parts of code most of the processor time is being used.  If a
section of code that uses very little processor time is
optimized, it might run faster, but the time it saves is
practically non-existent compared to the rest of the code.)

Bear in mind that using a tainted pattern match of:

      if (m/match/)
      {
         ...
      }

is somewhat equivalent to the untainted:

      if (m/^(.*?)(match)(.*)$/)
      {
         my $prematch  = $1;
         my $match     = $2;
         my $postmatch = $3;

         ...
      }

in that the copying into $prematch, $match, and $postmatch are
done whether you want them or not.  That amounts to the entire
string being copied over (possibly needlessly) for every
successful pattern match.  If the string is small, there's a good
chance that this penalty won't even be measurable.  However, in
the cases where the programmer has to make a conscious effort not
to needlessly copy a large string (possibly because he/she is
handling a large data stream), then the $MATCH variable should
probably best be avoided -- but ONLY if these large strings are
being matched with regular expressions (as opposed to, for
instance, the index() function).

Of course, if all your regular expressions with large strings
require you to capture the pre-match and post-match with
parentheses, it would probably be faster to go ahead and use the
$MATCH variables.

So it's your call as the programmer to make an educated decision
whether or not to use the $MATCH variables.  If regular
expressions are used only on small strings, the penalty most
likely will make no real difference.  But if large strings are
used, give it some thought.


Ironically, I never personally used $`, $&, and $' in my scripts
before I researched this (exactly because of the always-mentioned
penalty).  But now that I have made all these benchmarked tests
and examined the results, I just might start using them in my own
small programs.


(Constructive criticism is welcome.  Since usage of these match
variables is a touchy topic for some Perl programmers, please
don't flame me if I made a statement that doesn't agree with your
beliefs.  After all, I am only human, so if I said something that
you feel is incorrect or invalid, feel free to point it out and
comment on it, without resorting to needlessly harsh comments
about my programming skills.  Thank you.)

   Happy Perling!

   -- Jean-Luc Romano


------------------------------

Date: 18 Jan 2005 12:25:10 +0100
From: Arndt Jonasson <do-not-use@invalid.net>
Subject: Re: How to compare files?
Message-Id: <yzdmzv7htd5.fsf@invalid.net>


Martin Kissner <news@chaos-net.de> writes:
> Arndt Jonasson wrote :
> >
> > Martin Kissner <news@chaos-net.de> writes:
> >> Case wrote :
> >> > What is a good (readable) way of comparing two
> >> > files.  I just need to know if they match, or not.
> >> 
> >> since seem to be on Linux, you might use "diff"
> >> diff file1 file2
> >> diff -y file1 file2
> >> if you want to use the side by side output format.
> >
> > 'cmp' is better, if all you want to know is whether they differ or
> > not, and not present the differences. We don't know whether they are
> > text files.
> >
> > But what Case means by "readable", I don't know.
> 
> Neither do I.
> I don't even know if he wants to diff two files in the shell or by a
> Perlscript
> 
> In the shell, however, you could use 'diff -q' if you only want to know
> if the files differ. I don't know if 'cmp' or 'diff' is better.

"diff -q" does not seem to be found on all Unix versions.


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc.  For subscription or unsubscription requests, send
#the single line:
#
#	subscribe perl-users
#or:
#	unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.  

NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice. 

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V10 Issue 7667
***************************************


home help back first fref pref prev next nref lref last post