[31390] in Perl-Users-Digest

Perl-Users Digest, Issue: 2642 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Mon Oct 19 14:35:15 2009

Date: Mon, 19 Oct 2009 11:35:03 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Mon, 19 Oct 2009     Volume: 11 Number: 2642

Today's topics:
        Posting Guidelines for comp.lang.perl.misc ($Revision:  tadmc@seesig.invalid
    Re: Strange behavoiur when passing $1 to a sub (Heinrich Mislik)
    Re: Strange behavoiur when passing $1 to a sub <nospam-abuse@ilyaz.org>
    Re: Strange behavoiur when passing $1 to a sub (Heinrich Mislik)
    Re: Strange behavoiur when passing $1 to a sub <nospam-abuse@ilyaz.org>
        Triple escape apostrophes <spam.meplease@ntlworld.com>
    Re: Triple escape apostrophes <mritty@gmail.com>
    Re: Trying to parse/match a C string literal <jl_post@hotmail.com>
    Re: Trying to parse/match a C string literal sln@netherlands.com
        Want to write a script to note specific IP addresses. <hongyi.zhao@gmail.com>
    Re: Want to write a script to note specific IP addresse sln@netherlands.com
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Fri, 16 Oct 2009 02:25:13 -0500
From: tadmc@seesig.invalid
Subject: Posting Guidelines for comp.lang.perl.misc ($Revision: 1.9 $)
Message-Id: <MtidnTWmddJEgkXXnZ2dnUVZ_uGdnZ2d@giganews.com>

Outline
   Before posting to comp.lang.perl.misc
      Must
       - Check the Perl Frequently Asked Questions (FAQ)
       - Check the other standard Perl docs (*.pod)
      Really Really Should
       - Lurk for a while before posting
       - Search a Usenet archive
      If You Like
       - Check Other Resources
   Posting to comp.lang.perl.misc
      Is there a better place to ask your question?
       - Question should be about Perl, not about the application area
      How to participate (post) in the clpmisc community
       - Carefully choose the contents of your Subject header
       - Use an effective followup style
       - Speak Perl rather than English, when possible
       - Ask perl to help you
       - Do not re-type Perl code
       - Provide enough information
       - Do not provide too much information
       - Do not post binaries, HTML, or MIME
      Social faux pas to avoid
       - Asking a Frequently Asked Question
       - Asking a question easily answered by a cursory doc search
       - Asking for emailed answers
       - Beware of saying "doesn't work"
       - Sending a "stealth" Cc copy
      Be extra cautious when you get upset
       - Count to ten before composing a followup when you are upset
       - Count to ten after composing and before posting when you are upset
-----------------------------------------------------------------

Posting Guidelines for comp.lang.perl.misc ($Revision: 1.9 $)
    This newsgroup, commonly called clpmisc, is a technical newsgroup
    intended to be used for discussion of Perl related issues (except job
    postings), whether it be comments or questions.

    As you would expect, clpmisc discussions are usually very technical in
    nature and there are conventions for conduct in technical newsgroups
    going somewhat beyond those in non-technical newsgroups.

    The article at:

        http://www.catb.org/~esr/faqs/smart-questions.html

    describes how to get answers from technical people in general.

    This article describes things that you should, and should not, do to
    increase your chances of getting an answer to your Perl question. It is
    available in POD, HTML and plain text formats at:

     http://www.rehabitation.com/clpmisc.shtml

    For more information about netiquette in general, see the "Netiquette
    Guidelines" at:

     http://andrew2.andrew.cmu.edu/rfc/rfc1855.html

    A note to newsgroup "regulars":

       Do not use these guidelines as a "license to flame" or other
       meanness. It is possible that a poster is unaware of things
       discussed here.  Give them the benefit of the doubt, and just
       help them learn how to post, rather than assume that they do 
       know and are being the "bad kind" of Lazy.

    A note about technical terms used here:

       In this document, we use words like "must" and "should" as
       they're used in technical conversation (such as you will
       encounter in this newsgroup). When we say that you *must* do
       something, we mean that if you don't do that something, then
       it's unlikely that you will benefit much from this group.
       We're not bossing you around; we're making the point without
       lots of words.

    Do *NOT* send email to the maintainer of these guidelines. It will be
    discarded unread. The guidelines belong to the newsgroup so all
    discussion should appear in the newsgroup. I am just the secretary that
    writes down the consensus of the group.

Before posting to comp.lang.perl.misc
  Must
    This section describes things that you *must* do before posting to
    clpmisc, in order to maximize your chances of getting meaningful replies
    to your inquiry and to avoid getting flamed for being lazy and trying to
    have others do your work.

    The perl distribution includes documentation that is copied to your hard
    drive when you install perl. Also installed is a program for looking
    things up in that (and other) documentation named 'perldoc'.

    You should either find out where the docs got installed on your system,
    or use perldoc to find them for you. Type "perldoc perldoc" to learn how
    to use perldoc itself. Type "perldoc perl" to start reading Perl's
    standard documentation.

    Check the Perl Frequently Asked Questions (FAQ)
        Checking the FAQ before posting is required in Big 8 newsgroups in
        general; there is nothing clpmisc-specific about this requirement.
        You are expected to do this in nearly all newsgroups.

        You can use the "-q" switch with perldoc to do a word search of the
        questions in the Perl FAQs.

    Check the other standard Perl docs (*.pod)
        The perl distribution comes with much more documentation than is
        available for most other newsgroups, so in clpmisc you should also
        see if you can find an answer in the other (non-FAQ) standard docs
        before posting.

    It is *not* required, or even expected, that you actually *read* all of
    Perl's standard docs, only that you spend a few minutes searching them
    before posting.

    Try doing a word-search in the standard docs for some words/phrases
    taken from your problem statement or from your very carefully worded
    "Subject:" header.

  Really Really Should
    This section describes things that you *really should* do before posting
    to clpmisc.

    Lurk for a while before posting
        This is very important and expected in all newsgroups. Lurking means
        to monitor a newsgroup for a period to become familiar with local
        customs. Each newsgroup has specific customs and rituals. Knowing
        these before you participate will help avoid embarrassing social
        situations. Consider yourself to be a foreigner at first!

    Search a Usenet archive
        There are tens of thousands of Perl programmers. It is very likely
        that your question has already been asked (and answered). See if you
        can find where it has already been answered.

        One such searchable archive is:

         http://groups.google.com/advanced_search

  If You Like
    This section describes things that you *can* do before posting to
    clpmisc.

    Check Other Resources
        You may want to check in books or on web sites to see if you can
        find the answer to your question.

        But you need to consider the source of such information: there are a
        lot of very poor Perl books and web sites, and several good ones
        too, of course.

Posting to comp.lang.perl.misc
    There can be 200 messages in clpmisc in a single day. Nobody is going to
    read every article. They must decide somehow which articles they are
    going to read, and which they will skip.

    Your post is in competition with 199 other posts. You need to "win"
    before a person who can help you will even read your question.

    These sections describe how you can help keep your article from being
    one of the "skipped" ones.

  Is there a better place to ask your question?
    Question should be about Perl, not about the application area
        It can be difficult to separate out where your problem really is,
        but you should make a conscious effort to post to the most
        applicable newsgroup. That is, after all, where you are the most
        likely to find the people who know how to answer your question.

        Being able to "partition" a problem is an essential skill for
        effectively troubleshooting programming problems. If you don't get
        that right, you end up looking for answers in the wrong places.

        It should be understood that you may not know that the root of your
        problem is not Perl-related (the two most frequent ones are CGI and
        Operating System related), so off-topic postings will happen from
        time to time. Be gracious when someone helps you find a better place
        to ask your question by pointing you to a more applicable newsgroup.

  How to participate (post) in the clpmisc community
    Carefully choose the contents of your Subject header
        You have 40 precious characters of Subject to win out and be one of
        the posts that gets read. Don't waste them. Take care while
        composing them; they are the key that opens the door to getting an
        answer.

        Spend them indicating what aspect of Perl others will find if they
        should decide to read your article.

        Do not spend them indicating "experience level" (guru, newbie...).

        Do not spend them pleading (please read, urgent, help!...).

        Do not spend them on non-Subjects (Perl question, one-word
        Subject...)

        For more information on choosing a Subject see "Choosing Good
        Subject Lines":

         http://www.cpan.org/authors/id/D/DM/DMR/subjects.post

        Part of the beauty of newsgroup dynamics is that you can contribute
        to the community with your very first post! If your choice of
        Subject leads a fellow Perler to find the thread you are starting,
        then even asking a question helps us all.

    Use an effective followup style
        When composing a followup, quote only enough text to establish the
        context for the comments that you will add. Always indicate who
        wrote the quoted material. Never quote an entire article. Never
        quote a .signature (unless that is what you are commenting on).

        Intersperse your comments *following* each section of quoted text to
        which they relate. Unappreciated followup styles are referred to as
        "top-posting", "Jeopardy" (because the answer comes before the
        question), or "TOFU" (Text Over, Fullquote Under).

        Reversing the chronology of the dialog makes it much harder to
        understand (some folks won't even read it if written in that style).
        For more information on quoting style, see:

         http://web.presby.edu/~nnqadmin/nnq/nquote.html

    Speak Perl rather than English, when possible
        Perl is much more precise than natural language. Saying it in Perl
        instead will avoid misunderstanding your question or problem.

        Do not say: I have a variable with "foo\tbar" in it.

        Instead say: I have $var = "foo\tbar", or I have $var = 'foo\tbar',
        or I have $var = <DATA> (and show the data line).
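
        To make the difference concrete, here is a minimal sketch (the
        variable names are invented for illustration) showing that the two
        spellings hold different strings, which English alone would not
        reveal:

```perl
use strict;
use warnings;

# Double quotes process escapes: \t becomes a single TAB character.
my $double = "foo\tbar";

# Single quotes do not: the backslash and the 't' stay two literal characters.
my $single = 'foo\tbar';

printf "double-quoted: %d characters\n", length $double;
printf "single-quoted: %d characters\n", length $single;
print  "the two strings are ", ($double eq $single ? "equal" : "different"), "\n";
```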

    Ask perl to help you
        You can ask perl itself to help you find common programming mistakes
        by doing two things: enable warnings (perldoc warnings) and enable
        "strict"ures (perldoc strict).

        You should not bother the hundreds/thousands of readers of the
        newsgroup without first seeing if a machine can help you find your
        problem. It is demeaning to be asked to do the work of a machine. It
        will annoy the readers of your article.

        You can look up any of the messages that perl might issue to find
        out what the message means and how to resolve the potential mistake
        (perldoc perldiag). If you would like perl to look them up for you,
        you can put "use diagnostics;" near the top of your program.
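
        As a rough sketch of what that looks like in practice, the
        following captures one warning and one strict error. The %count
        hash, the misspelled $nmae, and the string eval (used only so the
        deliberate typo does not abort the script) are illustrative
        assumptions, not part of the guidelines:

```perl
use strict;
use warnings;

# Collect warnings in an array so the sketch can show them afterwards.
my @caught;
local $SIG{__WARN__} = sub { push @caught, $_[0] };

# 1. warnings: using an undefined value in arithmetic is reported.
my %count;
my $next = $count{missing} + 1;    # triggers an "uninitialized value" warning

# 2. strict: a misspelled variable name is a compile-time error.
#    The string eval keeps the deliberate typo from aborting this script.
my $compiled = eval q{
    my $name = "Larry";
    $nmae = "Wall";    # typo for $name
    1;
};
my $strict_error = $compiled ? '' : $@;

print "warning caught: $caught[0]";
print "strict error:   $strict_error";
```

        Without the two pragmas, both mistakes would pass silently.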

    Do not re-type Perl code
        Use copy/paste or your editor's "import" function rather than
        attempting to type in your code. If you make a typo you will get
        followups about your typos instead of about the question you are
        trying to get answered.

    Provide enough information
        If you do the things in this item, you will have an Extremely Good
        chance of getting people to try and help you with your problem!
        These features are a really big bonus toward your question winning
        out over all of the other posts that you are competing with.

        First make a short (less than 20-30 lines) and *complete* program
        that illustrates the problem you are having. People should be able
        to run your program by copy/pasting the code from your article. (You
        will find that doing this step very often reveals your problem
        directly, leading to an answer much more quickly and reliably than
        posting to Usenet.)

        Describe *precisely* the input to your program. Also provide example
        input data for your program. If you need to show file input, use the
        __DATA__ token (perldata.pod) to provide the file contents inside of
        your Perl program.

        Show the output (including the verbatim text of any messages) of
        your program.

        Describe how you want the output to be different from what you are
        getting.

        If you have no idea at all of how to code up your situation, be sure
        to at least describe the 2 things that you *do* know: input and
        desired output.

    Do not provide too much information
        Do not just post your entire program for debugging. Most especially
        do not post someone *else's* entire program.

    Do not post binaries, HTML, or MIME
        clpmisc is a text only newsgroup. If you have images or binaries
        that explain your question, put them in a publicly accessible
        place (like a Web server) and provide a pointer to that location. If
        you include code, cut and paste it directly in the message body.
        Don't attach anything to the message. Don't post vcards or HTML.
        Many people (and even some Usenet servers) will automatically filter
        out such messages. Many people will not be able to easily read your
        post. Plain text is something everyone can read.

  Social faux pas to avoid
    The first two below are symptoms of lots of FAQ asking here in clpmisc.
    It happens so often that folks will assume that it is happening yet
    again. If you have looked but not found, or found but didn't understand
    the docs, say so in your article.

    Asking a Frequently Asked Question
        It should be understood that you may have missed the applicable FAQ
        when you checked, which is not a big deal. But if the Frequently
        Asked Question is worded similarly to your question, folks will assume
        that you did not look at all. Don't become indignant at pointers to
        the FAQ, particularly if it solves your problem.

    Asking a question easily answered by a cursory doc search
        If folks think you have not even tried the obvious step of reading
        the docs applicable to your problem, they are likely to become
        annoyed.

        If you are flamed for not checking when you *did* check, then just
        shrug it off (and take the answer that you got).

    Asking for emailed answers
        Emailed answers benefit one person. Posted answers benefit the
        entire community. If folks can take the time to answer your
        question, then you can take the time to go get the answer in the
        same place where you asked the question.

        It is OK to ask for a *copy* of the answer to be emailed, but many
        will ignore such requests anyway. If you munge your address, you
        should never expect (or ask) to get email in response to a Usenet
        post.

        Ask the question here, get the answer here (maybe).

    Beware of saying "doesn't work"
        This is a "red flag" phrase. If you find yourself writing that,
        pause and see if you can't describe what is not working without
        saying "doesn't work". That is, describe how it is not what you
        want.

    Sending a "stealth" Cc copy
        A "stealth Cc" is when you both email and post a reply without
        indicating *in the body* that you are doing so.

  Be extra cautious when you get upset
    Count to ten before composing a followup when you are upset
        This is recommended in all Usenet newsgroups. Here in clpmisc, most
        flaming sub-threads are not about any feature of Perl at all! They
        are most often for what was seen as a breach of netiquette. If you
        have lurked for a bit, then you will know what is expected and won't
        make such posts in the first place.

        But if you get upset, wait a while before writing your followup. I
        recommend waiting at least 30 minutes.

    Count to ten after composing and before posting when you are upset
        After you have written your followup, wait *another* 30 minutes
        before committing yourself by posting it. You cannot take it back
        once it has been said.

AUTHOR
    Tad McClellan and many others on the comp.lang.perl.misc newsgroup.

-- 
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"


------------------------------

Date: 15 Oct 2009 11:40:56 GMT
From: Heinrich.Mislik@univie.ac.at (Heinrich Mislik)
Subject: Re: Strange behavoiur when passing $1 to a sub
Message-Id: <4ad70a48$0$10578$3b214f66@usenet.univie.ac.at>

In article <3b76d68a-980c-4889-9c68-6df9fee8b56f@s31g2000yqs.googlegroups.com>, smallpond@juno.com says...

>perldoc perlvar
>$<digits> "These variables are all read-only and dynamically
>scoped to the current BLOCK."
>
>So $_[0] is an alias to the $1 in the current block 

That's the point: why is $_[0] an alias to the $1 in the current block?
It really should be an alias to the $1 that exists outside of the sub,
so the value of $_[0] shouldn't change when a regex in the sub is used.

Cheers

Heinrich

-- 
Heinrich Mislik
Zentraler Informatikdienst der Universitaet Wien
A-1010 Wien, Universitaetsstrasse 7
Tel.: (+43 1) 4277-14056, Fax: (+43 1) 4277-9140



------------------------------

Date: Thu, 15 Oct 2009 22:46:54 +0000 (UTC)
From: Ilya Zakharevich <nospam-abuse@ilyaz.org>
Subject: Re: Strange behavoiur when passing $1 to a sub
Message-Id: <slrnhdf9it.qpf.nospam-abuse@chorin.math.berkeley.edu>

On 2009-10-12, smallpond <smallpond@juno.com> wrote:
> perldoc perlvar
> $<digits> "These variables are all read-only and dynamically
> scoped to the current BLOCK."

As usual with Perl docs, this is complete BS.

> So $_[0] is an alias to the $1 in the current block

There is no "$1 in the current block".  There is exactly one $1.  (Its
VALUE is RESTORED when a block ends.)

One should never pass $N variables to subroutines any other way than

  f("$3")

Hope this helps,
Ilya
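
A minimal sketch of the effect discussed in this thread, assuming a
hypothetical helper sub named peek that runs a match of its own. Passing
$1 directly lets @_ alias the one global $1, so the sub should see its own
capture once its match succeeds; passing "$1" hands over a plain copy:

```perl
use strict;
use warnings;

# Hypothetical helper: reads its first argument before and after running
# a regex match of its own.
sub peek {
    my $before = $_[0];    # value seen on entry
    "xyz" =~ /(z)/;        # a successful match inside the sub
    my $after  = $_[0];    # the same argument, read again
    return ($before, $after);
}

"abc" =~ /(b)/ or die "outer match failed";

my @aliased = peek($1);      # @_ aliases the global $1
my @copied  = peek("$1");    # stringifying passes a copy instead

print "aliased: before=$aliased[0] after=$aliased[1]\n";
print "copied:  before=$copied[0] after=$copied[1]\n";
print "caller's \$1 afterwards: $1\n";
```

The caller's $1 is printed last to show that its value is back once the
sub has returned.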


------------------------------

Date: 16 Oct 2009 09:58:24 GMT
From: Heinrich.Mislik@univie.ac.at (Heinrich Mislik)
Subject: Re: Strange behavoiur when passing $1 to a sub
Message-Id: <4ad843c0$0$11610$3b214f66@usenet.univie.ac.at>

In article <slrnhdf9it.qpf.nospam-abuse@chorin.math.berkeley.edu>, nospam-abuse@ilyaz.org says...

>There is no "$1 in the current block".  There is exactly one $1.  (Its
>VALUE is RESTORED when a block ends.)
>
>One should never pass $N variables to subroutines any other way than
>
>  f("$3")
>
>Hope this helps,

Thanks, yes, things get clear now. Maybe the text for $<digits> in
perldoc perlvar should point to "Temporary Values via local()" in
perldoc perlsub. That's where "dynamic scoping" is explained in full.

Cheers Heinrich

-- 
Heinrich Mislik
Zentraler Informatikdienst der Universitaet Wien
A-1010 Wien, Universitaetsstrasse 7
Tel.: (+43 1) 4277-14056, Fax: (+43 1) 4277-9140



------------------------------

Date: Sat, 17 Oct 2009 07:23:52 +0000 (UTC)
From: Ilya Zakharevich <nospam-abuse@ilyaz.org>
Subject: Re: Strange behavoiur when passing $1 to a sub
Message-Id: <slrnhdis88.9o9.nospam-abuse@chorin.math.berkeley.edu>

On 2009-10-16, Heinrich Mislik <Heinrich.Mislik@univie.ac.at> wrote:
> In article <slrnhdf9it.qpf.nospam-abuse@chorin.math.berkeley.edu>, nospam-abuse@ilyaz.org says...
>
>>There is no "$1 in the current block".  There is exactly one $1.  (Its
>>VALUE is RESTORED when a block ends.)
>>
>>One should never pass $N variables to subroutines any other way than
>>
>>  f("$3")
>>
>>Hope this helps,
>
> Thanks, yes, things get clear now. Maybe the text for $<digits> in
> perldoc perlvar should point to "Temporary Values via local()" in
> perldoc perlsub. That's where "dynamic scoping" is explained in full.

That will not work either.  The semantics of $1 differ from the two
other types of localization: via `local *foo' and via `local $foo'.
Both of those produce "new VARIABLES".  $N variables get "new VALUES"
(IIRC).

IIRC, one way to check is to print out references to the variables...

Yours,
Ilya
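
The reference check mentioned above might look like the following sketch.
The variable names are made up for illustration; it compares the address
of the variable outside and inside a block, once for `local $foo' and once
for $1:

```perl
use strict;
use warnings;

our $foo = "outer";

# local $foo: take a reference outside and inside the localizing block.
my $foo_outside = \$foo;
my $foo_inside;
{
    local $foo = "inner";
    $foo_inside = \$foo;
}

# $1: take a reference before and inside a block that does its own match.
"abc" =~ /(b)/ or die "outer match failed";
my $one_outside = \$1;
my $one_inside;
{
    "xyz" =~ /(z)/ or die "inner match failed";
    $one_inside = \$1;
}

printf "local \$foo: %s vs %s\n", $foo_outside, $foo_inside;
printf "\$1:         %s vs %s\n", $one_outside, $one_inside;
print  "value of \$1 after the block: $1\n";
```

If the description holds, the two addresses printed for $foo differ while
the two printed for $1 are the same.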


------------------------------

Date: Fri, 16 Oct 2009 13:33:26 GMT
From: dan <spam.meplease@ntlworld.com>
Subject: Triple escape apostrophes
Message-Id: <GC_Bm.5461$XR1.1328@newsfe26.ams2>

Solution

Having trouble with a javascript alert which contained an apostrophe 
within a perl CGI script, through trial and error I eventually found out 
that triple escaping the apostrophe works. No idea why.

#!/usr/bin/perl -T
use CGI qw/:standard/;

my $JSCRIPT=<<EOF;
  function alertme() {
    alert('apostrophe\\\'s');
  }
EOF
;

print 
  header,
  start_html( -script => $JSCRIPT ),
  start_form(),
  submit( -onClick=> "alertme()" ),
  end_form,
  end_html
;


------------------------------

Date: Fri, 16 Oct 2009 06:40:40 -0700 (PDT)
From: Paul Lalli <mritty@gmail.com>
Subject: Re: Triple escape apostrophes
Message-Id: <92a08782-851e-46bd-bed2-a660449a0acc@m11g2000yqf.googlegroups.com>

On Oct 16, 9:33 am, dan <spam.meple...@ntlworld.com> wrote:
> Solution
>
> Having trouble with a javascript alert which contained an apostrophe
> within a perl CGI script, through trial and error I eventually found out
> that triple escaping the apostrophe works. No idea why.
>
> #!/usr/bin/perl -T
> use CGI qw/:standard/;
>
> my $JSCRIPT=<<EOF;
>   function alertme() {
>     alert('apostrophe\\\'s');
>   }
> EOF

When you use a here-doc without any quotes around the here-doc marker,
perl interprets it as being double quoted.  So what you wrote is no
different than:

my $JSCRIPT = "  function alertme() {\n    alert('apostrophe\\\'s');\n  }\n";

Since that string is in double quotes, any backslashes in it need to
be escaped.  So the first two characters, "\\", reduce to a single
backslash, and the following "\'" reduces to a single apostrophe.
Therefore, what ends up printed to your browser is

alert('apostrophe\'s')

That single backslash is needed by javascript to escape the apostrophe,
since the apostrophe is also the string delimiter.

You could reduce the number of backslashes by putting single quotes
around your heredoc marker, so that Perl treats it as a single-quoted
string rather than a double-quoted string:

my $JSCRIPT=<<'EOF';
  whateverwhatever
EOF

Paul Lalli
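
The two spellings can be compared side by side. This sketch (using only
the alert line from the original script; the variable names are made up)
builds the JavaScript once with an unquoted marker and once with a
single-quoted one, and checks that both yield the same text:

```perl
use strict;
use warnings;

# Unquoted marker: the body is treated like a double-quoted string,
# so \\ becomes \ and \' becomes '.
my $interpolated = <<EOF;
alert('apostrophe\\\'s');
EOF

# Single-quoted marker: the body is taken literally, backslashes included.
my $literal = <<'EOF';
alert('apostrophe\'s');
EOF

print "unquoted marker:      $interpolated";
print "single-quoted marker: $literal";
print "same JavaScript: ", ($interpolated eq $literal ? "yes" : "no"), "\n";
```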


------------------------------

Date: Fri, 16 Oct 2009 07:24:05 -0700 (PDT)
From: "jl_post@hotmail.com" <jl_post@hotmail.com>
Subject: Re: Trying to parse/match a C string literal
Message-Id: <1902c5f6-d1c8-4d32-a51f-12f0cf1b090c@j4g2000yqa.googlegroups.com>

On Sep 25, 10:02 pm, s...@netherlands.com wrote:
> (?x-ism:" ( (?: \\?. )*? ) ")
> the code took:2.84375 wallclock secs ( 2.84 usr +  0.00 sys =  2.84 CPU)
>
> (?x-ism:" (.*? (?<!\\) (?:\\{2})* ) ")
> the code took:2.62468 wallclock secs ( 2.62 usr +  0.00 sys =  2.62 CPU)
>
> (?x-ism:" ( (?: [^\\"] | \\. )* ) ")
> the code took:2.14033 wallclock secs ( 2.14 usr +  0.00 sys =  2.14 CPU)
>
> (?x-ism: "  ( (?: [^"\\]+ | (?:\\.)+ )* )  " )
> the code took:1.74956 wallclock secs ( 1.75 usr +  0.00 sys =  1.75 CPU)


   Thanks for the benchmark code, sln.  I ran it myself and pretty
much got the same results (in that their rankings in speed were the
same).  Which puzzles me, because when I ran my code against real-
world data, it showed that:

   m/" (.*? (?<!\\) (?:\\{2})* ) "/x

and:

   m/" ( (?: [^\\"] | \\. )* ) "/x

were clearly the fastest.  (That is, in contrast to your benchmark
code, which shows that:

   (?x-ism: "  ( (?: [^"\\]+ | (?:\\.)+ )* )  " )

is the fastest.)

   I did discover, however, that if I took your sample input (the part
after the __DATA__ statement) and removed all embedded double-quotes
and re-ran your benchmark program, then the "slower" regular
expressions started catching up to the faster ones.

   The only thing I can think of is that the efficiency of the regular
expressions can change depending on how many escaped quotes (and other
escaped characters) there are in the string that's examined.  And
since I was parsing through log messages (meant for the end user, not
the programmer), escaped quotes were fairly uncommon.  (They did
exist, but rarely as more than one pair at a time.)

   So in the end, it really depends on the data itself.  And the best
way to correctly "simulate" processing the input data is to process on
the input data itself (if that makes any sense).

   At any rate, thanks for your hard work in investigating this for
me, sln.

   -- Jean-Luc
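
One way to act on that is to benchmark against a sample of your own
input. The sketch below uses the core Benchmark module with two of the
patterns quoted above; the @lines array is a made-up stand-in for real
data, and the iteration count is arbitrary:

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Hypothetical sample: mostly plain strings, a few escaped characters.
# Replace this with lines taken from the real input.
my @lines = (
    'log("disk usage at 91 percent");',
    'log("user said \"hello\" and left");',
    'log("path is C:\\\\temp\\\\out.txt");',
    'log("nothing unusual to report");',
);

# Two of the patterns quoted in this thread.
my %regex = (
    one_char_per_pass => qr/" ( (?: [^\\"] | \\. )* ) "/x,
    runs_per_pass     => qr/" ( (?: [^"\\]+ | (?:\\.)+ )* ) "/x,
);

# Sanity check first: both patterns should capture the same text.
my %captured;
for my $name (sort keys %regex) {
    my $re = $regex{$name};
    $captured{$name} = join "\n", map { $_ =~ $re ? $1 : '(no match)' } @lines;
}

# Build one timing sub per pattern, run them silently, then print a chart.
my %code;
for my $name (keys %regex) {
    my $re = $regex{$name};
    $code{$name} = sub { for my $line (@lines) { my ($body) = $line =~ $re } };
}
my $results = timethese(100_000, \%code, 'none');
cmpthese($results);
```

The ranking printed by cmpthese only says something about the sample it
was run on, which is the point made above.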


------------------------------

Date: Sun, 18 Oct 2009 15:18:12 -0700
From: sln@netherlands.com
Subject: Re: Trying to parse/match a C string literal
Message-Id: <pl3nd55qffu4r42d1t8u5td7h9kt4ka5jd@4ax.com>

On Fri, 16 Oct 2009 07:24:05 -0700 (PDT), "jl_post@hotmail.com" <jl_post@hotmail.com> wrote:

>On Sep 25, 10:02 pm, s...@netherlands.com wrote:
>> (?x-ism:" ( (?: \\?. )*? ) ")
>> the code took:2.84375 wallclock secs ( 2.84 usr +  0.00 sys =  2.84 CPU)
>>
>> (?x-ism:" (.*? (?<!\\) (?:\\{2})* ) ")
>> the code took:2.62468 wallclock secs ( 2.62 usr +  0.00 sys =  2.62 CPU)
>>
>> (?x-ism:" ( (?: [^\\"] | \\. )* ) ")
>> the code took:2.14033 wallclock secs ( 2.14 usr +  0.00 sys =  2.14 CPU)
>>
>> (?x-ism: "  ( (?: [^"\\]+ | (?:\\.)+ )* )  " )
>> the code took:1.74956 wallclock secs ( 1.75 usr +  0.00 sys =  1.75 CPU)
>
>
>   Thanks for the benchmark code, sln.  I ran it myself and pretty
>much got the same results (in that their rankings in speed were the
>same).  Which puzzles me, because when I ran my code against real-
>world data, it showed that:
>
>   m/" (.*? (?<!\\) (?:\\{2})* ) "/x
>
>and:
>
>   m/" ( (?: [^\\"] | \\. )* ) "/x
>
>were clearly the fastest.  (That is, in contrast to your benchmark
>code, which shows that:
>
>   (?x-ism: "  ( (?: [^"\\]+ | (?:\\.)+ )* )  " )
>
>is the fastest.)
>
>   I did discover, however, that if I took your sample input (the part
>after the __DATA__ statement) and removed all embedded double-quotes
>and re-ran your benchmark program, then the "slower" regular
>expressions started catching up to the faster ones.
>
>   The only thing I can think of is that the efficiency of the regular
>expressions can change depending on how many escaped quotes (and other
>escaped characters) there are in the string that's examined.  And
>since I was parsing through log messages (meant for the end user, not
>the programmer), escaped quotes were fairly uncommon.  (They did
>exist, but rarely as more than one pair at a time.)
>
>   So in the end, it really depends on the data itself.  And the best
>way to correctly "simulate" processing the input data is to process on
>the input data itself (if that makes any sense).

I would take exception to the 'real world' scenario.
How are you using this? Is this quote regex just a subexpression in a
larger regex?

I re-ran the tests taking out any escaped characters (I actually did before too)
and matched just one time instead of global.

I also ran the benches on a single large '.cpp' file with about 100 medium-large
strings with some scattered '\'ed characters. There is still about a 25% performance
increase with (?x-ism: "  ( (?: [^"\\]+ | (?:\\.)+ )* )  " ) compared to the next
fastest.

I think the percentage difference is linear and related to the number of sub-matches
within the " " anchors. If there are no " in the sample, the numbers are
identical (as you would expect).

Below is a simple program of a few lines that runs the two fastest regexes
with regular expression debug output enabled. It simply reads the first
line of DATA and matches against it.

Below the program is my educated guess as to why the results are what they are.
Below that is the output of the simple program.

-sln
---------------------------

use strict;
use warnings;

use re "debug";

my $string = <DATA>;  # just get 1 line

$string =~ /"((?:[^"\\]+|(?:\\.)+)*)"/;

$string =~ /"((?:[^\\"]|\\.)*)"/;

__DATA__
1 "this one" 


Analysis
-------------
Intuitively, (?: | )*  is a complex grouping.
It says do the contents 0 or more times in a *loop*.

After each loop iteration, the next anchor (if there is one)
past the group is checked.

Within the group, if you are only checking for one simple character
at a time, the total time is:

  (1 char check + 1 loop check per character) X the number of characters that pass
  example:  "abbbcccd" =~ /a(?:[bc])*d/;

However, if you check for multiple characters within the loop at a time,
the total time is:

  (1 char check X the number of characters that pass) + 1 or 0 loop check
  example:  "abbbcccd" =~ /a(?:[bc]+)*d/;

Since the sum of all character checks only includes 1 or 0 loop checks (without backtracking),
the total time is reduced.
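
A rough way to see the difference is to count how often the group body
completes, using an embedded (?{ ... }) code block. This is only a sketch:
the counter $passes is an assumption of this example, and it measures
passes through the group, not actual engine time:

```perl
use strict;
use warnings;

our $passes;    # package variable so the embedded code blocks can update it

# One character per pass through the group.
$passes = 0;
my $matched_single = "abbbcccd" =~ /a(?:[bc](?{ $passes++ }))*d/;
my $single_passes  = $passes;

# A whole run of characters per pass through the group.
$passes = 0;
my $matched_run = "abbbcccd" =~ /a(?:[bc]+(?{ $passes++ }))*d/;
my $run_passes  = $passes;

print "one char per pass: $single_passes passes\n";
print "runs per pass:     $run_passes passes\n";
```

Both patterns match the same text; only the number of trips around the
loop differs.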

The regex engine initially processes anchors: it finds out where they are,
then allocates a span (as a limit) of characters between them, from
which to process variable data. This is called an optimization.

In this particular case /a(?:[bc]+)*d/, the anchors are 'a' and 'd'.
The engine finds these characters in the sample, measures the distance
between them and allows the simple character class [bc] to match up to
that span amount (because of the '+' and can be less, but no more) before
it does a single loop check.

It is always possible that the regex engine will do more complex optimizations
depending on the surrounding sub-expressions. When in doubt run a simple test
using the pragma:    use re 'debug';
Usually, the more steps (lines of output) it takes, the longer it takes.

See the Docs -
C:\Perl\html\lib\pods\perldebguts.html
search for: "Debugging regular expressions"
----------------------------------------------

Output:
-----------------
Compiling REx "%"((?:[^%"\\]+|(?:\\.)+)*)%""
Final program:
   1: EXACT <"> (3)
   3: OPEN1 (5)
   5:   CURLYX[0] {0,32767} (30)
   7:     BRANCH (20)
   8:       PLUS (29)
   9:         ANYOF[\0-!#-[\]-\377{unicode_all}] (0)
  20:     BRANCH (FAIL)
  21:       CURLYM[0] {1,32767} (29)
  23:         EXACT <\\> (25)
  25:         REG_ANY (26)
  26:         SUCCEED (0)
  27:       NOTHING (29)
  28:     TAIL (29)
  29:   WHILEM[1/2] (0)
  30:   NOTHING (31)
  31: CLOSE1 (33)
  33: EXACT <"> (35)
  35: END (0)
anchored "%"" at 0 floating "%"" at 1..2147483647 (checking floating) minlen 2
Compiling REx "%"((?:[^\\%"]|\\.)*)%""
Final program:
   1: EXACT <"> (3)
   3: OPEN1 (5)
   5:   CURLYX[0] {0,32767} (25)
   7:     BRANCH (19)
   8:       ANYOF[\0-!#-[\]-\377{unicode_all}] (24)
  19:     BRANCH (FAIL)
  20:       EXACT <\\> (22)
  22:       REG_ANY (24)
  23:     TAIL (24)
  24:   WHILEM[1/1] (0)
  25:   NOTHING (26)
  26: CLOSE1 (28)
  28: EXACT <"> (30)
  30: END (0)
anchored "%"" at 0 floating "%"" at 1..2147483647 (checking floating) minlen 2
Guessing start of match in sv for REx "%"((?:[^%"\\]+|(?:\\.)+)*)%"" against "1 %"this one%" %n"
Found floating substr "%"" at offset 2...
Contradicts anchored substr "%"", trying floating at offset 3...
Found floating substr "%"" at offset 11...
Found anchored substr "%"" at offset 2...
Starting position does not contradict /^/m...
Guessed: match at offset 2
Matching REx "%"((?:[^%"\\]+|(?:\\.)+)*)%"" against "%"this one%" %n"
   2 <1 > <"this one">       |  1:EXACT <">(3)
   3 <1 "> <this one" >      |  3:OPEN1(5)
   3 <1 "> <this one" >      |  5:CURLYX[0] {0,32767}(30)
   3 <1 "> <this one" >      | 29:  WHILEM[1/2](0)
                                    whilem: matched 0 out of 0..32767
   3 <1 "> <this one" >      |  7:    BRANCH(20)
   3 <1 "> <this one" >      |  8:      PLUS(29)
                                        ANYOF[\0-!#-[\]-\377{unicode_all}] can match 8 times out of 2147483647...
  11 <"this one> <" %n>      | 29:        WHILEM[1/2](0)
                                          whilem: matched 1 out of 0..32767
  11 <"this one> <" %n>      |  7:          BRANCH(20)
  11 <"this one> <" %n>      |  8:            PLUS(29)
                                              ANYOF[\0-!#-[\]-\377{unicode_all}] can match 0 times out of 2147483647...
                                              failed...
  11 <"this one> <" %n>      | 20:          BRANCH(28)
  11 <"this one> <" %n>      | 21:            CURLYM[0] {1,32767}(29)
  11 <"this one> <" %n>      | 23:              EXACT <\\>(25)
                                                failed...
                                              failed...
                                            BRANCH failed...
                                          whilem: failed, trying continuation...
  11 <"this one> <" %n>      | 30:          NOTHING(31)
  11 <"this one> <" %n>      | 31:          CLOSE1(33)
  11 <"this one> <" %n>      | 33:          EXACT <">(35)
  12 <"this one"> < %n>      | 35:          END(0)
Match successful!
Guessing start of match in sv for REx "%"((?:[^\\%"]|\\.)*)%"" against "1 %"this one%" %n"
Found floating substr "%"" at offset 2...
Contradicts anchored substr "%"", trying floating at offset 3...
Found floating substr "%"" at offset 11...
Found anchored substr "%"" at offset 2...
Starting position does not contradict /^/m...
Guessed: match at offset 2
Matching REx "%"((?:[^\\%"]|\\.)*)%"" against "%"this one%" %n"
   2 <1 > <"this one">       |  1:EXACT <">(3)
   3 <1 "> <this one" >      |  3:OPEN1(5)
   3 <1 "> <this one" >      |  5:CURLYX[0] {0,32767}(25)
   3 <1 "> <this one" >      | 24:  WHILEM[1/1](0)
                                    whilem: matched 0 out of 0..32767
   3 <1 "> <this one" >      |  7:    BRANCH(19)
   3 <1 "> <this one" >      |  8:      ANYOF[\0-!#-[\]-\377{unicode_all}](24)
   4 <1 "t> <his one" >      | 24:      WHILEM[1/1](0)
                                        whilem: matched 1 out of 0..32767
   4 <1 "t> <his one" >      |  7:        BRANCH(19)
   4 <1 "t> <his one" >      |  8:          ANYOF[\0-!#-[\]-\377{unicode_all}](24)
   5 <1 "th> <is one" %n>    | 24:          WHILEM[1/1](0)
                                            whilem: matched 2 out of 0..32767
   5 <1 "th> <is one" %n>    |  7:            BRANCH(19)
   5 <1 "th> <is one" %n>    |  8:              ANYOF[\0-!#-[\]-\377{unicode_all}](24)
   6 < "thi> <s one" %n>     | 24:              WHILEM[1/1](0)
                                                whilem: matched 3 out of 0..32767
   6 < "thi> <s one" %n>     |  7:                BRANCH(19)
   6 < "thi> <s one" %n>     |  8:                  ANYOF[\0-!#-[\]-\377{unicode_all}](24)
   7 <"this> < one" %n>      | 24:                  WHILEM[1/1](0)
                                                    whilem: matched 4 out of 0..32767
   7 <"this> < one" %n>      |  7:                    BRANCH(19)
   7 <"this> < one" %n>      |  8:                      ANYOF[\0-!#-[\]-\377{unicode_all}](24)
   8 <"this > <one" %n>      | 24:                      WHILEM[1/1](0)
                                                        whilem: matched 5 out of 0..32767
   8 <"this > <one" %n>      |  7:                        BRANCH(19)
   8 <"this > <one" %n>      |  8:                          ANYOF[\0-!#-[\]-\377{unicode_all}](24)
   9 <"this o> <ne" %n>      | 24:                          WHILEM[1/1](0)
                                                            whilem: matched 6 out of 0..32767
   9 <"this o> <ne" %n>      |  7:                            BRANCH(19)
   9 <"this o> <ne" %n>      |  8:                              ANYOF[\0-!#-[\]-\377{unicode_all}](24)
  10 <"this on> <e" %n>      | 24:                              WHILEM[1/1](0)
                                                                whilem: matched 7 out of 0..32767
  10 <"this on> <e" %n>      |  7:                                BRANCH(19)
  10 <"this on> <e" %n>      |  8:                                  ANYOF[\0-!#-[\]-\377{unicode_all}](24)
  11 <"this one> <" %n>      | 24:                                  WHILEM[1/1](0)
                                                                    whilem: matched 8 out of 0..32767
  11 <"this one> <" %n>      |  7:                                    BRANCH(19)
  11 <"this one> <" %n>      |  8:                                      ANYOF[\0-!#-[\]-\377{unicode_all}](24)
                                                                        failed...
  11 <"this one> <" %n>      | 19:                                    BRANCH(23)
  11 <"this one> <" %n>      | 20:                                      EXACT <\\>(22)
                                                                        failed...
                                                                      BRANCH failed...
                                                                    whilem: failed, trying continuation...
  11 <"this one> <" %n>      | 25:                                    NOTHING(26)
  11 <"this one> <" %n>      | 26:                                    CLOSE1(28)
  11 <"this one> <" %n>      | 28:                                    EXACT <">(30)
  12 <"this one"> < %n>      | 30:                                    END(0)
Match successful!
Freeing REx: "%"((?:[^%"\\]+|(?:\\.)+)*)%""
Freeing REx: "%"((?:[^\\%"]|\\.)*)%""





------------------------------

Date: Mon, 19 Oct 2009 21:08:40 +0800
From: Hongyi Zhao <hongyi.zhao@gmail.com>
Subject: Want to write a script to note specific IP addresses.
Message-Id: <95pod5hh1pcv9ej4to6k700921p96ljctp@4ax.com>

Hi all,

I want to write a script that annotates specific IP addresses by
appending the corresponding location information.  In detail, my
problem is as follows:

Suppose I have two files.  The first file stores the IP addresses
that I want to annotate, and the second file stores the IP database
along with the corresponding location information.

The first file has one IP address per line in dotted-decimal format,
e.g.:

0.125.125.125
4.19.79.28
4.36.124.150
 ...

The second file has four fields per line, delimited by CHARACTER
TABULATION (U+0009).  The four fields are StartIP, EndIP, Country,
and Local, e.g.:

StartIP	EndIP	Country	Local
0.0.0.0	0.255.255.255	IANA	CZ88.NET
4.19.79.0	4.19.79.63	American	Armed Forces Radio/Television
4.36.124.128	4.36.124.255	American	Technical Resource Connections Inc
 ...

Based on the second file, I want to reformat the first file by
appending the corresponding location information to each IP address
in it.  For the above example, I want to obtain the following
result:

0.125.125.125#IANA CZ88.NET
4.19.79.28#American Armed Forces Radio/Television
4.36.124.150#American Technical Resource Connections
 ...

Any hints on this issue will be highly appreciated.
Thanks in advance.
-- 
 .: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.


------------------------------

Date: Mon, 19 Oct 2009 07:28:10 -0700
From: sln@netherlands.com
Subject: Re: Want to write a script to note specific IP addresses.
Message-Id: <bbtod5tun9bgen1odllnpke7e1kfa3j7l4@4ax.com>

On Mon, 19 Oct 2009 21:08:40 +0800, Hongyi Zhao <hongyi.zhao@gmail.com> wrote:

>Hi all,
>
>I want to write a script that annotates specific IP addresses by
>appending the corresponding location information.  In detail, my
>problem is as follows:
>
>Suppose I have two files.  The first file stores the IP addresses
>that I want to annotate, and the second file stores the IP database
>along with the corresponding location information.
>
>The first file has one IP address per line in dotted-decimal format,
>e.g.:
>
>0.125.125.125
>4.19.79.28
>4.36.124.150
>...
>
>The second file has four fields per line, delimited by CHARACTER
>TABULATION (U+0009).  The four fields are StartIP, EndIP, Country,
>and Local, e.g.:
>
>StartIP	EndIP	Country	Local
>0.0.0.0	0.255.255.255	IANA	CZ88.NET
>4.19.79.0	4.19.79.63	American	Armed Forces Radio/Television
>4.36.124.128	4.36.124.255	American	Technical Resource Connections Inc
>...
>
>Based on the second file, I want to reformat the first file by
>appending the corresponding location information to each IP address
>in it.  For the above example, I want to obtain the following
>result:
>
>0.125.125.125#IANA CZ88.NET
>4.19.79.28#American Armed Forces Radio/Television
>4.36.124.150#American Technical Resource Connections
>...
>
>Any hints on this issue will be highly appreciated.
>Thanks in advance.

I can give you a hint.
Create a database table with four fields:
  StartIP, EndIP, Country, and Local

Iterate over the IP list and run one SQL query per address:
  SELECT * FROM tbl WHERE randomip >= tbl.StartIP AND randomip <= tbl.EndIP
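
The same range test can be sketched in plain Perl without a database. This
is only an illustration: the sub names ip_to_int, load_ranges and annotate
are made up, the data is inlined instead of read from the two files, and it
assumes IPv4 dotted-quad addresses. The addresses are converted to 32-bit
integers first, because comparing dotted strings does not order them
numerically.

```perl
use strict;
use warnings;

# Convert a dotted-quad IPv4 address to a 32-bit integer for numeric comparison.
sub ip_to_int {
    my ($ip) = @_;
    return unpack 'N', pack 'C4', split /\./, $ip;
}

# Parse tab-delimited "StartIP EndIP Country Local" lines into range records.
sub load_ranges {
    my (@lines) = @_;
    my @ranges;
    for my $line (@lines) {
        chomp $line;
        my ($start, $end, $country, $local) = split /\t/, $line, 4;
        # Skip the header row and any line that is not a complete record.
        next unless defined $local && $start =~ /^\d+(?:\.\d+){3}$/;
        push @ranges, [ ip_to_int($start), ip_to_int($end), $country, $local ];
    }
    return \@ranges;
}

# Return "ip#Country Local" for the first range containing the address,
# or the bare address when no range matches.
sub annotate {
    my ($ranges, $ip) = @_;
    my $n = ip_to_int($ip);
    for my $r (@$ranges) {
        return "$ip#$r->[2] $r->[3]" if $n >= $r->[0] && $n <= $r->[1];
    }
    return $ip;
}

# Inlined copy of the sample database from the question.
my $ranges = load_ranges(
    "StartIP\tEndIP\tCountry\tLocal\n",
    "0.0.0.0\t0.255.255.255\tIANA\tCZ88.NET\n",
    "4.19.79.0\t4.19.79.63\tAmerican\tArmed Forces Radio/Television\n",
    "4.36.124.128\t4.36.124.255\tAmerican\tTechnical Resource Connections Inc\n",
);

print annotate($ranges, $_), "\n" for qw(0.125.125.125 4.19.79.28 4.36.124.150);
```

The linear scan is fine for a small table. For a large database, sorting
the ranges by start address and using a binary search would be the obvious
next step.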

-sln


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 2642
***************************************

