[30716] in Perl-Users-Digest
Perl-Users Digest, Issue: 1961 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Nov 4 00:09:46 2008
Date: Mon, 3 Nov 2008 21:09:07 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Mon, 3 Nov 2008 Volume: 11 Number: 1961
Today's topics:
Re: /^From:.*?([\w.-]+@[\w.-]+)/ <someone@example.com>
Re: /^From:.*?([\w.-]+@[\w.-]+)/ <tim@burlyhost.com>
Re: /^From:.*?([\w.-]+@[\w.-]+)/ <tim@burlyhost.com>
Re: /^From:.*?([\w.-]+@[\w.-]+)/ sln@netherlands.com
Re: /^From:.*?([\w.-]+@[\w.-]+)/ <someone@example.com>
Re: /^From:.*?([\w.-]+@[\w.-]+)/ <xiaoxia2005a@yahoo.com>
Re: /^From:.*?([\w.-]+@[\w.-]+)/ <xiaoxia2005a@yahoo.com>
Re: /^From:.*?([\w.-]+@[\w.-]+)/ <uri@stemsystems.com>
Re: /^From:.*?([\w.-]+@[\w.-]+)/ <xiaoxia2005a@yahoo.com>
Re: A couple of questions regarding runtime generation sln@netherlands.com
Re: A couple of questions regarding runtime generation sln@netherlands.com
Re: How to get the memory address of a Perl variable in <sisyphus359@gmail.com>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Mon, 03 Nov 2008 12:03:08 -0800
From: "John W. Krahn" <someone@example.com>
Subject: Re: /^From:.*?([\w.-]+@[\w.-]+)/
Message-Id: <0OIPk.11338$kd5.11135@newsfe01.iad>
Tim Greer wrote:
> April wrote:
>
>> On Nov 3, 1:37 am, Tim Greer <t...@burlyhost.com> wrote:
>>>> I've started to love this place and you guys .. :-)
>>> BTW, if you know it'll only be white space (space, tabs, etc.)
>>> between the ^From:? and email@address, then \s+ would probably be a
>>> better idea... unless you suspect other non \w, ., and - characters
>>> will exist between it and don't want to try and predict them.
>>
>> you mean '^From:\s+?', how about '^From:\s*?', to also cover the case
>> no white space or anything at all?
>
> Yes, if it might have white space or might have none at all, then \s*
> for zero or more is what you want. \s*? isn't necessary here, since
> \s* is already zero or more, so making it an optional match doesn't
> matter,
The ? on the end of \s*? changes \s* to non-greedy, the * makes it optional.
> since if it doesn't exist, it's already "zero". Be sure to
> make : optional on From though, since your examples don't have it each
> time.
John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall
------------------------------
Date: Mon, 03 Nov 2008 12:23:19 -0800
From: Tim Greer <tim@burlyhost.com>
Subject: Re: /^From:.*?([\w.-]+@[\w.-]+)/
Message-Id: <X4JPk.1491$1r1.239@newsfe01.iad>
John W. Krahn wrote:
> Tim Greer wrote:
>> April wrote:
>>
>>> On Nov 3, 1:37 am, Tim Greer <t...@burlyhost.com> wrote:
>>>>> I've started to love this place and you guys .. :-)
>>>> BTW, if you know it'll only be white space (space, tabs, etc.)
>>>> between the ^From:? and email@address, then \s+ would probably be a
>>>> better idea... unless you suspect other non \w, ., and - characters
>>>> will exist between it and don't want to try and predict them.
>>>
>>> you mean '^From:\s+?', how about '^From:\s*?', to also cover the
>>> case no white space or anything at all?
>>
>> Yes, if it might have white space or might have none at all, then \s*
>> for zero or more is what you want. \s*? isn't necessary here, since
>> \s* is already zero or more, so making it an optional match doesn't
>> matter,
>
> The ? on the end of \s*? changes \s* to non-greedy, the * makes it
> optional.
Right, I know. However, they'll never want to capture any of the white
space between From:? and ([\w.-]+@[\w.-]+), so \s* should suffice and
doesn't require non greedy to function as expected.
--
Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
Industry's most experienced staff! -- Web Hosting With Muscle!
------------------------------
Date: Mon, 03 Nov 2008 12:26:13 -0800
From: Tim Greer <tim@burlyhost.com>
Subject: Re: /^From:.*?([\w.-]+@[\w.-]+)/
Message-Id: <N7JPk.1493$1r1.1482@newsfe01.iad>
Tim Greer wrote:
> John W. Krahn wrote:
>
>> Tim Greer wrote:
>>> April wrote:
>>>
>>>> On Nov 3, 1:37 am, Tim Greer <t...@burlyhost.com> wrote:
>>>>>> I've started to love this place and you guys .. :-)
>>>>> BTW, if you know it'll only be white space (space, tabs, etc.)
>>>>> between the ^From:? and email@address, then \s+ would probably be
>>>>> a better idea... unless you suspect other non \w, ., and -
>>>>> characters will exist between it and don't want to try and predict
>>>>> them.
>>>>
>>>> you mean '^From:\s+?', how about '^From:\s*?', to also cover the
>>>> case no white space or anything at all?
>>>
>>> Yes, if it might have white space or might have none at all, then
>>> \s*
>>> for zero or more is what you want. \s*? isn't necessary here, since
>>> \s* is already zero or more, so making it an optional match doesn't
>>> matter,
>>
>> The ? on the end of \s*? changes \s* to non-greedy, the * makes it
>> optional.
>
> Right, I know. However, they'll never want to capture any of the
> white space between From:? and ([\w.-]+@[\w.-]+), so \s* should
> suffice and doesn't require non greedy to function as expected.
Pardon, to be more specific, I misused the word capture (that's
obvious). I simply mean that they appear to want to match all white
space (zero or as many as there is), and don't need to use a non greedy
match there. Not that it would matter, but it's not necessary to use
from what I can see.
------------------------------
Date: Mon, 03 Nov 2008 20:59:25 GMT
From: sln@netherlands.com
Subject: Re: /^From:.*?([\w.-]+@[\w.-]+)/
Message-Id: <8geug4dpfa3q46t1igh2g7ql5unts11bfs@4ax.com>
On Sun, 2 Nov 2008 20:03:20 -0800 (PST), April <xiaoxia2005a@yahoo.com> wrote:
>On Nov 2, 8:34 pm, s...@netherlands.com wrote:
>>
>> Since you are asking this question, it is not clear to you at all April.
>>
>
>you really know me, however with your inspiration, I'm pretty sure
>I'll be getting better sooner.
>
>> Look at that expression. '[\w.-]+@' is a hard anchor. That must be satisfied
>> first, especially the '@'.
>
>not sure I agree with this and the following Church ranking thing ...
>
>>
>> The fact is '[\w.-]+' can be satisfied with a single character.
>
>agree.
>
>> The other fact is '.*?' can get by with one character but it is in the bottom
>> of precedence.
>
>believe '.*?' can get by with 0 character too.
>
Yes, it can match zero characters. It's a filter, but a
one-character-at-a-time filter.
>> However '.*' wants to take as much of the string as possible,
>
>agree, but will still check to see whether that will allow the
>following [\w.-] to be satisfied.
>
Yes, but '[\w.-]+' can be satisfied with a single character.
Thus '.*' will grab all before that single character in a greedy fashion.
This will take as long as the non-greedy but will not get the right results.
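
A minimal sketch of the "wrong results" point, assuming a made-up header line (the address is only an illustration). Greedy '.*' backs off just far enough to leave a single character in front of the '@', while '.*?' stops at the first place the address can start:

```perl
use strict;
use warnings;

# Hypothetical sample line; single quotes keep Perl from interpolating '@yahoo'.
my $line = 'From: april@yahoo.com';

# Returns the first capture group, or undef if the pattern does not match.
sub capture {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

my $greedy = capture(qr/^From:.*([\w.-]+@[\w.-]+)/,  $line);
my $lazy   = capture(qr/^From:.*?([\w.-]+@[\w.-]+)/, $line);

print "greedy .*  : $greedy\n";   # l@yahoo.com
print "lazy   .*? : $lazy\n";     # april@yahoo.com
```

Both patterns match the line; only the captured text differs.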
>>
>> Here is the hierarchy from top down:
>>
>> 1 - '@' is GOD
>> 2 - '[\w.-]+' is CHRIST
>> 3 - '.*' is the greedy HOLY GHOST
>> 4 - '.*?' is the single ANGEL
>>
[snip]
>Just found and read "Regular Expression Tutorial Part 5: Greedy and Non-
>Greedy Quantification" by Andrew Johnson (which can be found on the
>Internet by searching). Andrew provides a pretty convincing
>explanation on how '.*?' works. I believe the use of '.*?' will take
>care of no space, one or more other characters, including space, tab,
>etc., that appear before the real email address but are not matched
>by [\w.-].
>
I never read that book. It's probably good.
In my experience, negative greed (a greedy negated character class) is one
of the most useful concepts.
I always look to add negative greed to expressions.
In terms of greed, once the engine knows what not to look for, it will
grab all it can up to that point. Then it will look at the next term
in the regex.
This is the same as non-greedy, but the greedy one grabs a chunk of
matched data at a time, whereas the non-greedy will grab one occurrence
at a time. They both then check the next term for a match.
Knowing this, you can shorten the time the data takes to process.
Example:
$data = "From: ]]]][[[[*****\\ -2ame\@yahoo.com";
$data =~ /^From:[^\w.-]*([\w.-]+@[\w.-]+)/;
is roughly 2x faster (going by the benchmark at the end of this post) than this
$data =~ /^From:.*?([\w.-]+@[\w.-]+)/;
The reason is that the engine grabs the greedy chunk first.
We stopped the greed at a boundary where the next character
satisfies the next term '[\w.-]+'.
Non-greedy only advances one character at a time, checking each time whether
the next character will satisfy the next term '[\w.-]+'. The repeated iteration
consumes a large chunk of the processing time.
The more a non-greedy term has to process, the longer it takes. It could be
non-linear as well, not sure.
If the above were '$data = "From: -2ame\@yahoo.com";', the processing times
would be about equal. The more characters '.*?' has to skip, the longer it takes.
There are times when you don't know exactly where to stop the greed,
but wherever possible, try to let the greed be there. You just have
to think about it and test all possible scenarios.
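
One scenario worth testing, sketched here with a hypothetical second header line that has a display name in front of the address. The two patterns agree on the sample data above, but the negated class cannot skip over word characters, so it does not match a line like that at all:

```perl
use strict;
use warnings;

my $lazy_dot = qr/^From:.*?([\w.-]+@[\w.-]+)/;
my $negated  = qr/^From:[^\w.-]*([\w.-]+@[\w.-]+)/;

# Returns the captured address, or the string 'no match'.
sub address_or_none {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : 'no match';
}

# Hypothetical header lines; the second has a display name before the address.
my @lines = (
    'From: ]]]][[[[*****\\ -2ame@yahoo.com',
    'From: April <april@yahoo.com>',
);

for my $line (@lines) {
    printf "%-40s lazy: %-18s negated: %s\n",
        $line,
        address_or_none($lazy_dot, $line),
        address_or_none($negated,  $line);
}
```

So the faster pattern is only a drop-in replacement when nothing matching [\w.-] can appear between "From:" and the address.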
In a looping scenario, say a parser, where everything is processed in a
repeating fashion, there is usually a sink/filter that picks up waste, comments
or formatting data, and it typically takes the '.*?' form. This gives
the patterns a chance to match on the next character.
If only 1, 2 or 3 distinct characters can start the patterns, a greedy
(negative) term can take you up to them quickly, giving the patterns a chance
to match without checking at every character. In that case you can use negative
greed; you just have to know how to get past those characters in case the
patterns don't match.
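Typically:
$lcbpos = 0;
while (/<($pat1|$pat2|$pat3)>|([^<]*)(<?)/g) {
    if (defined $2) {
        # filler matched; if it swallowed a '<', back up once so the
        # patterns get a chance to match starting at that '<'
        if (length($3) && $lcbpos != pos($_)) {
            $lcbpos = pos($_);
            pos($_) = $lcbpos - 1;
        }
        next;
    }
    # found pattern; $1 holds the part between '<' and '>'
}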
So negative greed is a good thing indeed. It's advisable to always try to be greedy.
But this is unavoidable sometimes: /ANCHOR's.*?AWAY/
Here are some benchmarks concerning greed and your email regexp.
--------------------------
use strict;
use warnings;
use Benchmark ':hireswallclock';

my $email = "From: ]]]][[[[*****\\ -2ame\@yahoo.com";
my ($t0, $t1, $tdif);

### Non-Greedy '.*?'
$t0 = Benchmark->new;
for (1 .. 10000)
{
    $email =~ /^From:.*?([\w.-]+@[\w.-]+)/;
}
$t1 = Benchmark->new;
$tdif = timediff($t1, $t0);
print "\nNon-greedy '.*?' --\n the code took:",timestr($tdif),"\n";

### Greedy '[^\\w.-]*'
$t0 = Benchmark->new;
for (1 .. 10000)
{
    $email =~ /^From:[^\w.-]*([\w.-]+@[\w.-]+)/;
}
$t1 = Benchmark->new;
$tdif = timediff($t1, $t0);
print "\nGreedy '[^\\w.-]*' --\n the code took:",timestr($tdif),"\n";
__END__
Non-greedy '.*?' --
the code took:0.03332 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)
Greedy '[^\w.-]*' --
the code took:0.016902 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)
--------------
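
For comparison, a hedged sketch of the same measurement using cmpthese from the Benchmark module, which runs each candidate for a minimum amount of CPU time and prints a relative-speed table. The hash keys are made-up labels, and the numbers will vary by machine and Perl version:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $email = "From: ]]]][[[[*****\\ -2ame\@yahoo.com";

# Each sub returns the captured address so the two variants can be
# checked for agreement as well as timed.
my %candidates = (
    lazy_dot => sub { $email =~ /^From:.*?([\w.-]+@[\w.-]+)/       ? $1 : undef },
    negated  => sub { $email =~ /^From:[^\w.-]*([\w.-]+@[\w.-]+)/ ? $1 : undef },
);

# A negative count asks Benchmark to run each sub for at least
# that many CPU seconds instead of a fixed number of iterations.
cmpthese(-1, \%candidates);
```

Letting Benchmark choose the iteration count avoids reading too much into a run that finishes in a few hundredths of a second.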
sln
------------------------------
Date: Mon, 03 Nov 2008 13:03:13 -0800
From: "John W. Krahn" <someone@example.com>
Subject: Re: /^From:.*?([\w.-]+@[\w.-]+)/
Message-Id: <lGJPk.16048$AW.6087@newsfe01.iad>
Tim Greer wrote:
> Tim Greer wrote:
>
>> John W. Krahn wrote:
>>
>>> Tim Greer wrote:
>>>>
>>>> Yes, if it might have white space or might have none at all, then
>>>> \s*
>>>> for zero or more is what you want. \s*? isn't necessary here, since
>>>> \s* is already zero or more, so making it an optional match doesn't
>>>> matter,
>>>
>>> The ? on the end of \s*? changes \s* to non-greedy, the * makes it
>>> optional.
>>
>> Right, I know. However, they'll never want to capture any of the
>> white space between From:? and ([\w.-]+@[\w.-]+), so \s* should
>> suffice and doesn't require non greedy to function as expected.
>
> Pardon, to be more specific, I misused the word capture (that's
> obvious). I simply mean that they appear to want to match all white
> space (zero or as many as there is), and don't need to use a non greedy
> match there. Not that it would matter, but it's not necessary to use
> from what I can see.
Right, because the (optional) whitespace is anchored by 'm:?' on the
left and '[\w.-]+' on the right, the greediness is irrelevant.
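
A small sketch of that point, assuming a few made-up header lines. The extra capture group around the whitespace is added here only to show how much each variant consumed; it is not part of the pattern discussed in the thread:

```perl
use strict;
use warnings;

# Hypothetical sample lines: one space, no space, and no colon with mixed whitespace.
my @lines = (
    'From: tim@example.com',
    'From:tim@example.com',
    "From \t tim\@example.com",
);

my $greedy = qr/^From:?(\s*)([\w.-]+@[\w.-]+)/;
my $lazy   = qr/^From:?(\s*?)([\w.-]+@[\w.-]+)/;

# Returns (number of whitespace characters consumed, captured address),
# or an empty list if the pattern does not match.
sub whitespace_and_address {
    my ($re, $text) = @_;
    return $text =~ $re ? (length($1), $2) : ();
}

for my $line (@lines) {
    my @g = whitespace_and_address($greedy, $line);
    my @l = whitespace_and_address($lazy,   $line);
    print "@g" eq "@l" ? "same" : "different", ": @g\n";
}
```

Because \s and [\w.-] share no characters, both variants have to stop at the same place.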
John
------------------------------
Date: Mon, 3 Nov 2008 19:11:33 -0800 (PST)
From: April <xiaoxia2005a@yahoo.com>
Subject: Re: /^From:.*?([\w.-]+@[\w.-]+)/
Message-Id: <7c1cf5da-89a4-4440-80dc-0cf54838f2cb@c2g2000pra.googlegroups.com>
On Nov 3, 1:52 pm, Tim Greer <t...@burlyhost.com> wrote:
>
> Yes, if it might have white space or might have none at all, then \s*
> for zero or more is what you want.  \s*? isn't necessary here, since
> \s* is already zero or more, so making it an optional match doesn't
> matter, since if it doesn't exist, it's already "zero".  Be sure to
> make : optional on From though, since your examples don't have it each
> time.
that's right, greediness has nowhere to go here, since the match is only
against white space.
by the way, using ? to make : optional is also good thinking.
? seems a pretty interesting quantifier in re. it relates to both
optional and non-greedy.
------------------------------
Date: Mon, 3 Nov 2008 19:13:06 -0800 (PST)
From: April <xiaoxia2005a@yahoo.com>
Subject: Re: /^From:.*?([\w.-]+@[\w.-]+)/
Message-Id: <67451207-a0ca-4b50-b189-644f9ed8cb91@v39g2000pro.googlegroups.com>
On Nov 3, 3:59 pm, s...@netherlands.com wrote:
>
> Non-greedy '.*?' --
>  the code took:0.03332 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU)
>
> Greedy '[^\w.-]*' --
>  the code took:0.016902 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU)
impressive!
------------------------------
Date: Mon, 03 Nov 2008 22:18:32 -0500
From: Uri Guttman <uri@stemsystems.com>
Subject: Re: /^From:.*?([\w.-]+@[\w.-]+)/
Message-Id: <x7ljw041rr.fsf@mail.sysarch.com>
>>>>> "A" == April <xiaoxia2005a@yahoo.com> writes:
A> ? seems a pretty interesting quantifier in re. it relates to both
A> optional and non-greedy.
that is wrong thinking. one is a quantifier (0 or 1 of the previous
thing). the other is a modifier (makes the previous quantifier
non-greedy). don't assume any sort of relationship because of the use of
? for both of those roles.
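
A short sketch of the two roles side by side, on a made-up test string. In a? the question mark is itself the quantifier; in a?? and a+? the trailing question mark only changes how much the preceding quantifier prefers to take:

```perl
use strict;
use warnings;

my $text = 'aaa';

my ($optional)      = $text =~ /^(a?)/;    # quantifier: zero or one 'a', prefers one
my ($lazy_optional) = $text =~ /^(a??)/;   # same quantifier made non-greedy, prefers zero
my ($greedy_plus)   = $text =~ /^(a+)/;    # one or more, as many as possible
my ($lazy_plus)     = $text =~ /^(a+?)/;   # one or more, as few as possible

printf "a?  => '%s'\n", $optional;       # 'a'
printf "a?? => '%s'\n", $lazy_optional;  # ''
printf "a+  => '%s'\n", $greedy_plus;    # 'aaa'
printf "a+? => '%s'\n", $lazy_plus;      # 'a'
```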
<yes, i am back! :) verizon screwed over my usenet feed and i finally
switched to a free text one>
uri
--
Uri Guttman ------ uri@stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
------------------------------
Date: Mon, 3 Nov 2008 19:37:37 -0800 (PST)
From: April <xiaoxia2005a@yahoo.com>
Subject: Re: /^From:.*?([\w.-]+@[\w.-]+)/
Message-Id: <777b1db0-f255-4de1-add0-609c87d7aabc@r37g2000prr.googlegroups.com>
On Nov 3, 10:18 pm, Uri Guttman <u...@stemsystems.com> wrote:
>
>   A> ? seems a pretty interesting quantifier in re.  it relates to both
>   A> optional and non-greedy.
>
> that is wrong thinking. one is a quantifier (0 or 1 of the previous
> thing). the other is a modifier (makes the previous quantifier
> non-greedy). don't assume any sort of relationship because of the use of
> ? for both of those roles.
>
ok.. welcome back then, though I didn't know you were gone.
------------------------------
Date: Mon, 03 Nov 2008 22:00:01 GMT
From: sln@netherlands.com
Subject: Re: A couple of questions regarding runtime generation of REGEXP's
Message-Id: <m1tug4te6aot3stv0ti7l5eu3ool0155gt@4ax.com>
On Sun, 2 Nov 2008 21:49:03 -0600, Tad J McClellan <tadmc@seesig.invalid> wrote:
>sln@netherlands.com <sln@netherlands.com> wrote:
>
>> Basically I'm writing a sub that wants to take a regular
>> expression as a parameter. It then blindly operates on data,
>> matching, and possible substitution.
>>
>> Apparently qr// will only function on the matching side, something like this:
>
>
>"qr" stands for "quote regular expression" and the so called
>"matching side" of s/// is the part that is a regular expression.
>
>qr will work fine there.
>
>(the other "side" is the "replacement string", ie. it is not
>a regular expression at all.)
>
>
>> # does not work, no way no how
>
>
>Of course not. You are trying to quote something that is not
>a regular expression.
>
>
>> $rx = qr{s/\Q$sometext\E/junk/g};
>
>That regular expression will match if the string contains:
> an "s" character followed by
> a "/" character followed by
> the literal contents of $sometext followed by
> a "/" character followed by
> a "j" character followed by
> a "u" character followed by
> ...
>
>So that will match if:
>
> my $data = "s/$sometext/junk/g";
>
>
>> $data =~ $rx;
>
> my $rx = qr/\Q$sometext\E/; # quote only the regex part
> $data =~ s/$rx/junk/g; # works fine
>
>
>> And if it does compile, like the above does, it should work.
>
>
>It does work (but only if $data actually contains the characters listed above).
>
>
>> Is there anyway possible the substitution side will work?
>
>
>Yes. See above.
That's clear, no surprises then.
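
A runnable sketch of the two cases from the quoted reply, with made-up values for $sometext and $data. The first part uses qr// only for the regex side of s///; the second shows that the mistaken form compiles fine but matches the literal characters "s/.../junk/g":

```perl
use strict;
use warnings;

my $sometext = 'a.b';            # hypothetical text containing a regex metacharacter
my $data     = 'a.b axb a.b';    # hypothetical data to operate on

# Quote only the regex part, then use it on the matching side of s///.
my $rx = qr/\Q$sometext\E/;
(my $replaced = $data) =~ s/$rx/junk/g;
print "$replaced\n";             # junk axb junk

# The mistaken form is still a valid regex; it just looks for the
# literal text "s/a.b/junk/g" inside the string it is applied to.
my $literal = qr{s/\Q$sometext\E/junk/g};
my $hits_data     = $data =~ $literal ? 'yes' : 'no';
my $hits_spelling = "s/$sometext/junk/g" =~ $literal ? 'yes' : 'no';
print "matches the data: $hits_data\n";               # no
print "matches its own spelling: $hits_spelling\n";   # yes
```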
Thanks!
sln
------------------------------
Date: Mon, 03 Nov 2008 23:01:35 GMT
From: sln@netherlands.com
Subject: Re: A couple of questions regarding runtime generation of REGEXP's
Message-Id: <s3tug451ob6lp9ordjqcath3kjm6eo83o0@4ax.com>
On Mon, 03 Nov 2008 14:14:52 +0100, Michele Dondi <bik.mido@tiscalinet.it> wrote:
>On Mon, 03 Nov 2008 00:24:30 GMT, sln@netherlands.com wrote:
>
>>I'm probably going to use some wrong terms here but I
>>hope to give enough detail that I can get a definitive
>>resolution to this, once and for all.
>>
>>Basically I'm writing a sub that wants to take a regular
>>expression as a parameter. It then blindly operates on data,
>>matching, and possible substitution.
>[cut]
>># does not work, no way no how
>>$rx = qr{s/\Q$sometext\E/junk/g};
>
>Actually, this comes out oh so often! Others duly explained to you
>what's going on. Bottom line is, you *can't* "save" a substitution as
>a first order object of the language. The substitution part of a
>substitution, though, is "simply" a string: well, either that or code
>- if the /e modifier is supplied. In both cases you can *think* of it,
>possibly at the expense of a tiny wrapper layer, as a sub. Thus a
>solution to your problem, albeit not just as "slim" as you may have
>hoped for, may be given in terms of a couple consisting of a regex and
>a sub. Sounds reasonable?
>
>
>Michele
No matter how I look at it, the replacement is still a string,
constructed in the scope of the block that invokes the regexp engine.
So s/.../$somereplacement$1$2$3/ can be valid.
Or s/.../somesub($1,$2,$3)/e can be valid.
And only the qr// part, i.e. the regular expression, can be compiled ahead of the =~ (if constant).
In that case the s and g of s///g, the g of //g and, I take it, the e of s///e have no
meaning to qr//, because they are not part of the regular expression, whereas modifiers
like //i are, because they act on the regular expression itself.
To me, then, it is a misnomer to call 's/$regx/$txt/g' a regular expression, since
it can't be known before the scope block that invokes it, but qr// can be.
In my opinion, s///g should be allowed by qr{}, using the scoping block it was created
in, and later correctly used (as s///g) within the context of a block that invokes the engine.
This may violate 'first-order object' of the language. But then why are code extensions allowed,
as in qr/(?{ code })/, and what is the scoping for them? To me this looks like a parsing issue, and
if allowed it would internally result in a dynamic code issue like eval.
I don't know that this 'code' extension isn't treated as a literal anyway.
I don't know if invoking a 'sub' (/e) is going to be any better than having to
parse through a passed-in argument list for the proper form. In all cases, it looks
like the replacement text cannot include special vars unless an eval is used
at runtime.
Can you give an example of your regex and a sub solution?
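
One way such a "regex plus sub" pair might look, as a minimal sketch and not necessarily what Michele had in mind. The helper name apply_rule and the rule layout are hypothetical, and only the first two capture groups are passed along:

```perl
use strict;
use warnings;

# Hypothetical helper: applies one [regex, replacement-sub] pair to a string.
# The replacement sub receives the first two capture groups and returns
# the text that should replace each match.
sub apply_rule {
    my ($text, $rule) = @_;
    my ($re, $make_replacement) = @$rule;
    $text =~ s/$re/ $make_replacement->($1, $2) /ge;
    return $text;
}

# A rule is just a compiled regex paired with an ordinary sub.
my $swap = [ qr/(\w+)=(\w+)/, sub { my ($k, $v) = @_; "$v=$k" } ];

print apply_rule('a=1 b=2', $swap), "\n";   # 1=a 2=b
```

The capture variables are read inside the s///e that ran the match, so the replacement can use them without a string eval.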
Thanks.
sln
------------------------------
Date: Mon, 3 Nov 2008 17:49:48 -0800 (PST)
From: sisyphus <sisyphus359@gmail.com>
Subject: Re: How to get the memory address of a Perl variable in XS
Message-Id: <ed9da400-90a5-4f79-8867-cb4066560ee7@b31g2000prf.googlegroups.com>
On Nov 3, 11:24 am, cyl <u8526...@gmail.com> wrote:
.
.
> How do I get the memory address of \@arr in xs_test?
Does the following return the appropriate value ?
IV xs_test(SV * arref) {
    /* PTR2IV avoids truncating the pointer where int is narrower than a pointer */
    return PTR2IV(arref);
}
Cheers,
Rob
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc. For subscription or unsubscription requests, send
#the single line:
#
# subscribe perl-users
#or:
# unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.
NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 1961
***************************************