[32755] in Perl-Users-Digest
Perl-Users Digest, Issue: 4019 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Aug 24 00:09:38 2013
Date: Fri, 23 Aug 2013 21:09:03 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Fri, 23 Aug 2013 Volume: 11 Number: 4019
Today's topics:
Re: Variable length lookbehind not implemented <rweikusat@mobileactivedefense.com>
Re: Variable length lookbehind not implemented <rweikusat@mobileactivedefense.com>
Re: Variable length lookbehind not implemented <derykus@gmail.com>
Re: Variable length lookbehind not implemented <rweikusat@mobileactivedefense.com>
Re: Variable length lookbehind not implemented <rweikusat@mobileactivedefense.com>
Re: Variable length lookbehind not implemented <derykus@gmail.com>
Re: Variable length lookbehind not implemented <derykus@gmail.com>
Re: Variable length lookbehind not implemented <derykus@gmail.com>
Re: Variable length lookbehind not implemented <derykus@gmail.com>
Re: Variable length lookbehind not implemented fmassion@web.de
Re: Variable length lookbehind not implemented <rweikusat@mobileactivedefense.com>
Re: Variable length lookbehind not implemented <rweikusat@mobileactivedefense.com>
Re: Variable length lookbehind not implemented <rweikusat@mobileactivedefense.com>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Thu, 22 Aug 2013 14:18:42 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Variable length lookbehind not implemented
Message-Id: <87ppt5n9sd.fsf@sapphire.mobileactivedefense.com>
Charles DeRykus <derykus@gmail.com> writes:
> On 8/21/2013 2:11 PM, Charles DeRykus wrote:
>> ....
>>
>> my $text;
>> { undef $/; $text = <IN>;}
>>
>
> Better written: { local $/; $text = <IN>}
Adding the reason for that: local $/ creates a new binding for $/
which is dynamically scoped to the enclosing block (it has dynamic
extent and indefinite scope[*]). This implies that $/ reverts to its
former value after the enclosing block has finished executing. Except
in very 'controlled and limited' circumstances, this is preferable to
overwriting whatever the current value happens to be at the moment and
'leaking' this 'local policy decision' to all code executing
after the block.
[*] The Lisp-terminology[**] is somewhat lacking here because the
newly established binding is only visible to code which is reachable
via an execution path starting in the block and this will usually only
be a subset of all of the program code (in absence of travesties like
'execute a random function found via the symbol table of a random
package').
[**]
http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node43.html
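A minimal sketch of the point above, assuming an in-memory filehandle in place of the IN handle from the quoted code. The helper name slurp_fh and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole filehandle without leaking the change to $/.
sub slurp_fh {
    my ($fh) = @_;
    local $/;               # $/ is undef only until this sub returns
    return scalar <$fh>;    # one read returns the complete input
}

my $data = "line one\nline two\n";    # sample input, not from the thread
open(my $in, '<', \$data) or die "cannot open in-memory file: $!";

my $text = slurp_fh($in);
close($in);

print "slurped everything\n" if $text eq $data;
print defined($/) ? "\$/ restored\n" : "\$/ still undef\n";
```

With { undef $/; ... } instead of local, the last line would report that $/ is still undef, and every later line-by-line read in the program would slurp.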
------------------------------
Date: Thu, 22 Aug 2013 15:50:06 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Variable length lookbehind not implemented
Message-Id: <87ioyxn5k1.fsf@sapphire.mobileactivedefense.com>
fmassion@web.de writes:
> Sorry, I found a flaw in the expression:
>
> while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {
>
> If the text doesn't end with a tag, the last $out is not printed in:
> print $out, $in;
>
> The last printed character is a ">"
You could use a proper 'lexer' for HTML.
NB: This is something I just wrote down because I thought it couldn't
be that difficult. It is assumed that numbers which are part of a word
shouldn't be bracketed.
--------------
{
    local $/;
    $_ = <STDIN>;
}
my $in_tag;
{
    unless ($in_tag) {
        /\G</gc && do {
            ++$in_tag;
            print('<');
            redo;
        };
        /\G\b(\d+)\b/gc && do {
            print("[$1]");
            redo;
        };
        (/\G(\d+)/gc
         || /\G([^\d<]+)/gc) && do {
            print($1);
            redo;
        };
    } else {
        /\G>/gc && do {
            print('>');
            --$in_tag;
            redo;
        };
        /\G</gc && do {
            print('<');
            ++$in_tag;
            redo;
        };
        /\G([^<>]+)/gc && do {
            print($1);
            redo;
        };
    }
}
------------------------------
Date: Thu, 22 Aug 2013 09:18:41 -0700
From: Charles DeRykus <derykus@gmail.com>
Subject: Re: Variable length lookbehind not implemented
Message-Id: <kv5dmd$ecr$1@speranza.aioe.org>
On 8/22/2013 6:05 AM, fmassion@web.de wrote:
> Sorry, I found a flaw in the expression:
>
> while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {
>
> If the text doesn't end with a tag, the last $out is not printed in:
> print $out, $in;
>
> The last printed character is a ">"
> We need somehow to find an expression which prints the remaining characters.
This might be a quick fix.. but again it's probably fragile
in many cases.
while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
    my($out, $in) = ($1 // '', $2 // '');
    $out =~ s/(\d+)/[$1]/ag;
    print $out,$in;
}
If unfamiliar with any of the above replacement regex items:
See: perldoc perlre # (?: ) and/or \z
perldoc perlop # \G and/or //
also perlre for the /a modifier
--
Charles DeRykus
------------------------------
Date: Thu, 22 Aug 2013 19:59:19 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Variable length lookbehind not implemented
Message-Id: <877gfdy2k8.fsf@sapphire.mobileactivedefense.com>
Charles DeRykus <derykus@gmail.com> writes:
> On 8/22/2013 6:05 AM, fmassion@web.de wrote:
>> Sorry, I found a flaw in the expression:
>>
>> while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {
>>
>> If the text doesn't end with a tag, the last $out is not printed in:
>> print $out, $in;
[...]
> This might be a quick fix.. but again it's probably fragile
> in many cases.
>
> while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
> my($out, $in) = ($1 // '', $2 // '');
> $out =~ s/(\d+)/[$1]/ag;
> print $out,$in;
> }
It will also replace numbers in words (which may or may not be
desired). Also, according to a quick test, using
while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {
works, too.
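To illustrate the remark about numbers in words, here is a small sketch comparing the plain substitution with a \b-anchored one (the anchoring used in the lexer posted earlier in this thread). The sample string is chosen only for illustration.

```perl
use strict;
use warnings;

my $sample = "slitting 1008 scrap box5";    # sample text for illustration

(my $every = $sample) =~ s/(\d+)/[$1]/g;        # brackets every digit run
(my $whole = $sample) =~ s/\b(\d+)\b/[$1]/g;    # skips digits glued to letters

print "$every\n";    # slitting [1008] scrap box[5]
print "$whole\n";    # slitting [1008] scrap box5
```

Which of the two is wanted depends on whether "box5" counts as a word or as a word followed by a number.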
------------------------------
Date: Thu, 22 Aug 2013 22:01:22 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Variable length lookbehind not implemented
Message-Id: <8738q1xwwt.fsf@sapphire.mobileactivedefense.com>
Charles DeRykus <derykus@gmail.com> writes:
[...]
> while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
> my($out, $in) = ($1 // '', $2 // '');
Also according to a quick test I made, a () which matched an empty
string (this includes 'optional' ()s which didn't match anything)
causes an empty string to be put into the corresponding $n which
implies that the $1 // '' is not even useful as workaround for
less-than-useful perl runtime warnings.
------------------------------
Date: Thu, 22 Aug 2013 14:52:40 -0700
From: Charles DeRykus <derykus@gmail.com>
Subject: Re: Variable length lookbehind not implemented
Message-Id: <kv618j$709$1@speranza.aioe.org>
On 8/22/2013 2:01 PM, Rainer Weikusat wrote:
> Charles DeRykus <derykus@gmail.com> writes:
>
> [...]
>
>> while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
>> my($out, $in) = ($1 // '', $2 // '');
>
> Also according to a quick test I made, a () which matched an empty
> string (this includes 'optional' ()s which didn't match anything)
> causes an empty string to be put into the corresponding $n which
> implies that the $1 // '' is not even useful as workaround for
> less-than-useful perl runtime warnings.
>
That's much better. (But, that's why I was careful to use the weasel
words "quick" and "fragile" when responding :)
And since the html's pedigree is unknown, an un-entified "<" causes
problems for both:
just a single un-entified < and any no. 1,2,... to \z vanish
You could add /c and take care of even that I think but, at some point
if you want another great leap, a parser is the way to go.
--
Charles DeRykus
------------------------------
Date: Thu, 22 Aug 2013 14:55:28 -0700
From: Charles DeRykus <derykus@gmail.com>
Subject: Re: Variable length lookbehind not implemented
Message-Id: <kv61dr$709$2@speranza.aioe.org>
On 8/22/2013 2:52 PM, Charles DeRykus wrote:
> ...
>
> You could add /c and take care of even that I think...
>
Nope, /c doesn't help.
--
Charles DeRykus
------------------------------
Date: Thu, 22 Aug 2013 22:53:13 -0700
From: Charles DeRykus <derykus@gmail.com>
Subject: Re: Variable length lookbehind not implemented
Message-Id: <kv6tdp$1te$1@speranza.aioe.org>
On 8/22/2013 2:52 PM, Charles DeRykus wrote:
> On 8/22/2013 2:01 PM, Rainer Weikusat wrote:
>> Charles DeRykus <derykus@gmail.com> writes:
>> ...
> if you want another great leap, a parser is the way to go.
>
I'm not sure this is the "great leap" but here's a possible parser approach:
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new_from_file( $filename );
foreach my $tag ($root->look_down(sub{1) ) {
    while( my($index,$child) = each @{ $tag->content_array_ref } ) {
        unless ( ref($child) eq "HTML::Element" ) {
            $child =~ s/(\d+)/[$1]/ag;   # replaces no's in words
            $tag->splice_content( $index,1,$child );
        }
    }
}
print $root->as_HTML();
--
Charles DeRykus
------------------------------
Date: Thu, 22 Aug 2013 22:57:46 -0700
From: Charles DeRykus <derykus@gmail.com>
Subject: Re: Variable length lookbehind not implemented
Message-Id: <kv6tm9$2dd$1@speranza.aioe.org>
On 8/22/2013 10:53 PM, Charles DeRykus wrote:
> ...
> foreach my $tag ($root->look_down(sub{1) ) {
^^^^^^^
foreach my $tag ( $root->look_down(sub{1}) ) {
--
Charles DeRykus
------------------------------
Date: Fri, 23 Aug 2013 00:31:32 -0700 (PDT)
From: fmassion@web.de
Subject: Re: Variable length lookbehind not implemented
Message-Id: <8ea2c5f9-1862-48ee-8eac-1098890027b7@googlegroups.com>
> Also, according to a quick test, using
>
> while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {
> works, too.
Yes it works, but unfortunately I get an error message about "uninitialized value $in"
My test strings (it's bullshit, just to test the expression). In practise I am using chunks of HTML/XML files, i.e. text which cannot be parsed because not all the required tags are in the text.
Test sentences:
2-side slitting 64 scrap box is full <S 64R> Please empty slitting 654 scrap box
Please 345 set Saddle stitcher 2-Side <S 65 R> slitting 1008 scrap box5
2-side slitting 64 scrap box is full <S 64R> Please empty slitting 654 scrap box
Result with "while ( $text =~ /\G ([^<]*?) (<.*?>) /sgx ) { "
[2]-side slitting [64] scrap box is full <S 64R> Please empty slitting [654] scrap box
Please [345] set Saddle stitcher [2]-Side <S 65 R> slitting [1008] scrap box[5]
[2]-side slitting [64] scrap box is full <S 64R>
Result with "while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {"
[2]-side slitting [64] scrap box is full <S 64R> Please empty slitting [654] scrap box
Please [345] set Saddle stitcher [2]-Side <S 65 R> slitting [1008] scrap box[5]
Use of uninitialized value $in in print at D:\Perl\test.pl line 18.
Use of uninitialized value $in in print at D:\Perl\test.pl line 18.
[2]-side slitting [64] scrap box is full <S 64R> Please empty slitting [654] scrap box
This is line 18: print $out, $in;
Thus all sentences have been processed as they should have, but the warning about an uninitialized value "$in" appears twice.
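A small sketch of where the warning may come from, assuming a shortened sample that does not end with a tag. It records what the optional group yields on each iteration. The @pairs array and the '(undef)' marker are only for illustration.

```perl
use strict;
use warnings;

my $text = "full <S 64R> Please empty";    # sample that does not end with a tag
my @pairs;

while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {
    # $2 is undef whenever the optional group took no part in the match
    push @pairs, [ $1, defined $2 ? $2 : '(undef)' ];
}

print "out='$_->[0]' in='$_->[1]'\n" for @pairs;
```

Every iteration in which no tag follows the free text leaves $2 undefined, so printing it unguarded triggers the warning. The $2 // '' form avoids that.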
------------------------------
Date: Fri, 23 Aug 2013 11:43:35 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Variable length lookbehind not implemented
Message-Id: <87k3jcvga0.fsf@sapphire.mobileactivedefense.com>
fmassion@web.de writes:
>> Also, according to a quick test, using
>>
>> while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {
>
>> works, too.
>
> Yes it works, but unfortunately I get an error message about
> "uninitialized value $in"
The easiest way to deal with spurious warnings is "don't enable them"
:->. perl does automatic type conversions whenever necessary but some
people are STRONGLY (!!!!) convinced that programmer convenience is a
surefire way to achieve disaster (why these people dabble in perl
instead of 'languages designed to be obnoxious', ie, C++ or Java,
escapes me ...).
Apart from that, there are various more-or-less ugly workarounds.
The
my ($out, $in) = ($1 // '', $2 // '')
would be one.
Some others
------
while ( $text =~ /\G ([^<]+)|(<.*?>) /sgx ) {
    if (defined $1) {    # $1 is set only when 'free text' matched
        my $out = $1;
        $out =~ s/(\d+)/[$1]/g;
        print $out;
    } else {
        print $2;
    }
}
------
This matches either a 'free text' sequence or a complete tag and
performs the substitution when the 'free text' match was successful.
------
while ( $text =~ /\G ([^<]+|<.*?>) /sgx ) {
    my $out = $1;
    $out =~ s/(\d+)/[$1]/g if $out !~ /^</;
    print $out;
}
-----
This is essentially the same except that the matched text always ends
up in $1 so the content of that needs to be examined in order to
determine which it was.
-----
for ($text) {
    /\G([^<]+)/gc && do {
        my $out = $1;
        $out =~ s/(\d+)/[$1]/g;
        print $out;
        redo;
    };
    /\G(<.*?>)/g && do {
        print $1;
        redo;
    };
}
----
This uses for to alias $text to $_. It then checks if either a 'free
text' sequence or a complete tag can be found at the current match
position and performs the correct action for each, followed by a
'redo' in order to restart the loop. If neither pattern matched, end
of the input has obviously been reached and the loop (sort of)
terminates.
NB: The first match needs an additional /c to avoid resetting the
match position if it fails. The second one doesn't because if it
fails, the loop will terminate, anyway.
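The effect of /c on the match position can be checked in isolation. This is a minimal sketch with a made-up sample string; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $text = "abc<b>";    # sample input for illustration

my $ok = $text =~ /\Gabc/g;          # succeeds, match position is now 3
my $after_match = pos($text);

my $miss_c = $text =~ /\G\d+/gc;     # fails, but /c keeps the position
my $with_c = pos($text);

my $miss = $text =~ /\G\d+/g;        # fails without /c: position is reset
my $without_c = pos($text);

print "after match: $after_match\n";
print "failed with /c: ", $with_c // 'undef', "\n";
print "failed without /c: ", $without_c // 'undef', "\n";
```

After the failed match without /c, pos() is undefined, so a following \G pattern would start again from the beginning of the string.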
------------------------------
Date: Fri, 23 Aug 2013 11:53:53 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Variable length lookbehind not implemented
Message-Id: <87fvu0vfsu.fsf@sapphire.mobileactivedefense.com>
Charles DeRykus <derykus@gmail.com> writes:
> On 8/22/2013 2:01 PM, Rainer Weikusat wrote:
>> Charles DeRykus <derykus@gmail.com> writes:
>>
>> [...]
>>
>>> while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
>>> my($out, $in) = ($1 // '', $2 // '');
>>
>> Also according to a quick test I made, a () which matched an empty
>> string (this includes 'optional' ()s which didn't match anything)
>> causes an empty string to be put into the corresponding $n which
>> implies that the $1 // '' is not even useful as workaround for
>> less-than-useful perl runtime warnings.
>>
>
> That's much better. (But, that's why I was careful to use the weasel
> words "quick" and "fragile" when responding :)
>
> And since the html's pedigree is unknown, an un-entified "<" causes
> problems for both:
>
> just a single un-entified < and any no. 1,2,... to \z vanish
Filters are ill-suited for syntax checking because they will produce
garbage output in case of errors.
BTW: Why <.*?> and not <.*>?
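One way to look at the question is to compare the two quantifiers on a line with more than one tag. This is a sketch on a made-up sample line, not a statement about the original poster's data.

```perl
use strict;
use warnings;

my $line = "a <b>bold</b> z";    # sample line for illustration

my ($lazy)   = $line =~ /(<.*?>)/;    # stops at the first '>'
my ($greedy) = $line =~ /(<.*>)/;     # runs on to the last '>'

print "lazy:   $lazy\n";      # <b>
print "greedy: $greedy\n";    # <b>bold</b>
```

With the greedy form, any text between the first and the last tag would be treated as part of a tag and left unprocessed.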
------------------------------
Date: Fri, 23 Aug 2013 13:31:30 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Variable length lookbehind not implemented
Message-Id: <87li3stwpp.fsf@sapphire.mobileactivedefense.com>
Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
[...]
> -----
> for ($text) {
> /\G([^<]+)/gc && do {
> my $out = $1;
> $out =~ s/(\d+)/[$1]/g;
> print $out;
> redo;
> };
>
> /\G(<.*?>)/g && do {
This should be
/\G(<.*?>)/gs
so that tags formatted like this
<
hippocampus
>
are also matched.
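A quick check of the difference, as a sketch using the multi-line tag from above without the \G and /g parts:

```perl
use strict;
use warnings;

my $tag = "<\nhippocampus\n>";

# Without /s, '.' does not match a newline, so the tag is not found.
my $plain  = ($tag =~ /(<.*?>)/)  ? "match" : "no match";
# With /s, '.' matches any character including newline.
my $with_s = ($tag =~ /(<.*?>)/s) ? "match" : "no match";

print "without /s: $plain\n";
print "with /s:    $with_s\n";
```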
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 4019
***************************************