[32944] in Perl-Users-Digest
Perl-Users Digest, Issue: 4220 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu May 22 18:09:17 2014
Date: Thu, 22 May 2014 15:09:03 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Thu, 22 May 2014 Volume: 11 Number: 4220
Today's topics:
Re: Help with an operator precedence (?) puzzle <rweikusat@mobileactivedefense.com>
Re: Help with an operator precedence (?) puzzle <rm-dash-bau-haus@dash.futureapps.de>
Re: regular expressions and matching delimeters <ben.usenet@bsb.me.uk>
Re: regular expressions and matching delimeters <PointedEars@web.de>
Re: regular expressions and matching delimeters <PointedEars@web.de>
Re: regular expressions and matching delimeters (hymie!)
Re: regular expressions and matching delimeters <*@eli.users.panix.com>
Re: regular expressions and matching delimeters <rweikusat@mobileactivedefense.com>
Re: regular expressions and matching delimeters <PointedEars@web.de>
Re: regular expressions and matching delimeters <rweikusat@mobileactivedefense.com>
Re: regular expressions and matching delimeters <PointedEars@web.de>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Thu, 22 May 2014 18:31:15 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Help with an operator precedence (?) puzzle
Message-Id: <871tvlaffg.fsf@sable.mobileactivedefense.com>
"G.B." <rm-dash-bau-haus@dash.futureapps.de> writes:
> On 21.05.14 19:53, Rainer Weikusat wrote:
>
> [Algol 68, sh]
>
>> I can repose the nonsensical rhetorical question for any number of other
>> groups of people without any relevant qualification but the answer
>> remains always: Every member of $random_group willing to learn how to do
>> it, after having done so.
>
> Then don't do that ;-) (Or rephrase them to be meaningful to
> the discussion.) My question was about empirical evidence of Perl
> programmers spending/wasting time on precedence puzzles (as, e.g.
> this thread, cf. Subject line), and those likely caused by comma.
There was no 'precedence puzzle' in the original posting, that was
basically (paraphrase)
$v = "text" and return if $something;
$v = "text: $value" and return if $something_else;
and the question was why the first line produced a compiler warning
a la 'found = in boolean context did you mean ==?' while the second
didn't.
The answer to that is two-fold:
1. 'and' is a logical operator supposed to test a complex condition by
evaluating its right-hand operand in case evaluating its left-hand
operand resulted in a 'true' value. This property is not really used in
this case because both left-hand operands always evaluate to a 'true'
value and that's what is supposed to happen. But while this is an
obvious assumption for a human, the compiler doesn't know this. What it
knows is that a construct of the form
<variable> = <constant>
appears in a boolean context, hence, it emits a warning based on the
assumption that somebody meant to test the value of $v and not set it
but used = instead of == by accident.
2. There's no warning generated for the second line because the
right-hand operand of the assignment is not constant.
The warning can be avoided by using an operator which provides the
desired sequencing but without the unused 'complex test' property, ie
$v = "text", return if $something;
There's no 'precedence puzzle' here, either, because only operators
(appearing within expressions) have a precedence and statement modifiers
are not operators (there cannot ever be a 'precedence question' for a
statement modifier because at most one statement modifier may be part of
a statement). That someone could construct an English sentence a la
Find yourself a girl and get married, return if you can't.
is of no relevance here because Perl is not English.
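The sequencing described above can be sketched as follows. This is a minimal illustration, not code from the original posting: the sub name describe and its three arguments are hypothetical. Under "use warnings", the variant with 'and' and a constant right-hand side is the one the thread reports as triggering the warning; the comma operator gives the same left-to-right sequencing without placing the assignment in a boolean context.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the paraphrased statements from the thread.
sub describe {
    my ($something, $something_else, $value) = @_;
    my $v = 'unset';

    # Comma operator: do the assignment, discard its value, then return.
    # The statement modifier applies to the whole expression.
    $v = "text", return $v if $something;
    $v = "text: $value", return $v if $something_else;

    # The form from the original posting would be written as
    #   $v = "text" and return $v if $something;
    # which puts a constant assignment into a boolean context.
    return $v;
}
```

Calling describe(1, 0, 'x') should yield "text", and describe(0, 1, 'x') should yield "text: x".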
------------------------------
Date: Thu, 22 May 2014 20:27:05 +0200
From: "G.B." <rm-dash-bau-haus@dash.futureapps.de>
Subject: Re: Help with an operator precedence (?) puzzle
Message-Id: <537e4177$0$6666$9b4e6d93@newsspool3.arcor-online.net>
On 22.05.14 19:31, Rainer Weikusat wrote:
> There was no 'precedence puzzle' in the original posting,
Literally, no, nominally, yes. (I used that as a label.)
> The answer to that is two-fold:
Recommended reading, I'll daringly say, as an example of how
to nail things down, technically (naming the parts of syntax
that need to be named (and known) to resolve misunderstandings).
[...]
> But while this is an
> obvious assumption for a human, the compiler doesn't know this.
> That someone could construct an English sentence a la
>
> Find yourself a girl and get married, return if you can't.
>
> is of no relevance here because Perl is not English.
Thanks for so splendidly joining the two issues! Now I only
need to add that programming is done by someone, not by
the compiler. And, we are back to step one.
Have a good rest!
------------------------------
Date: Thu, 22 May 2014 15:14:59 +0100
From: Ben Bacarisse <ben.usenet@bsb.me.uk>
Subject: Re: regular expressions and matching delimeters
Message-Id: <0.6beef200222969faa310.20140522151459BST.8738g1sxwc.fsf@bsb.me.uk>
Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
>> Eli the Bearded <*@eli.users.panix.com> writes:
>> <snip>
>>> Regular expressions can match based on "matching delimiters" but not
>>> on arbitrarily nested "matching delimiters". (What perl's "regular
>>> expressions" can do is more than "regular", but it is ill-advised to
>>> try to code for that madness.)
>>
>> Is it? Can you explain?
>>
>> I had a use-case to parse (and then interpret) a very simple lisp-like
>> language and I thought I'd give Perl's self-referential patterns a try.
>> It turned out to provide a very simple solution.
>
> The given case was somewhat different from that,
Yes, I was not advocating for it in this case. I thought the comment
about madness was general and suggested something I should know about
Perl's supra-regular expressions.
<snip>
> [...] there's exactly
> one (AFAIK) sensible quoting syntax on this planet, namely, the one used
> in HTML, which guarantees that 'special characters' don't appear
> literally inside quoted constructs and whose quoted strings
That's a good point.
> [...] can thus be
> analyzed by looking for the next ", but nobody uses that, likely
> because that would make too much sense.
It matters less in some contexts, which might explain the persistence of
"traditional" quoting in, say, programming languages.
<snip>
--
Ben.
------------------------------
Date: Thu, 22 May 2014 19:48:13 +0200
From: Thomas 'PointedEars' Lahn <PointedEars@web.de>
Subject: Re: regular expressions and matching delimeters
Message-Id: <3309339.NIsjcMIfUD@PointedEars.de>
Justin C wrote:
> [Absolutely nothing worth reading at all.]
I thought so, too.
--
PointedEars
Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.
------------------------------
Date: Thu, 22 May 2014 20:25:07 +0200
From: Thomas 'PointedEars' Lahn <PointedEars@web.de>
Subject: Re: regular expressions and matching delimeters
Message-Id: <1964333.Kbm4ZNqMNS@PointedEars.de>
Ben Bacarisse wrote:
> Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
>> [...] there's exactly one (AFAIK) sensible quoting syntax on this planet,
>> namely, the one used in HTML, which guarantees that 'special characters'
>> don't appear literally inside quoted constructs and whose quoted strings
>
> That's a good point.
How so? The JSON grammar is well-defined [1]; it is a subset of the
ECMAScript grammar. The regular expression for JSON string literals
therefore is rather simple and straightforward:
my $json_string = qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;
AISB, it is possible to parse such a language with regular expressions; it is
just not (reasonably) possible with only one application of one regular
expression. Indeed, efficient parsers do support and use regular
expressions in their *lexer*.
[1] <http://json.org/>
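The point about using regular expressions in a lexer, rather than in one single match, might look like the sketch below. It reuses the string pattern given above (with a non-capturing group) and applies several small patterns repeatedly with \G and /gc. The sub name tokenize, the token labels, and the simplified integer-only number pattern are assumptions made for illustration; this is not a complete JSON lexer.

```perl
use strict;
use warnings;

# String-literal pattern from the post, with a non-capturing group.
my $json_string = qr/"(?:[^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;

# Hypothetical lexer: split the input into [TYPE => text] pairs.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ((pos($text) // 0) < length $text) {
        # /gc keeps pos() where it was when a match fails,
        # so the next alternative is tried at the same offset.
        if    ($text =~ /\G\s+/gc)            { next }
        elsif ($text =~ /\G($json_string)/gc) { push @tokens, [STRING => $1] }
        elsif ($text =~ /\G(-?\d+)/gc)        { push @tokens, [NUMBER => $1] }
        elsif ($text =~ /\G([\[\]{}:,])/gc)   { push @tokens, [PUNCT  => $1] }
        else {
            die "unexpected input at offset " . (pos($text) // 0) . "\n";
        }
    }
    return @tokens;
}
```

A parser would then consume the token list and handle the nesting itself, which is the part a single regular expression cannot reasonably do.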
--
PointedEars
Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.
------------------------------
Date: 22 May 2014 18:33:46 GMT
From: hymie@lactose.homelinux.net (hymie!)
Subject: Re: regular expressions and matching delimeters
Message-Id: <537e430a$1$58133$862e30e2@ngroups.net>
In our last episode, the evil Dr. Lacto had captured our hero,
Rainer Weikusat <rweikusat@mobileactivedefense.com>, who said:
>hymie@lactose.homelinux.net (hymie!) writes:
>> var list = [{"item":1,"tags":["tag1","tag2"],"day":"Friday",
>> "people":[{"name":"Joe","id":"1"},{"name":"Larry","id":"2"}],
>> "loc":"Room 100"}, {"item":2,"tags":["tag2","tag3"],"day":"Friday",
>> "people":[{"name":"Joe","id":"1"},{"name":"Tom","id":"3"}],
>> "loc":"Room 101"}];
>This looks very suspiciously like JSON ('Javascript Object
>Notation'). Unsurprisingly, there's a module for dealing with that (the
>first one I found),
Thanks for the tip.
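For readers following along, decoding the quoted data with a JSON module might look like the sketch below. The module named in the earlier post is not shown in this digest, so JSON::PP (bundled with perl since 5.14) is an assumption here, as is the way the "var list = ...;" wrapper is stripped.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# The JavaScript snippet quoted above.
my $js = <<'END';
var list = [{"item":1,"tags":["tag1","tag2"],"day":"Friday",
"people":[{"name":"Joe","id":"1"},{"name":"Larry","id":"2"}],
"loc":"Room 100"}, {"item":2,"tags":["tag2","tag3"],"day":"Friday",
"people":[{"name":"Joe","id":"1"},{"name":"Tom","id":"3"}],
"loc":"Room 101"}];
END

# Strip the JavaScript assignment and the trailing semicolon,
# leaving only the JSON array.
(my $json = $js) =~ s/^\s*var\s+\w+\s*=\s*//;
$json =~ s/;\s*$//;

# $list is now a reference to an array of hashes.
my $list = decode_json($json);
```

Individual values are then ordinary Perl data, for example $list->[0]{people}[1]{name}.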
--hymie! http://lactose.homelinux.net/~hymie hymie@lactose.homelinux.net
------------------------------
Date: Thu, 22 May 2014 18:39:58 +0000 (UTC)
From: Eli the Bearded <*@eli.users.panix.com>
Subject: Re: regular expressions and matching delimeters
Message-Id: <eli$1405221431@qz.little-neck.ny.us>
In comp.lang.perl.misc, Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
> Eli the Bearded <*@eli.users.panix.com> writes:
> > Regular expressions can match based on "matching delimiters" but not
> > on arbitrarily nested "matching delimiters". (What perl's "regular
> > expressions" can do is more than "regular", but it is ill-advised to
> > try to code for that madness.)
> Is it? Can you explain?
Two answers from two FAQs, first the older, then the newer.
:r! perldoc-5.8.8 -q balanced
Found in /usr/local/lib/perl5/5.8.8/pod/perlfaq6.pod
Can I use Perl regular expressions to match balanced text?
Historically, Perl regular expressions were not capable of matching
balanced text. As of more recent versions of perl including 5.6.1
experimental features have been added that make it possible to do this.
Look at the documentation for the (??{ }) construct in recent perlre
manual pages to see an example of matching balanced parentheses. Be
sure to take special notice of the warnings present in the manual
before making use of this feature.
CPAN contains many modules that can be useful for matching text
depending on the context. Damian Conway provides some useful patterns
in Regexp::Common. The module Text::Balanced provides a general
solution to this problem.
One of the common applications of balanced text matching is working
with XML and HTML. There are many modules available that support these
needs. Two examples are HTML::Parser and XML::Parser. There are many
others.
An elaborate subroutine (for 7‐bit ASCII only) to pull out balanced and
possibly nested single chars, like ` and ', { and }, or ( and ) can be
found in http://www.cpan.org/authors/id/TOMC/scripts/pull_quotes.gz .
The C::Scan module from CPAN also contains such subs for internal use,
but they are undocumented.
:r! perldoc-5.14.2 -q balanced
Found in /usr/local/lib/perl5/5.14.2/pod/perlfaq6.pod
Can I use Perl regular expressions to match balanced text?
(contributed by brian d foy)
Your first try should probably be the Text::Balanced module, which is
in the Perl standard library since Perl 5.8. It has a variety of
functions to deal with tricky text. The Regexp::Common module can also
help by providing canned patterns you can use.
As of Perl 5.10, you can match balanced text with regular expressions
using recursive patterns. Before Perl 5.10, you had to resort to
various tricks such as using Perl code in (??{}) sequences.
Here’s an example using a recursive regular expression. The goal is to
capture all of the text within angle brackets, including the text in
nested angle brackets. This sample text has two "major" groups: a group
with one level of nesting and a group with two levels of nesting. There
are five total groups in angle brackets:
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that’s it.
The regular expression to match the balanced text uses two new (to Perl
5.10) regular expression features. These are covered in the perlre
manpage and this example is a modified version of one in that
documentation.
First, adding the new possessive + to any quantifier finds the longest
match and does not backtrack. That’s important since you want to handle
any angle brackets through the recursion, not backtracking. The group
[^<>]++ finds one or more non‐angle brackets without backtracking.
Second, the new (?PARNO) refers to the sub‐pattern in the particular
capture group given by PARNO. In the following regex, the first capture
group finds (and remembers) the balanced text, and you need that same
pattern within the first buffer to get past the nested text. That’s the
recursive part. The (?1) uses the pattern in the outer capture group as
an independent part of the regex.
Putting it all together, you have:
#!/usr/local/bin/perl5.10.0
my $string =<<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that’s it.
HERE
my @groups = $string =~ m/
        (                   # start of capture group 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, non backtracking
                  |
                (?1)        # found < or >, so recurse to capture group 1
            )*
        >                   # match a closing angle bracket
        )                   # end of capture group 1
        /xg;
$" = "\n\t";
print "Found:\n\t@groups\n";
The output shows that Perl found the two major groups:
Found:
<brackets in <nested brackets> >
<another group <nested once <nested twice> > >
With a little extra work, you can get all of the groups in angle
brackets even if they are in other angle brackets too. Each time you
get a balanced match, remove its outer delimiter (that’s the one you
just matched so don’t match it again) and add it to a queue of strings
to process. Keep doing that until you get no matches:
#!/usr/local/bin/perl5.10.0
my @queue =<<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that’s it.
HERE
my $regex = qr/
        (                   # start of bracket 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, non backtracking
                  |
                (?1)        # recurse to bracket 1
            )*
        >                   # match a closing angle bracket
        )                   # end of bracket 1
        /x;
$" = "\n\t";
while( @queue )
{
my $string = shift @queue;
my @groups = $string =~ m/$regex/g;
print "Found:\n\t@groups\n\n" if @groups;
unshift @queue, map { s/^<//; s/>$//; $_ } @groups;
}
The output shows all of the groups. The outermost matches show up first
and the nested matches show up later:
Found:
<brackets in <nested brackets> >
<another group <nested once <nested twice> > >
Found:
<nested brackets>
Found:
<nested once <nested twice> >
Found:
<nested twice>
Elijah
------
would generally prefer using regexes to match smaller bits and looping
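The preference stated in the signature, matching smaller bits with a regex and looping, could be sketched like this for the FAQ's angle-bracket example. The sub name top_level_groups is hypothetical, and the choice to silently ignore a stray closing bracket is an assumption of this sketch, not something the FAQ specifies.

```perl
use strict;
use warnings;

# Return every top-level <...> group in $text. The regex only finds
# single brackets; the nesting depth is tracked by hand in the loop.
sub top_level_groups {
    my ($text) = @_;
    my @groups;
    my $depth = 0;
    my $start;
    while ($text =~ /([<>])/g) {
        if ($1 eq '<') {
            # Remember where an outermost group begins.
            $start = pos($text) - 1 if $depth == 0;
            $depth++;
        }
        elsif ($depth > 0) {
            $depth--;
            # Back at depth 0: the outermost group is complete.
            push @groups, substr($text, $start, pos($text) - $start)
                if $depth == 0;
        }
        # A '>' seen at depth 0 is unbalanced and is skipped here.
    }
    return @groups;
}
```

On the FAQ's sample text this should return the same two major groups as the recursive pattern, without relying on the 5.10 regex features.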
------------------------------
Date: Thu, 22 May 2014 19:48:22 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: regular expressions and matching delimeters
Message-Id: <87wqdd8xah.fsf@sable.mobileactivedefense.com>
Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
> Ben Bacarisse wrote:
>
>> Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
>>> [...] there's exactly one (AFAIK) sensible quoting syntax on this planet,
>>> namely, the one used in HTML, which guarantees that 'special characters'
>>> don't appear literally inside quoted constructs and whose quoted strings
>>
>> That's a good point.
>
> How so? The JSON grammar is well-defined [1]; it is a subset of the
> ECMAScript grammar. The regular expression for JSON string literals
> therefore is rather simple and straightforward:
>
> my $json_string = qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;
In case \-escaping hadn't been used for quoting the delimiter, this could be
reduced to
$json_string = qr/"[^"]*"/
if the purpose was just to analyze Javascript 'object literals'.
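The HTML-style scheme referred to here can be sketched as follows: if the delimiter is always written as an entity inside the string, the reduced pattern is enough to find string boundaries. The sub names quote_attr and unquote_attr are hypothetical, and only the two entities needed for the argument are handled.

```perl
use strict;
use warnings;

# Quote a string so that no literal '"' appears between the delimiters.
sub quote_attr {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;     # first, so the entities added next stay intact
    $s =~ s/"/&quot;/g;
    return qq{"$s"};
}

# Reverse the transformation; the boundary check needs only [^"]*.
sub unquote_attr {
    my ($q) = @_;
    my ($s) = $q =~ /\A"([^"]*)"\z/ or die "not a quoted string\n";
    $s =~ s/&quot;/"/g;
    $s =~ s/&amp;/&/g;     # last, so '&amp;quot;' decodes to '&quot;'
    return $s;
}
```

With strings produced this way, a scan such as /"[^"]*"/g finds each quoted string without having to understand any escape sequences.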
------------------------------
Date: Thu, 22 May 2014 20:58:33 +0200
From: Thomas 'PointedEars' Lahn <PointedEars@web.de>
Subject: Re: regular expressions and matching delimeters
Message-Id: <10018990.tzcN7dygDl@PointedEars.de>
Rainer Weikusat wrote:
> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>> Ben Bacarisse wrote:
>>> Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
>>>> [...] there's exactly one (AFAIK) sensible quoting syntax on this
>>>> planet,
>>>> namely, the one used in HTML, which guarantees that 'special
>>>> characters' don't appear literally inside quoted constructs and whose
>>>> quoted strings
>>>
>>> That's a good point.
>>
>> How so? The JSON grammar is well-defined [1]; it is a subset of the
>> ECMAScript grammar. The regular expression for JSON string literals
>> therefore is rather simple and straightforward:
>>
>> my $json_string =
>> qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;
>
> In case \-escaping hadn't been used for quoting the delimiter, this could
> be reduced to
>
> $json_string = qr/"[^"]*"/
>
> if the purpose was just to analyze Javascript 'object literals'.
Your point being? Even Perl recognizes the need for escape sequences like
\" in string literals. You fail to realize that HTML’s way of "escaping"
has a drawback, too: “&”, and the frequent syntax error of “unrecognized
entity reference” (and the requirement of an error correction in parsers to
cope with that) when the author did not intend an entity reference in the
first place. There is nothing sane about this way either, it is just a
different one.
--
PointedEars
Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.
------------------------------
Date: Thu, 22 May 2014 20:52:52 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: regular expressions and matching delimeters
Message-Id: <87mwe98uaz.fsf@sable.mobileactivedefense.com>
Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
> Rainer Weikusat wrote:
>> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>>> Ben Bacarisse wrote:
>>>> Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
>>>>> [...] there's exactly one (AFAIK) sensible quoting syntax on this
>>>>> planet,
>>>>> namely, the one used in HTML, which guarantees that 'special
>>>>> characters' don't appear literally inside quoted constructs and whose
>>>>> quoted strings
>>>>
>>>> That's a good point.
>>>
>>> How so? The JSON grammar is well-defined [1]; it is a subset of the
>>> ECMAScript grammar. The regular expression for JSON string literals
>>> therefore is rather simple and straightforward:
>>>
>>> my $json_string =
>>> qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;
>>
>> In case \-escaping hadn't been used for quoting the delimiter, this could
>> be reduced to
>>
>> $json_string = qr/"[^"]*"/
>>
>> if the purpose was just to analyze Javascript 'object literals'.
>
> Your point being?
That should be easy to gather from the text I wrote on this so far.
------------------------------
Date: Thu, 22 May 2014 22:21:54 +0200
From: Thomas 'PointedEars' Lahn <PointedEars@web.de>
Subject: Re: regular expressions and matching delimeters
Message-Id: <1418388.6FbSnrN0ay@PointedEars.de>
Rainer Weikusat wrote:
> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>> Rainer Weikusat wrote:
>>> Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
>>>> Ben Bacarisse wrote:
>>>>> Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
>>>>>> [...] there's exactly one (AFAIK) sensible quoting syntax on this
>>>>>> planet, namely, the one used in HTML, which guarantees that 'special
>>>>>> characters' don't appear literally inside quoted constructs and whose
>>>>>> quoted strings
>>>>>
>>>>> That's a good point.
>>>>
>>>> How so? The JSON grammar is well-defined [1]; it is a subset of the
>>>> ECMAScript grammar. The regular expression for JSON string literals
>>>> therefore is rather simple and straightforward:
>>>>
>>>> my $json_string =
>>>> qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;
>>>
>>> In case \-escaping hadn't been used for quoting the delimiter, this
>>> could be reduced to
>>>
>>> $json_string = qr/"[^"]*"/
>>>
>>> if the purpose was just to analyze Javascript 'object literals'.
>>
>> Your point being?
>
> That should be easy to gather from the text I wrote on this so far.
But it is not easy because you are actually not making a point. You have
only provided a not very convincing argument for your humble opinion.
Programming languages are different from markup languages, and so are their
escape mechanisms. I have explained to you why the HTML way is not “[the]
one sensible quoting on this planet”, why it is _not_ better than the
ECMAScript/Perl way /per se/; it is just – in your words – a different form
of senselessness.
If in your formal language string values must be delimited by a
non-whitespace character (YAML e.g. is different), you have only one out of
three choices:
One, not to allow delimiters within the delimited string at all, thereby
severely limiting the string values that can be expressed in your language.
Two, to allow for delimiters within the delimited string to be escaped in an
escape sequence that contains the delimiter (simplest case: preceded by
another character, say backslash) if they should lose their special meaning.
Three, to provide an escape sequence for the delimiter that does not contain
the delimiter. HTML and XML implement this one with the entity reference
“&…;” (whereas the trailing “;” has been made optional in HTML).
Now, the problems with quoting by entity reference are just not as obvious
as with quoting by prefix character. Here is an example to make it obvious
to you, hopefully:
<a href="/?foo=bar&baz=bla">…</a>
is a *syntax error* in HTML because “&baz” is an “unknown entity reference”.
But the author did not intend an entity reference in the first place, they
just wanted to delimit parts of the query-part of the URI-reference with
“&”. They can work around this issue if they are aware of the error (for
example, through <http://validator.w3.org/>):
<a href="/?foo=bar&amp;baz=bla">…</a>
But if they are not, parsers have to work around the problem; they
have to check against a table of entities in order to determine that
the syntactic entity reference could not reasonably have been intended as
one. And as HTML parsers in particular are built for backwards
compatibility and robustness, they do just that, so the seemingly simpler
approach of not allowing delimiters within the escape sequence quickly
becomes more complicated to parse than most people realize.
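The table lookup described above might be sketched like this. The four-entry table and the sub name decode_lenient are assumptions for illustration; the rules real HTML parsers follow are considerably more involved than this.

```perl
use strict;
use warnings;

# A deliberately tiny, hypothetical entity table.
my %entity = (amp => '&', quot => '"', lt => '<', gt => '>');

# Decode only references found in the table; anything else that merely
# looks like an entity reference (such as "&baz") is passed through as is.
sub decode_lenient {
    my ($s) = @_;
    $s =~ s/(&([A-Za-z][A-Za-z0-9]*);?)/
        exists $entity{$2} ? $entity{$2} : $1
    /gex;
    return $s;
}
```

Both spellings of the example URI then decode to the same value, which is the extra work the parser takes on so that the author's unescaped "&" still works.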
> BTW: Replying is pointless.
Wanting to ignore reality is your problem, not mine.
--
PointedEars
Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 4220
***************************************