[31627] in Perl-Users-Digest


Perl-Users Digest, Issue: 2886 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Mar 25 06:09:19 2010

Date: Thu, 25 Mar 2010 03:09:03 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 25 Mar 2010     Volume: 11 Number: 2886

Today's topics:
        call external script - fork <ron.eggler@gmail.com>
    Re: call external script - fork <ben@morrow.me.uk>
    Re: call external script - fork <ron.eggler@gmail.com>
    Re: Does $^N only refer to capturing groups? sln@netherlands.com
    Re: Perl HTML searching <KBfoMe@realdomain.net>
    Re: Perl HTML searching <tadmc@seesig.invalid>
    Re: Perl HTML searching <ben@morrow.me.uk>
    Re: Perl HTML searching <ben@morrow.me.uk>
    Re: Perl HTML searching <jurgenex@hotmail.com>
    Re: Perl HTML searching sln@netherlands.com
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Wed, 24 Mar 2010 15:46:09 -0700 (PDT)
From: cerr <ron.eggler@gmail.com>
Subject: call external script - fork
Message-Id: <59fded72-7966-4ecf-a9e9-9361756e6f91@k4g2000prh.googlegroups.com>

Hi There,

I would like to start one of my Perl scripts from the one I'm
running; it should run as its own parallel process and be totally
decoupled from the currently running script. I experimented with
system("/my/other/script.pl ARG1 ARG2"); and exec("/my/other/script.pl
ARG1 ARG2"); with and without & at the end, but nothing quite did what
I imagined. What am I doing wrong?

Thanks,
--
roN


------------------------------

Date: Wed, 24 Mar 2010 23:17:16 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: call external script - fork
Message-Id: <s4jq77-lg3.ln1@osiris.mauzo.dyndns.org>


Quoth cerr <ron.eggler@gmail.com>:
> 
> I would like to start one of my perl scripts out of the one i'm
> running and it should have it's parallel process and be totally
> decoupled from the currently running script. I tried around with
> system("/my/other/script.pl ARG1 ARG2"); and exec("/my/other/script.pl
> ARG1 ARG2"); with and without & at the back but nothing quite did it
> for me the way i imagined that. What am i doing wrongly?

What have you tried exactly, and how is it not doing what you want?

system("... &") will allow the command you run to continue after your
main process exits, and the main process will not be notified when it
exits; it is still partially connected to the original process, though,
since it's still in the same process group and still in the same
session. Both of these can be fixed, if necessary, but you need to be
sure they are what is causing your problem.
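
A minimal sketch of how both could be fixed, assuming a POSIX system.
The helper name spawn_detached is hypothetical, not something from this
thread; it uses the usual double fork with setsid() so that the command
ends up in its own session and process group, and the caller never has
to wait for it.

```perl
use strict;
use warnings;
use POSIX qw(setsid);

# Hypothetical helper: run @cmd fully detached from the calling script.
sub spawn_detached {
    my @cmd = @_;

    defined(my $pid = fork()) or die "fork failed: $!";
    if ($pid) {
        waitpid($pid, 0);    # reap the short-lived intermediate child
        return;
    }

    # Intermediate child: start a new session (and process group).
    my $sid = setsid();
    POSIX::_exit(1) unless defined $sid && $sid != -1;

    # Fork once more so the command itself is not a session leader.
    defined(my $pid2 = fork()) or POSIX::_exit(1);
    POSIX::_exit(0) if $pid2;

    # Grandchild: drop the inherited terminal handles, then run the command.
    open STDIN,  '<', '/dev/null' or POSIX::_exit(1);
    open STDOUT, '>', '/dev/null' or POSIX::_exit(1);
    open STDERR, '>', '/dev/null' or POSIX::_exit(1);
    exec { $cmd[0] } @cmd or POSIX::_exit(127);
}

# Usage, with the path from the original question:
#   spawn_detached('/my/other/script.pl', 'ARG1', 'ARG2');
```

The caller returns as soon as the intermediate child has exited.
Pointing the standard handles at /dev/null is a design choice in this
sketch, so the command keeps no reference to the caller's terminal.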

Ben



------------------------------

Date: Wed, 24 Mar 2010 17:08:45 -0700 (PDT)
From: cerr <ron.eggler@gmail.com>
Subject: Re: call external script - fork
Message-Id: <38eea55f-015a-4d09-95c4-bd3e240061fb@f13g2000pra.googlegroups.com>

On Mar 24, 4:17 pm, Ben Morrow <b...@morrow.me.uk> wrote:
> Quoth cerr <ron.egg...@gmail.com>:
>
>
>
> > I would like to start one of my perl scripts out of the one i'm
> > running and it should have it's parallel process and be totally
> > decoupled from the currently running script. I tried around with
> > system("/my/other/script.pl ARG1 ARG2"); and exec("/my/other/script.pl
> > ARG1 ARG2"); with and without & at the back but nothing quite did it
> > for me the way i imagined that. What am i doing wrongly?
>
> What have you tried exactly, and how is it not doing what you want?
>
> system("... &") will allow the command you run to continue after your
> main process exits, and the main process will not be notified when it
> exits; it is still partially connected to the original process, though,
> since it's still in the same process group and still in the same
> session. Both of these can be fixed, if necessary, but you need to be
> sure they are what is causing your problem.

Yeah, I got it working properly now with system("...&"); It was in
fact just confusion on my side, so never mind, but thank you anyway
for trying to help! :)

--
roN



------------------------------

Date: Wed, 24 Mar 2010 19:05:45 -0700
From: sln@netherlands.com
Subject: Re: Does $^N only refer to capturing groups?
Message-Id: <ioflq5di6hjt3boi03gp0uvnd4tdt7sv65@4ax.com>

On Wed, 24 Mar 2010 11:46:05 -0400, Shmuel (Seymour J.) Metz <spamtrap@library.lspace.org.invalid> wrote:

>In the Perl 5.10 documentation of $^N, does "group" refer only to
>capturing groups? The context is that I'd like to write something like
>
> qr/(?:\d\d)
>    (?({$^N > 24})
        ^
     (?(?{$^N > 24})
where (? ..) is a conditional,
      (?{..}) is code

>      (*FAIL)
>    )
>   /x
>
>and have no need to capture the \d\d other than the range check.
>
>Thanks.

You do need a capture group. Putting that aside though,
if you did use a capture group,

my $rx =  qr/(\d\d)
    (?(?{$^N > 24})
      (*FAIL)
    )
   /x;

if ("29921" =~ /$rx/) {
   print "found $1\n";
}

then '29921' would pass with '21' found.
That's because you didn't qualify \d\d with anything else.
It's better to surround it with an assertion or something else.
  \b(\d+)\b
And the \d+ is there because '000044' is greater than 24, but its
first two digits '00' aren't.


There is nothing wrong with a capture group; it just depends on what
you are doing with it. But (*FAIL) will cause the regex to fail.
Another thing that causes it to fail is if it matches (\d+), passes
the conditional, but doesn't match anything else.

It's doubtful you could use $rx in another regular expression for
the express purpose of checking whether it will fail without capturing,
so why not just capture it?

So, the 3 states it could be:

OK found '019' <= 24
Error found '00201' > 24
Did not find any number digits

-sln
----------------
use strict;
use warnings;

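# $val records the last number tried; $err is set when it exceeded 24.
my ($val, $err);
# \b (\d+) \b takes a whole run of digits; the conditional then forces
# a failure with (*FAIL) whenever the code block returns true.
my $rx =  qr/ \b (\d+) \b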
    (?(?{
            $val = $^N;
            $err = 1 if $^N > 24; 
        })
        (*FAIL)
    )
   /x;

for ('019', '00201', 'foobar')
{
    ($val, $err) = ('', 0);
    if ( /$rx/ ) {
        print "OK found '$val' <= 24\n";
    }
    elsif ($err) {
        print "Error found '$val' > 24\n";
    }
    else {
        print "Did not find any number digits\n";
    }
}
__END__
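
For comparison, a minimal sketch of the same three outcomes with the
range check done in ordinary Perl after a plain capture. This swaps the
(?{ }) conditional for a check outside the regex, so it is not the
technique above, and classify is a hypothetical helper name. One
difference to be aware of: it judges only the first whole number in the
string, whereas the (*FAIL) version keeps looking for a later number
that passes.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a string by its first whole number.
sub classify {
    my ($str) = @_;
    return "Did not find any number digits" unless $str =~ /\b(\d+)\b/;
    my $val = $1;
    return $val > 24
        ? "Error found '$val' > 24"
        : "OK found '$val' <= 24";
}

print classify($_), "\n" for '019', '00201', 'foobar';
```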



------------------------------

Date: Wed, 24 Mar 2010 17:55:29 -0500
From: "Kyle T. Jones" <KBfoMe@realdomain.net>
Subject: Re: Perl HTML searching
Message-Id: <hoe593$ss3$1@news.eternal-september.org>

Jürgen Exner wrote:
> "Kyle T. Jones" <KBfoMe@realdomain.net> wrote:
>> Tad McClellan wrote:
>>> Kyle T. Jones <KBfoMe@realdomain.net> wrote:
>>>> Steve wrote:
>>>>> like lets say I searched a site
>>>>> that had 15 news links and 3 of them said "Hello" in the title.  I
>>>>> would want to extract only the links that said hello in the title.
>>>> Read up on perl regular expressions.
>>>
>>> While reading up on regular expressions is certainly a good idea,
>>> it is a horrid idea for the purposes of parsing HTML.
>>>
>> Ummm.  Could you expand on that?
>>
>> My initial reaction would be something like - I'm pretty sure *any* 
>> method, including the use of HTML::LinkExtor, or XML transform (both 
>> outlined upthread), involves using regular expressions "for the purposes 
>> of parsing HTML".
> 
> Regular expressions recognize regular languages. But HTML is a
> context-free language and therefore cannot be recognized solely by a
> regular parser. 
> Having said that Perl's extended regular expressions are indeed more
> powerful than regular, but still it is a bad idea because the
> expressions are becoming way to complex. 
> 
>> At best, you're just abstracting the regex work back to the includes. 
>> AFAIK, and feel free to correct me (I'll go take a look at some of the 
>> relevant module code in a bit), every CPAN module that is involved with 
>> parsing HTML uses fairly straightforward regex matching somewhere within 
>> that module's methods.
> 
> Using REs to do _part_ of the work of parsing any language is a
> no-brainer, of course everyone does it e.g. in the tokenizer. 
> 
> But unless your language is a regular language (and there aren't many
> useful regular languages because regular is just too restrictive) you
> need additional algorithms that cannot be expressed as REs to actually
> parse a context-free or context-sensitive language.
> 
>> I think there's an argument that, considering you can do this so easily 
>> (in under 15 lines of code) without the overhead of unnecessary 
>> includes, my way would be more efficient.  We can run some benchmarks if 
>> you want (see further down for working code).
> 
> But you cannot! Ever heard of the Chomsky Hierarchy? No recollection of
> Theory of Computer Languages or Basics of Compiler Construction?
> What do people learn in Computer Science today?
>  
> jue

But isn't the Chomsky Hierarchy completely irrelevant in this (forgive 
the pun) context?  Surely you "get" that my input is analyzed as 
nothing more or less than a sequence of characters. That it was 
originally written in HTML, or any other CFG-based language, is 
meaningless: both the syntactic and the semantic considerations of that 
original language are irrelevant in the (again, forgive me) context of 
what I'm attempting, which is simply to match one finite sequence of 
characters against another finite sequence of characters. I couldn't 
care less what those characters mean, what href indicates, what a <body> 
tag is, etc.

I don't need to understand English to count the # of e's in the above 
passage, right?  Neither does Perl.

I believe what you say above is true - to truly "parse" the page AS HTML 
is beyond the ability of REs - but I'm not parsing anything AS HTML, if 
that makes sense.  In fact, to take that a step further, I'm not 
"parsing" at all, so perhaps it was a mistake for me to use that term. 
I meant it colloquially; sorry if that caused any confusion.

Cheers.


"	'Regular expressions' [...] are only marginally related to real 
regular expressions. Nevertheless, the term has grown with the 
capabilities of our pattern matching engines, so I'm not going to try to 
fight linguistic necessity here. I will, however, generally call them 
"regexes" (or "regexen", when I'm in an Anglo-Saxon mood)" - Larry Wall


------------------------------

Date: Wed, 24 Mar 2010 18:10:33 -0500
From: Tad McClellan <tadmc@seesig.invalid>
Subject: Re: Perl HTML searching
Message-Id: <slrnhql6no.u8r.tadmc@tadbox.sbcglobal.net>

Kyle T. Jones <KBfoMe@realdomain.net> wrote:
> Tad McClellan wrote:
>> Kyle T. Jones <KBfoMe@realdomain.net> wrote:
>>> Steve wrote:
>> 
>>>> like lets say I searched a site
>>>> that had 15 news links and 3 of them said "Hello" in the title.  I
>>>> would want to extract only the links that said hello in the title.
>>> Read up on perl regular expressions.
>> 
>> 
>> While reading up on regular expressions is certainly a good idea,
>> it is a horrid idea for the purposes of parsing HTML.
>> 
>
> Ummm.  Could you expand on that?


I think the FAQ answer does a pretty good job of it.


> My initial reaction would be something like - I'm pretty sure *any* 
> method, including the use of HTML::LinkExtor, or XML transform (both 
> outlined upthread), involves using regular expressions "for the purposes 
> of parsing HTML".


"pattern matching" is not at all the same as "parsing".

Regular expressions are *great* for pattern matching.

It is mathematically impossible to do a proper parse of a context-free
language such as HTML with nothing more than regular expressions.

They do not contain the requisite power.

Google for the "Chomsky hierarchy".

HTML allows a table within a table within a table within a table,
to an arbitrary depth, i.e. it is not "regular".


> I think there's an argument that, considering you can do this so easily 
> (in under 15 lines of code) without the overhead of unnecessary 
> includes, my way would be more efficient.


Do you want easy and wrong or hard and correct?


> you want (see further down for working code).


You have a strange definition of "working"...


>> Have you read the FAQ answers that mention HTML?
>> 
>>     perldoc -q HTML


Did you try that yet?

It points out at least one way that your code below can fail.


> I think this works fine:


You just haven't used a data set that exposes its flaws.

You are not "parsing", you are "pattern matching".

"pattern matching" is often "good enough", but you should realize
its fragility so that you can assess whether it is worth the ease
of implementation or not.


> #!/usr/bin/perl -w
                  ^^
                  ^^
> use strict;
> use warnings;
      ^^^^^^^^


Turning on warnings 2 times is kind of silly...

Lose the command line switch, lexical warnings are much better.


Try it with this:

-------------------
my $contents = '
<html><body>
<!--
    this is NOT a link...
    <a href="google.com">Google</a>
-->
</body></html>
';
-------------------


It will make output when it should make none.


> my @semiparsed=split(/href/i, $contents);
>
> foreach(@semiparsed){
> 	if($_=~/^\s*=\s*('|")(.*?)('|")/){


Gak!

Whitespace is not a scarce resource, feel free to use as much of it
as you like to make your code easier to read and understand.

Character classes are much more efficient than alternation.

Either be explicit in both places: 

    foreach $_ (
       if ( $_ =~ /...

or in neither:

    foreach (
       if ( /...

be consistent.

So, let's rewrite that line as an experienced Perl programmer might:

    if ( /^\s*=\s*['"](.*?)['"]/ ) { # now link will be in $1 instead of $2
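
A minimal sketch that reproduces the false positive, assuming the
split-on-href loop quoted above simply collects whatever the rewritten
pattern captures. The sub name naive_links is hypothetical; the test
data is the commented-out link from earlier in this message.

```perl
use strict;
use warnings;

my $contents = '
<html><body>
<!--
    this is NOT a link...
    <a href="google.com">Google</a>
-->
</body></html>
';

# Hypothetical wrapper around the split-on-href approach from this thread.
sub naive_links {
    my ($html) = @_;
    my @found;
    for ( split /href/i, $html ) {
        # Each chunk after the first starts right behind an "href".
        push @found, $1 if /^\s*=\s*['"](.*?)['"]/;
    }
    return @found;
}

# Prints "google.com" even though the link is inside a comment.
print "$_\n" for naive_links($contents);
```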


Also, your code does not address the OP's question.

It tests the URL for a string rather than testing the <a> tag's _contents_.

That is, he wanted to test 

    <a href="...">...</a>
                  ^^^
                  ^^^ here

rather than

    <a href="...">...</a>
             ^^^
             ^^^

-- 
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.


------------------------------

Date: Wed, 24 Mar 2010 23:21:41 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Perl HTML searching
Message-Id: <5djq77-lg3.ln1@osiris.mauzo.dyndns.org>


Quoth "Kyle T. Jones" <KBfoMe@realdomain.net>:
> 
> But isn't the Chomsky Hierarchy completely irrelevant in this (forgive 
> the pun) context?  Surely you "get" that my input is analyzed in terms 
> of being nothing more or less than a sequence of characters - that it 
> was originally written in HTML, or any other CFG-based language, is 
> meaningless - both syntactical and semantical considerations of that 
> original language are irrelevant in the (again, forgive me) context of 
> what I'm attempting - which is simply to match one finite sequence of 
> characters against another finite sequence of characters - I could care 
> less what those characters mean, what href indicates, what a <body> tag 
> is, etc.

This is correct, and treating HTML (or whatever) as plain text for the
purposes of grabbing something you want can be a valuable technique.
It's worth being aware that it's basically a hack, though, and that a
problem like 'find all the links in this document' is much better solved
by parsing the HTML properly than by trying to construct a regex to
match all possible forms of <a> tag.

Ben



------------------------------

Date: Wed, 24 Mar 2010 23:40:52 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Perl HTML searching
Message-Id: <4hkq77-l74.ln1@osiris.mauzo.dyndns.org>


Quoth Tad McClellan <tadmc@seesig.invalid>:
> 
> "pattern matching" is not at all the same as "parsing".
> 
> Regular expressions are *great* for pattern matching.
> 
> It is mathematically impossible to do a proper parse of a context-free
> lanuguage such as HTML with nothing more than regular expressions.
> 
> They do not contain the requisite power.
> 
> Google for the "Chomsky hierarchy".
> 
> HTML allows a table within a table within a table within a table,
> to an arbitrary depth. ie. it is not "regular".

Perl's regexen are not regular. With the new features in 5.10 it's easy
to match something like that (it was possible before with (??{}), but
not easy):

    perl -E'"[[][[][]]]" =~ m!(?<nest> \[ (?&nest)* \] )!x 
        and say $+{nest}'
    [[][[][]]]
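
    The same recursion in script form, as a minimal sketch: anchoring
    the pattern with \A and \z (an addition here, not part of the
    one-liner) turns it into a yes/no test for balanced brackets. It
    assumes Perl 5.10 or later; $balanced is a hypothetical name.

```perl
use strict;
use warnings;
use 5.010;

# (?<nest> ...) names the group; (?&nest) recurses into it.
my $balanced = qr/\A (?<nest> \[ (?&nest)* \] ) \z/x;

for my $str ( '[[][[][]]]', '[[]', '[]]' ) {
    say "$str: ", ( $str =~ $balanced ? 'balanced' : 'not balanced' );
}
```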

Building a proper grammar for something like HTML would be harder,
especially if you wanted to keep it readable, but I expect it would be
possible. Certainly something simple that tracked comment/not-comment/
tag/not-tag would not be too hard, and would be sufficient for many
purposes.
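
A minimal sketch of such a tracker, under loud assumptions: no '>'
inside attribute values, no CDATA or script content, and href values
always quoted. The names tokenize and links_outside_comments are
hypothetical. It is still pattern matching rather than an HTML parser,
but it does skip links that sit inside comments.

```perl
use strict;
use warnings;

# Hypothetical helper: split markup into [type, text] tokens.
sub tokenize {
    my ($html) = @_;
    my @tokens;
    for ($html) {
        while (1) {
            if    (/\G(<!--.*?-->)/gcs) { push @tokens, [ comment => $1 ] }
            elsif (/\G(<[^>]*>)/gc)     { push @tokens, [ tag     => $1 ] }
            elsif (/\G([^<]+)/gc)       { push @tokens, [ text    => $1 ] }
            elsif (/\G(<)/gc)           { push @tokens, [ text    => $1 ] } # stray '<'
            else                        { last }    # end of input
        }
    }
    return @tokens;
}

# Hypothetical helper: href values of <a> tags that are not commented out.
sub links_outside_comments {
    my ($html) = @_;
    my @links;
    for my $tok ( tokenize($html) ) {
        my ( $type, $text ) = @$tok;
        next unless $type eq 'tag';
        push @links, $1
            if $text =~ /^<a\b[^>]*?\bhref\s*=\s*["']([^"']*)["']/i;
    }
    return @links;
}

my $doc = '<p><a href="real.html">Real</a></p>'
        . '<!-- <a href="fake.html">Fake</a> -->';

# Prints "real.html" only.
print "$_\n" for links_outside_comments($doc);
```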

> > I think there's an argument that, considering you can do this so easily 
> > (in under 15 lines of code) without the overhead of unnecessary 
> > includes, my way would be more efficient.
> 
> 
> Do you want easy and wrong or hard and correct?

I want easy and correct, so I'll use a module :).

> "pattern matching" is often "good enough", but you should realize
> its fragility so that you can assess whether it is worth the ease
> of implementation or not.

I just quoted that because I think it bears repeating.

Ben



------------------------------

Date: Wed, 24 Mar 2010 17:29:24 -0700
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: Perl HTML searching
Message-Id: <cualq5ldsmg4hbgln7n3c2so2gf9nr8kut@4ax.com>

"Kyle T. Jones" <KBfoMe@realdomain.net> wrote:
>Jürgen Exner wrote:
>> "Kyle T. Jones" <KBfoMe@realdomain.net> wrote:
>>> Tad McClellan wrote:
>>>> Kyle T. Jones <KBfoMe@realdomain.net> wrote:
>>>>> Steve wrote:
>>>>>> like lets say I searched a site
>>>>>> that had 15 news links and 3 of them said "Hello" in the title.  I
>>>>>> would want to extract only the links that said hello in the title.
>>>>> Read up on perl regular expressions.
>>>>
>>>> While reading up on regular expressions is certainly a good idea,
>>>> it is a horrid idea for the purposes of parsing HTML.
>>>>
>>> Ummm.  Could you expand on that?
[...]
>> Regular expressions recognize regular languages. But HTML is a
>> context-free language and therefore cannot be recognized solely by a
>> regular parser. 
[...]
>> But you cannot! Ever heard of the Chomsky Hierarchy? No recollection of
>> Theory of Computer Languages or Basics of Compiler Construction?
>> What do people learn in Computer Science today?
>
>But isn't the Chomsky Hierarchy completely irrelevant in this (forgive 
>the pun) context?  Surely you "get" that my input is analyzed in terms 
>of being nothing more or less than a sequence of characters - that it 
>was originally written in HTML, or any other CFG-based language, is 
>meaningless - both syntactical and semantical considerations of that 
>original language are irrelevant in the (again, forgive me) context of 
>what I'm attempting - which is simply to match one finite sequence of 
>characters against another finite sequence of characters - I could care 
>less what those characters mean, what href indicates, what a <body> tag 
>is, etc.

True. If you know exactly what format your input can possibly have (and
if that input can be described using a finite state automaton) then by
all means yes, go for it. REs are perfect for such tasks.

But that is not what you have been asking, see the Subject of this
thread.

>I believe what you say above is true - to truly "parse" the page AS HTML 
>is beyond the ability of REs - but I'm not parsing anything AS HTML, if 
>that makes sense.  In fact, to take that a step further, I'm not 
>"parsing" period - so perhaps it was a mistake for me to use that term. 
>  I meant to use the term colloquially, sorry if that caused any confusion.

Well, yes and no. If you are in control of the format and you know
exactly what format is allowed and which formats are not allowed, then
you are right.
But if you are not in control of the input format, e.g. you are reading
from a third-party web page or you get your input data from finance or
marketing or the subsidiary on the opposite side of the world, then your
code must be able to handle any legal HTML because the format could be
changed on you at any time. Which in turn means you must formally parse
the HTML code as HTML code; there is just no way around it.

jue


------------------------------

Date: Wed, 24 Mar 2010 18:36:11 -0700
From: sln@netherlands.com
Subject: Re: Perl HTML searching
Message-Id: <aoalq5hdhsmeaqpr2l78ff5pckngnbk7h2@4ax.com>

On Wed, 24 Mar 2010 23:40:52 +0000, Ben Morrow <ben@morrow.me.uk> wrote:

>
>Quoth Tad McClellan <tadmc@seesig.invalid>:
>> 
>> "pattern matching" is not at all the same as "parsing".
>> 
>> Regular expressions are *great* for pattern matching.
>> 
>> It is mathematically impossible to do a proper parse of a context-free
>> lanuguage such as HTML with nothing more than regular expressions.
>> 
>> They do not contain the requisite power.
>> 
>> Google for the "Chomsky hierarchy".
>> 
>> HTML allows a table within a table within a table within a table,
>> to an arbitrary depth. ie. it is not "regular".
>
>Perl's regexen are not regular. With the new features in 5.10 it's easy
>to match something like that (it was possible before with (??{}), but
>not easy):
>
>    perl -E'"[[][[][]]]" =~ m!(?<nest> \[ (?&nest)* \] )!x 
>        and say $+{nest}'
>    [[][[][]]]
>
     ^^^^^^^^^^
All this shows is matching of balanced '[' and ']' characters using the
recursive ability of the 5.10 engine.

Could this be an example in which each square bracket stands for a
markup instruction, like <tag>?
It certainly doesn't pertain to the '<' angle brackets, the
parsing delimiter of the instruction.

HTML does not require closing tags, so as embedded markup instructions
interspersed with content are parsed, the processor has to guess, when
errors are found, where an instruction stops applying to its context
and, in general, where the nesting ends.

The markup delimiter '<' separates the markup instruction from the
content. That is the first level of parsing: extracting the
instruction from its delimiter, and thereby the content. The second
level is structuring the markup instructions within the content.

When a complete discrete structure is obtained, the document processor
renders it, a chunk at a time, mid-stream.

The first level, separating markup instructions from their delimiters
(and, as a side effect, exposing content), can be done by any language
that can compare characters.

The second level can be done by any language that can do a stack
or nested variables.
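
A minimal sketch of that second level with a plain Perl array as the
stack. It assumes, unlike real-world HTML, that every element has an
explicit closing tag and that no comments or '>' characters appear
inside tags; max_nesting is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: deepest nesting of <$name> elements, or undef
# when the opening and closing tags do not pair up.
sub max_nesting {
    my ( $markup, $name ) = @_;
    my $want    = lc $name;
    my $deepest = 0;
    my @stack;

    while ( $markup =~ m{<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>}g ) {
        my ( $closing, $tag ) = ( $1, lc $2 );
        if ($closing) {
            # A closing tag must match the most recently opened element.
            return unless @stack && $stack[-1] eq $tag;
            pop @stack;
        }
        else {
            push @stack, $tag;
            my $depth = grep { $_ eq $want } @stack;
            $deepest = $depth if $depth > $deepest;
        }
    }
    return @stack ? undef : $deepest;
}

my $doc = '<table><tr><td><table><tr><td>x</td></tr></table></td></tr></table>';
print max_nesting( $doc, 'table' ), "\n";    # prints 2
```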

There is no place for balanced text processing for the first
level of parsing markup instructions. Instructions within
instructions are NOT well formed and will be kicked out of
processors.

So essentially, as slow as it can be, if the aim is to peel away
delimiters to expose the markup instruction, regular expressions
work great. C processors work about 100 - 500 times faster but
don't have the ability to give extended (look-ahead) errors,
nor will they self-correct and continue. In most cases, a
regular expression can identify errant markup instruction syntax
while correctly encapsulating the delimiting expression.
If there is an errant '<' delimiter in content, it is not
well-formed but is still captured as content and easily reported.

Overall, there is no requirement for processors to stop on input
that is not well-formed, but most do because they are full featured
and compliant. Most go out and bring in includes, do substitutions,
reparse, etc.

No, you won't get that with regular expressions, but there
is nothing stopping anybody from using them to parse out
markup instructions and content, nothing at all. Comparing
characters is all you do.

The reason regex is so slow is that it does pattern matching
with backtracking, grouping, etc.

This doesn't mean it can't compare characters; it sure can,
and in a variable way that allows looking ahead, which has
benefits over state processing.

As long as the regex takes into account ALL possible markup
instructions and delimiters as exclusionary items, there is
no reason why it can't be used to find specific sub-patterns
either in content or in the markup instructions themselves.

And it can drive over and re-align after discrete syntax errors without
stopping. All in all, it's a niche parser, and perfect at times
when a DOM or SAX is just too cumbersome, too much code overhead
for something simple.

-sln


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 2886
***************************************

