[24243] in Perl-Users-Digest


Perl-Users Digest, Issue: 6434 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Apr 20 18:11:01 2004

Date: Tue, 20 Apr 2004 15:10:13 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Tue, 20 Apr 2004     Volume: 10 Number: 6434

Today's topics:
    Re: regular expression module? <pkent77tea@yahoo.com.tea>
    Re: regular expression module? (Walter Roberson)
    Re: Request for program test on different operating syt <edgrsprj@ix.netcom.com>
    Re: Request for program test on different operating syt <pkent77tea@yahoo.com.tea>
    Re: slurp not working? ideas please! <geoffacox@dontspamblueyonder.co.uk>
    Re: slurp not working? ideas please! <geoffacox@dontspamblueyonder.co.uk>
    Re: slurp not working? ideas please! <geoffacox@dontspamblueyonder.co.uk>
    Re: slurp not working? ideas please! <geoffacox@dontspamblueyonder.co.uk>
    Re: slurp not working? ideas please! <tassilo.parseval@rwth-aachen.de>
    Re: Writing fast(er) performing parsers in Perl <clint@0lsen.net>
    Re: Writing fast(er) performing parsers in Perl <uri.guttman@fmr.com>
    Re: XML::Xerces questions <tadmc@augustmail.com>
    Re: XML::Xerces questions <pkent77tea@yahoo.com.tea>
    Re: XML::Xerces questions <apollock11@hotmail.com>
    Re: XML::Xerces questions <jgibson@mail.arc.nasa.gov>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Tue, 20 Apr 2004 20:02:45 +0100
From: pkent <pkent77tea@yahoo.com.tea>
Subject: Re: regular expression module?
Message-Id: <pkent77tea-9AB303.20023020042004@pth-usenet-02.plus.net>

In article <c63ivi$5il$1@canopus.cc.umanitoba.ca>,
 roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) wrote:

<snip interesting stuff about regexes>

> 'yacc' is a well known program that does all the work of building
> up compiled DFA, complete with some kind of state-table compression
> so that the result is both space and time efficient. But yacc produces
> C code. I seem to recall hearing of a yacc implementation that
> had an option for spitting out perl. Or was that a 'lex' program?
> I do not remember clearly.

Would you be thinking of yapp? As in:

http://search.cpan.org/~fdesar/Parse-Yapp-1.05/lib/Parse/Yapp.pm

specifically it says "The script yapp is a front-end to the Parse::Yapp 
module and let you easily create a Perl OO parser from an input grammar 
file." and "Parse::Yapp should compile a clean yacc grammar without any 
modification..."

P

-- 
pkent 77 at yahoo dot, er... what's the last bit, oh yes, com
Remove the tea to reply


------------------------------

Date: 20 Apr 2004 20:24:15 GMT
From: roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson)
Subject: Re: regular expression module?
Message-Id: <c640tf$nn9$1@canopus.cc.umanitoba.ca>

In article <pkent77tea-9AB303.20023020042004@pth-usenet-02.plus.net>,
pkent  <pkent77tea@yahoo.com.tea> wrote:
:In article <c63ivi$5il$1@canopus.cc.umanitoba.ca>,
: roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) wrote:

:<snip interesting stuff about regexes>

:> 'yacc' is a well known program that does all the work of building
:> up compiled DFA, complete with some kind of state-table compression
:> so that the result is both space and time efficient. But yacc produces
:> C code. I seem to recall hearing of a yacc implementation that
:> had an option for spitting out perl. Or was that a 'lex' program?
:> I do not remember clearly.

:Would you be thinking of yapp? As in:

:http://search.cpan.org/~fdesar/Parse-Yapp-1.05/lib/Parse/Yapp.pm

I was not, but thank you for the reference. I seem to recall seeing
[years ago] a version of yacc with an explicit perl option. I might
be able to find it again if I looked around.

But getting back to my original question: although Yapp does appear to
have a lot of functionality useful for my purposes, the module that
I am remembering as having seen mentioned, was, as best I recall,
just an RE (as opposed to regex) compiler and FSM driver, without the
full LALR parsing facilities of yacc/yapp. Useful if, for example,
one had a large number of simple RE's (or just plain strings) to match
against. join('|', @wordlist) is not very efficient to match against
in perl, even if one qr's it, as perl backtracks because it assumes
later portions of the expression might require the full regexp power.
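
For the "just plain strings" case mentioned above, a hash lookup avoids the regex engine altogether. What follows is only a minimal sketch, assuming the input can be split into whitespace-separated tokens; the names @wordlist, %is_word and count_hits are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical word list; in practice this could hold thousands of entries.
my @wordlist = qw(yacc lex yapp bison);

# Build the lookup table once: each word becomes a hash key.
my %is_word = map { $_ => 1 } @wordlist;

# Count how many whitespace-separated tokens of $text are in the list.
sub count_hits {
    my ($text) = @_;
    return scalar grep { $is_word{$_} } split ' ', $text;
}

# For comparison, the alternation form: quotemeta protects metacharacters
# and the \b anchors keep "lex" from matching inside "flex".
my $alt = join '|', map { quotemeta } @wordlist;
my $re  = qr/\b(?:$alt)\b/;
```

The hash version only answers "is this exact token in the list?", so it is no substitute for real REs, but for fixed strings it sidesteps the backtracking concern entirely.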

Maybe I should be looking more closely at the regexp optimizer... but
boy does it produce ugly expressions!  ;-)
-- 
Can a statement be self-referential without knowing it?


------------------------------

Date: Tue, 20 Apr 2004 18:49:45 GMT
From: "edgrsprj" <edgrsprj@ix.netcom.com>
Subject: Re: Request for program test on different operating systems
Message-Id: <dhehc.16434$l75.9247@newsread2.news.atl.earthlink.net>

"Tom" <t@REMOVETHISbrowse.to> wrote in message
news:c6359b$8if$1@sparta.btinternet.com...

> Are those print lines supposed to be the same?

No.  You are correct on that.  I had the Windows 98 command correct until I
made a last minute change which somehow eliminated the right one.  It should
be:

print file 'c:\windows\progman.exe '.$fileresults, "\n";

However, as I said in another note, although that command appears to work on
my system, I cannot get it to produce completely satisfactory results.

The goal here is to have the Perl program tell whatever operating system you
are using to use one of its standard text editors to open the results.out
text file when it is done with a run.  That is easy to do with Windows XP on
a regular PC.  In fact I would say that the results are more than
satisfactory.  The process works great with all types of possible options.
But I have not yet been able to find a good way to do that with Windows 98
on a regular PC.  And I have no idea how to get it to work on any other
types of system.  Hence my original request for other people to give it a
try.

A Perl program test needs to be run.  People can try different types of
commands and see if they can find one which will tell their operating
systems to have a regular text editor open some text file.  Then if they let
me know what that command is and what type of system they are using I will
try to store a copy of the command as an option in this scientific program I
am starting to circulate.
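
One way to organise such per-system commands is to key them on Perl's $^O variable, which names the operating system Perl was built for. The sketch below is only an illustration under stated assumptions: the sub name editor_command is hypothetical, and the editor chosen for each system (notepad.exe, "open -t", $EDITOR or vi) is a guess that would need the kind of testing requested above.

```perl
use strict;
use warnings;

# Return a command (as a list) that opens $file in a plain text editor.
# The per-OS choices below are assumptions, not tested recommendations.
sub editor_command {
    my ($os, $file) = @_;
    return ('notepad.exe', $file) if $os eq 'MSWin32';   # Windows 98/XP
    return ('open', '-t', $file)  if $os eq 'darwin';    # Mac OS X
    return ($ENV{EDITOR} || 'vi', $file);                # Unix-like fallback
}

# Typical use at the end of a run ($^O names the current OS):
# system( editor_command($^O, 'results.out') ) == 0
#     or warn "could not start editor: $?\n";
```

Note that system() waits until the editor exits, so the Perl program does not finish until the results window is closed.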


> FWIW start.exe is not a file on either my Win98 or Win2000
> I doubt if it is in XP

> it should be something like:    %COMSPEC% /c start ....

> FYI on my system, that's     E:\WINNT5\system32\CMD.EXE
> (No C: Drive!)

I can't find any of those files or directories on either my regular Windows
XP or backup Windows 98 PC computers.  Different systems must use different
types of directory structures.  I have never used Windows on anything other
than a standard IBM type PC so I don't know too much about this.

At least with Windows it appears to me to be a little difficult to have any
type of batch file direct a Windows program such as Notepad.exe to start
running.  DOS programs seem to be easy to start from batch files or Perl
programs.




------------------------------

Date: Tue, 20 Apr 2004 20:09:41 +0100
From: pkent <pkent77tea@yahoo.com.tea>
Subject: Re: Request for program test on different operating systems
Message-Id: <pkent77tea-59F3E6.20094120042004@pth-usenet-02.plus.net>

In article <slrnc8945p.5cs.tadmc@magna.augustmail.com>,
 Tad McClellan <tadmc@augustmail.com> wrote:

<someone posted code which wrapped>

> You should refactor your code so that there _are no_ long lines,
> then word wrap will never be a concern.

Not that this is hugely relevant, but one of the many things I do like 
about perl is that you can break statements across multiple lines quite 
happily. No backslashes, no continuation marker, no "physical line 
equals logical line", no tedious mucking about.

NB: I didn't see the original bit of code though.

#!perl
print
"hello".
" ".
"world
"
;
__END__


P // no I'm not advocating that you all write code one word to a line

-- 
pkent 77 at yahoo dot, er... what's the last bit, oh yes, com
Remove the tea to reply


------------------------------

Date: Tue, 20 Apr 2004 18:33:01 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: slurp not working? ideas please!
Message-Id: <3vqa80t2rc2e1tm0nj5gqb57b8pjcc40sn@4ax.com>

On Tue, 20 Apr 2004 12:47:58 -0500, "J. Gleixner"
<glex_nospam@qwest.invalid> wrote:


>Could also use "perltidy" which does a pretty good job.

thanks

>$next3 is 'undef'ined.
>
>What are the 4 lines of allphp2.php following the line with the 
>$pattern?  Compare them to the values of $curr, $next1, $next2, and 
>$next3.  Do they match?
>
>Possibly, $pattern isn't found, and the "last" is never performed, 
>probably want a flag there.
>
>my $found;
>while (<INNN>) {
>	if (/$pattern/) { $found=1; last; }
>}
>
>return if !$found;  # Or print some error message..
>my ( $curr, $next1, $next2, $next3 ) = <INNN>;

will try this out. you will see from another of my posts that the code
in sub classroomnotes works when in a simplified script, on its own,
but not when in the full script. I have given both in that post. So,
it seems that the lines are there to be found! Very odd and no doubt a
simple explanation is out there!

Cheers

Geoff


>
>Or possibly you're off by one line.
>
>If $pattern is found, then the second <INNN> starts with the line after 
>the line with the match.  Meaning, $curr would be the line after the 
>match.  Guessing by the names of your variables, maybe you want:
>
>my ($next1, $next2, $next3 ) = <INNN>;
>
>Also, you don't need the ()'s for print.
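
The two points quoted above (notice when the pattern is never found, and be clear about which line comes first after the match) can be combined in one small routine. This is a sketch only: lines_after_match is a hypothetical name, the pattern is treated as a literal string via \Q...\E (an assumption), and the data is an in-memory sample rather than the real allphp2.php. Unlike the quoted suggestion, it returns the matching line itself as the first element.

```perl
use strict;
use warnings;

# Return the line matching $pattern and up to three lines after it,
# or an empty list when the pattern never occurs.
sub lines_after_match {
    my ($fh, $pattern) = @_;
    while ( my $line = <$fh> ) {
        next unless $line =~ /\Q$pattern\E/;
        my @found = ($line);
        for ( 1 .. 3 ) {
            my $next = <$fh>;          # scalar context: one line per read
            last unless defined $next;
            push @found, $next;
        }
        return @found;
    }
    return;    # pattern not found
}

# Demonstration on an in-memory file (sample data, not the real allphp2.php).
my $sample = "a\nfinance/finance\nb\nc\nd\ne\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my ( $curr, $next1, $next2, $next3 ) = lines_after_match( $fh, 'finance/finance' );
close $fh;
```

Reading one line at a time in scalar context also avoids pulling the whole remainder of the file into memory, which is what the list assignment from the filehandle does.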



------------------------------

Date: Tue, 20 Apr 2004 18:35:45 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: slurp not working? ideas please!
Message-Id: <56ra80lelmsougljb2fldsln2msr9i45tn@4ax.com>

On 20 Apr 2004 16:14:45 GMT, "Tassilo v. Parseval"
<tassilo.parseval@rwth-aachen.de> wrote:

>Also sprach Anno Siegel:
>
>> Geoff Cox  <geoffacox@dontspamblueyonder.co.uk> wrote in comp.lang.perl.misc:
>
>>> I am quite prepared to admit that the code is not very well written
>>> but apart from this particular problem, it does work. I have left out
>>> large parts of the code which do work ...
>> 
>> Then show your code as it is now.  A sub that defines (non-anonymous)
>> subs in its body is so much off kilter, it's impossible to guess what
>> it should and shouldn't do.
>
>Actually, the code doesn't define functions inside others. The indenting
>merely suggests it does. :-) The code is probably a bit better than it
>looks on first sight (after all, it was in major parts written by me in
>a previous thread;-).

Tassilo

I very much like the way you put that! I am reading perldoc perlstyle
and will make some effort there. Do you use any particular editor?

You can see from one of my posts here that the same code works in the
simplified script but not in the full script. Odd?! 

Cheers

Geoff



>To the OP: Please fix the indenting first (just as Uri has told you). As
>it currently is, it is deliberately misleading its readers. Maybe this
>will already help you to spot the problem yourself.
>
>Tassilo



------------------------------

Date: Tue, 20 Apr 2004 20:41:13 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: slurp not working? ideas please!
Message-Id: <qd2b80pqasm0inb1om5sqq7p5fhta1nfdb@4ax.com>

On Tue, 20 Apr 2004 08:19:32 -0500, Tad McClellan
<tadmc@augustmail.com> wrote:

Tad,

I have used the perltidy program, so I hope this looks OK?

The code in the first section below works OK in that it does find the
links to the full docs. The same code in the second section does not
work - no value is found for $next3 in the sub classroomnotes....does
this improved layout make it easier for you to see why?

Cheers

Geoff


-------------------------
my $pattern = "docs/aslevel/classroom-notes/finance/finance";

open (INNN, "d:/a-keep9/short-nondb/allphp/allphp2.php");
open (OUT, ">>d:/a-keep9/short-nondb/short/members2/test.htm");


while (<INNN>){
    last if /$pattern/;
    }
my ($curr, $next1, $next2, $next3) = <INNN>;
    print ("$curr - $next1 - $next2 - $next3 \n");
    close (INNN);

    if ($next3 =~ /\$i\<(\d+);/) {

    my $nn = $1;

   print ("\$nn = $nn \n");
    
    print OUT ("<td valign='top'> \n");
    for ($c=1;$c<$nn;$c++) {
    print OUT ('<a href="'. $pattern . "-doc" . $c . ".zip" . '">' .
"Document$c" . "</a><br>" . "\n");
    }     
    print OUT ("</td></tr>\n");
                                 }

------------------------


package MyParser;
use base qw(HTML::Parser);
use strict;

my $in_heading;
my $p;

my $name = "as-left.htm";
#open (IN, "d:/a-keep9/short-nondb/oldshort2/$name") ||
#die "cannot open d:/a-keep9/short-nondb/oldshort2/$name \n";
open (OUT, ">>d:/a-keep9/short-nondb/short/members2/$name") || die
"cannot open >>d:/a-keep9/short-nondb/short/members2/$name: $! \n";


print OUT ("<html><head><title>test</title></head><body> \n");
print OUT ("<table width='100%' border='1'> \n");

sub start {

        my ($self, $tagname, $attr, undef, $origtext) = @_;

	if ($tagname eq 'h2') {
	    $in_heading = 1;
	    return;
                              }

        if ($tagname eq 'p') {
            $p = 1;
	    return;
                             }
       
         if ($tagname eq 'option') {

           choice($attr->{ value });

           print ("\$attr etc = $attr->{ value } \n");

                                                     }
        
        }

        sub end         {
        my ($self, $tagname, $origtext) = @_;
	if ($tagname eq 'h2') {
	    $in_heading = 0;
	    return;
                              }


         if ($tagname eq 'p') {
            $p = 0;
	    return;
                              }
                        }
    
    sub text       {
        my ($self, $origtext) = @_;
        print OUT ("<h2>$origtext</h2> \n") if $in_heading;
        print OUT ("<p>$origtext</p> \n") if $p;

                   }

sub choice {
my ($path) = @_;
 
if ($path =~ /docs\/aslevel\/classroom-notes/) {
  intro($path);
  aslevelclassroomnotes($path);
                                               } 

           }

sub intro {

my ($pathhere) = @_;

open (INN, "d:/a-keep9/short-nondb/db/total-160404.txt") || die
"cannot open d:/a-keep9/short-nondb/db/total-160404.txt: $! \n";

my $lineintro;

    while (defined ($lineintro = <INN>)) {
           if ($lineintro =~ /$pathhere','(.*?)'\)\;/) {
           print OUT ("<tr><td>$1 <p> </td>\n");
           }
    }
}



sub aslevelclassroomnotes {

my ($pattern) = @_;
my $c;
my $line;

#print ("\$pattern has value $pattern \n");

open (PHP, "d:/a-keep9/short-nondb/allphp/allphp2.php") || die "cannot
open d:/a-keep9/short-nondb/allphp/allphp2.php \n";

while (<PHP>){
# print "  eof()=", eof() ? "true\n" : "false\n";
 print ("\$pattern has value $pattern \n");
#    $line = $_;
#    print ("\$line = $_ \n");
    last if /$pattern/;
             }
my ($curr, $next1, $next2, $next3) = <PHP>;
print ("curr is $curr next1 is $next1 next2 is $next2 next3 is $next3
\n");
    close (PHP);

    if ($next3 =~ /\$i\<(\d+);/) {
    my $nn = $1;
    print OUT ("<td valign='top'> \n");
        for ($c=1;$c<$nn;$c++) {
        print OUT ('<a href="'. $pattern . "-doc" . $c . ".zip" . '">'
 . "Document$c" . "</a><br>" . "\n");
        }     
        print OUT ("</td></tr>\n");
        }
    }


package main;
open (IN, "d:/a-keep9/short-nondb/oldshort2/$name") || die "cannot
open package main d:/a-keep9/short-nondb/oldshort2/$name: $! \n";
undef $/;
my $html = <IN>;
my $parser = MyParser->new;
$parser->parse($html);

open (OUT, ">>d:/a-keep9/short-nondb/short/members2/$name");
print OUT ("</tr></table> \n");
print OUT ("</body></html> \n");





------------------------------

Date: Tue, 20 Apr 2004 20:50:16 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: slurp not working? ideas please!
Message-Id: <623b80lktubsr6pc78jkobe7qsmc07jqjm@4ax.com>

On Tue, 20 Apr 2004 20:41:13 GMT, Geoff Cox
<geoffacox@dontspamblueyonder.co.uk> wrote:

>The code in the first section below works OK in that it does find the
>links to the full docs. The same code in the second section does not
>work - no value is found for $next3 in the sub classroomnotes....does
>this improved layout make it easier for you to see why?

Good grief ! I posted the wrong code layout - the following was
produced by perltidy ....

Geoff

-------------------------
my $pattern = "docs/aslevel/classroom-notes/finance/finance";

open( INNN, "d:/a-keep9/short-nondb/allphp/allphp2.php" );
open( OUT,  ">>d:/a-keep9/short-nondb/short/members2/test.htm" );

while (<INNN>) {
    last if /$pattern/;
}
my ( $curr, $next1, $next2, $next3 ) = <INNN>;
print("$curr - $next1 - $next2 - $next3 \n");
close(INNN);

if ( $next3 =~ /\$i\<(\d+);/ ) {

    my $nn = $1;

    print("\$nn = $nn \n");

    print OUT ("<td valign='top'> \n");
    for ( $c = 1 ; $c < $nn ; $c++ ) {
        print OUT (   '<a href="' . $pattern . "-doc" . $c . ".zip" .
'">'
                    . "Document$c"
                    . "</a><br>"
                    . "\n" );
    }
    print OUT ("</td></tr>\n");
}

-----------------------------------------------------------

package MyParser;
use base qw(HTML::Parser);

use strict;

my $in_heading;
my $p;

my $name = "as-left.htm";

#open (IN, "d:/a-keep9/short-nondb/oldshort2/$name") ||
#die "cannot open d:/a-keep9/short-nondb/oldshort2/$name \n";
open( OUT, ">>d:/a-keep9/short-nondb/short/members2/$name" )
  || die "cannot open >>d:/a-keep9/short-nondb/short/members2/$name:
$! \n";

print OUT ("<html><head><title>test</title></head><body> \n");
print OUT ("<table width='100%' border='1'> \n");

sub start {

    my ( $self, $tagname, $attr, undef, $origtext ) = @_;

    if ( $tagname eq 'h2' ) {
        $in_heading = 1;
        return;
    }

    if ( $tagname eq 'p' ) {
        $p = 1;
        return;
    }

    if ( $tagname eq 'option' ) {

        choice( $attr->{value} );

        print("\$attr etc = $attr->{ value } \n");

    }

}

sub end {
    my ( $self, $tagname, $origtext ) = @_;
    if ( $tagname eq 'h2' ) {
        $in_heading = 0;
        return;
    }

    if ( $tagname eq 'p' ) {
        $p = 0;
        return;
    }
}

sub text {
    my ( $self, $origtext ) = @_;
    print OUT ("<h2>$origtext</h2> \n") if $in_heading;
    print OUT ("<p>$origtext</p> \n")   if $p;

}

sub choice {
    my ($path) = @_;

    if ( $path =~ /docs\/aslevel\/classroom-notes/ ) {
        intro($path);
        aslevelclassroomnotes($path);
    }

}

sub intro {

    my ($pathhere) = @_;

    open( INN, "d:/a-keep9/short-nondb/db/total-160404.txt" )
      || die "cannot open d:/a-keep9/short-nondb/db/total-160404.txt:
$! \n";

    my $lineintro;

    while ( defined( $lineintro = <INN> ) ) {
        if ( $lineintro =~ /$pathhere','(.*?)'\)\;/ ) {
            print OUT ("<tr><td>$1 <p> </td>\n");
        }
    }
}

sub aslevelclassroomnotes {

    my ($pattern) = @_;
    my $c;
    my $line;

    #print ("\$pattern has value $pattern \n");

    open( PHP, "d:/a-keep9/short-nondb/allphp/allphp2.php" )
      || die "cannot open d:/a-keep9/short-nondb/allphp/allphp2.php
\n";

    while (<PHP>) {

        # print "  eof()=", eof() ? "true\n" : "false\n";
        print("\$pattern has value $pattern \n");

        #    $line = $_;
        #    print ("\$line = $_ \n");
        last if /$pattern/;
    }
    my ( $curr, $next1, $next2, $next3 ) = <PHP>;
    print("curr is $curr next1 is $next1 next2 is $next2 next3 is
$next3 \n");
    close(PHP);

    if ( $next3 =~ /\$i\<(\d+);/ ) {
        my $nn = $1;
        print OUT ("<td valign='top'> \n");
        for ( $c = 1 ; $c < $nn ; $c++ ) {
            print OUT (   '<a href="' . $pattern . "-doc" . $c .
".zip" . '">'
                        . "Document$c"
                        . "</a><br>"
                        . "\n" );
        }
        print OUT ("</td></tr>\n");
    }
}

package main;
open( IN, "d:/a-keep9/short-nondb/oldshort2/$name" )
  || die "cannot open package main
d:/a-keep9/short-nondb/oldshort2/$name: $! \n";
undef $/;
my $html   = <IN>;
my $parser = MyParser->new;
$parser->parse($html);

open( OUT, ">>d:/a-keep9/short-nondb/short/members2/$name" );
print OUT ("</tr></table> \n");
print OUT ("</body></html> \n");
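
One difference between the two listings that may be worth checking (it is not confirmed anywhere in this digest): the full script runs "undef $/" at file scope before $parser->parse is called. By the time aslevelclassroomnotes runs, a single <PHP> read would therefore return the whole file, the while loop would consume everything, and the list assignment would have nothing left to read. The sketch below shows the effect on an in-memory file, with "local $/" as one way to confine slurp mode to a block; count_reads is a hypothetical helper.

```perl
use strict;
use warnings;

my $data = "one\ntwo\nthree\n";

# Count how many reads it takes to exhaust an in-memory copy of $data.
sub count_reads {
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    my $n = 0;
    $n++ while <$fh>;
    close $fh;
    return $n;
}

my $line_mode = count_reads();    # default $/ is "\n": one read per line

my $slurp_mode;
{
    local $/;                     # undef only inside this block
    $slurp_mode = count_reads();  # whole file arrives in one read
}

my $restored = count_reads();     # $/ is back to "\n" here
```

If this is indeed the cause, wrapping the slurp of <IN> in a block with "local $/" should leave the line-by-line reads in the subs unaffected.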




------------------------------

Date: 20 Apr 2004 21:29:05 GMT
From: "Tassilo v. Parseval" <tassilo.parseval@rwth-aachen.de>
Subject: Re: slurp not working? ideas please!
Message-Id: <c644n1$7p031$1@ID-231055.news.uni-berlin.de>

Also sprach Geoff Cox:

> On 20 Apr 2004 16:14:45 GMT, "Tassilo v. Parseval"
><tassilo.parseval@rwth-aachen.de> wrote:
> 
>>Also sprach Anno Siegel:
>>
>>> Geoff Cox  <geoffacox@dontspamblueyonder.co.uk> wrote in comp.lang.perl.misc:
>>
>>>> I am quite prepared to admit that the code is not very well written
>>>> but apart from this particular problem, it does work. I have left out
>>>> large parts of the code which do work ...
>>> 
>>> Then show your code as it is now.  A sub that defines (non-anonymous)
>>> subs in its body is so much off kilter, it's impossible to guess what
>>> it should and shouldn't do.
>>
>>Actually, the code doesn't define functions inside others. The indenting
>>merely suggests it does. :-) The code is probably a bit better than it
>>looks on first sight (after all, it was in major parts written by me in
>>a previous thread;-).
> 
> Tassilo
> 
> I very much like the way you put that! I am reading perldoc perlstyle
> and will make some effort there. Do you use any particular editor?

Well, yes, I do. But the editor used is a weak excuse for formatting
code poorly. If you cannot be bothered to do the indenting yourself, get
an editor that does it for you automatically. I use vim that does most
of the indenting for me. Others (like emacs) can do it as well.

Having said that, it took me one key-stroke to re-indent your code and
immediately realize that Anno was in fact right. There are nested
subroutine definitions. I didn't see them in your raw posting because
you managed to hide them by not indenting the relevant part.

I now see things that I didn't see before and honestly I find them
frightening:

    package MyParser;
    use base qw(HTML::Parser);
    use File::Find;

    my $in_heading;
    my $p;

    my $dir = ("d:/a-keep9/short-nondb/oldshort2");

    find sub {
	my $name = $_;
	open (OUT, ">>d:/a-keep9/short-nondb/short/members2/$name");
	print OUT ("<html><head><title>test</title></head><body> \n");
	print OUT ("<table width='100%' border='1'> \n");

	sub start { ... }
	sub end { ... }
	...

	package main;
	my $parser = MyParser->new;
	$parser->parse($html);
    }, $dir;

Think about this structure for a while. If you are able to explain why
you define the HTML::Parser callbacks inside the File::Find::find()
subroutine (which is, after all, triggered for each found file) then,
and only then, you will get my blessing for the above.

There are other dreadful things to see there, notably the switching to
package main inside the function reference.

The whole thing got a bit out of your hands, I am afraid. You have an
iterative task (namely parsing a bunch of files). Furthermore you have a
class (your HTML parser). A class is an abstract description...you only
define it once. Later you may create as many objects as you want that
all conform to the description given by the class. 

But the class itself only exists once. By its nature, it's the opposite
of iterative. That means the first thing to do is move the whole
File::Find stuff out of MyParser. It's just wrong there.

So the outline of your script should look like this:

    package MyParser;
    use base qw/HTML::Parser/;

    # global variables 
    my ($in_heading, $in_p);
    ...

    # handlers
    sub reset { ($in_heading, $in_p) = (0, 0) }
    sub start { ... }
    sub end   { ... }
    ...

    package main;
    
    use File::Find;
    
    my $dir = "....";
    my $parser = MyParser->new;

    find sub {
        $parser->parse_file($_);
        $parser->reset;
    } => $dir;

Ideally, your parser class doesn't even know what kind of task it is
used for. All it does is provide the infrastructure and facilities for
parsing HTML. That's it. Whether you pass one file or thousands of
files...that's not your parser class' business at all.

The iteration of the files is done with the actual object. You create
one parser. HTML::Parser is one of those cases where the object can be
reused; in other scenarios and classes you would create one object per
"problem instance" (in your case the "problem instance" is the file you
want to parse).

So you have this one parser and iteratively have it parse one file after
the other. That's what the above skeleton does. In MyParser the abstract
concept of parsing is defined once. In the main package you apply this
abstract concept to many files. This happens in these four lines:

    find sub {
        $parser->parse_file($_);
        $parser->reset;
    } => $dir;

I am quite sure your program will magically start working once you
change it according to what I've written.
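
The "define the class once, reuse one object per file" outline can be tried without HTML::Parser installed. In the sketch below, LineCounter is a hypothetical stand-in for MyParser that merely counts lines, and the two sample files are created in a scratch directory; only File::Find and File::Temp (both core modules) are assumed.

```perl
use strict;
use warnings;

package LineCounter;    # hypothetical stand-in for the MyParser class

sub new   { my ($class) = @_; return bless { lines => 0 }, $class }
sub reset { $_[0]{lines} = 0 }

# Count the lines of one file; per-file state lives in the object.
sub parse_file {
    my ( $self, $path ) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    $self->{lines}++ while <$fh>;
    close $fh;
    return $self->{lines};
}

package main;

use File::Find;
use File::Temp qw(tempdir);

# Set up two small sample files in a scratch directory.
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(a.txt b.txt)) {
    open my $out, '>', "$dir/$name" or die "cannot write $dir/$name: $!";
    print {$out} "x\ny\n";
    close $out;
}

my $parser = LineCounter->new;    # one object, created once
my %count;

find sub {
    return unless -f $_;          # find() also visits directories
    $count{$_} = $parser->parse_file($_);
    $parser->reset;               # clear state before the next file
}, $dir;
```

Dropping the reset call makes the second file report four lines instead of two, which is the kind of state leak the reset handler in the outline guards against.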

Tassilo
-- 
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval


------------------------------

Date: Tue, 20 Apr 2004 18:25:26 GMT
From: Clint Olsen <clint@0lsen.net>
Subject: Re: Writing fast(er) performing parsers in Perl
Message-Id: <slrnc8aqoi.rg7.clint@poly.0lsen.net>

On 2004-04-20, Walter Roberson <roberson@ibd.nrc-cnrc.gc.ca> wrote:
>
> Not if your grammar happens to be line-oriented (as has been the case for
> the parsing I've been doing.) In your approach, you have the overhead of
> tokenizing -everything-, but in line-oriented grammars there may be
> portions of the line that can be ignored (at least until the context
> calls upon them.)

Yes, I did mention that I was parsing a freeform language, not
line-oriented data.  In your case you are combining (or blurring the lines
between) lexing and parsing - and you're doing mostly lexing since you're
just separating fields.  Since you are guaranteed a complete command/phrase
on a single line, your approach makes sense.  

However, I don't have that luxury, which is why I was inquiring as to why
s///o is so much faster than m/\Gstuff/ogc, and why using read() is
considerably faster than: local $/ = undef; $data = <FILE>.

Now that I've made my changes, the lexing phase is still the largest chunk
of time at around 26%.  This is expected, but it's not anywhere near 70%
that it was before.

-Clint


------------------------------

Date: 20 Apr 2004 14:31:06 -0400
From: Uri Guttman <uri.guttman@fmr.com>
Subject: Re: Writing fast(er) performing parsers in Perl
Message-Id: <siscvfju5wnp.fsf@tripoli.fmr.com>

>>>>> "CO" == Clint Olsen <clint@0lsen.net> writes:


  CO> However, I don't have that luxury, which is why I was inquiring as
  CO> to why s///o is so much faster than m/\Gstuff/ogc, and why using
  CO> read() is considerably faster than: local $/ = undef; $data =
  CO> <FILE>.

read is faster since it doesn't have to check for EOF in a loop like the
slurp does. and sysread is even faster than read. and you can get that
speed with cleaner code by using File::Slurp which does fast sysread
under the hood and has many useful options.
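
The two ways of reading a whole file that are being compared look roughly like this. It is a sketch using core modules only; slurp_readline and slurp_read are hypothetical names, and no timings are claimed here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Slurp by clearing $/ inside the sub, so the change cannot leak out.
sub slurp_readline {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    local $/;
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Slurp with a single read() sized from the file length (-s).
sub slurp_read {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    my $size = -s $fh;
    my $got  = read( $fh, my $data, $size );
    defined $got or die "read failed on $path: $!";
    close $fh;
    return $data;
}

# Sample file for the comparison.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} "line $_\n" for 1 .. 1000;
close $tmp;
```

The core Benchmark module's cmpthese can be used to time the two subs against your own data before deciding which to keep.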

s/// is faster than m//g because it stays inside perl guts (c code) all
the time while m// has to be wrapped in a slower perl level loop.
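
To make the contrast concrete, here is a toy lexer written both ways. It is only a sketch: the token classes (NUM, IDENT, OP) and the sub names are invented for the example, and it does not attempt to reproduce the original poster's grammar or timings.

```perl
use strict;
use warnings;

# Style 1: m/\G.../gc in an explicit Perl loop, one token per iteration.
sub lex_with_G {
    my ($src) = @_;
    my @tokens;
    for ($src) {    # alias $src to $_ so the matches below share pos()
        while (1) {
            /\G\s+/gc            and next;
            /\G(\d+)/gc          and do { push @tokens, [ NUM   => $1 ]; next };
            /\G([A-Za-z_]\w*)/gc and do { push @tokens, [ IDENT => $1 ]; next };
            /\G(.)/gcs           and do { push @tokens, [ OP    => $1 ]; next };
            last;                # nothing matched: end of input
        }
    }
    return @tokens;
}

# Style 2: one s///ge pass; the regex engine drives the iteration and
# the replacement code only records each token.
sub lex_with_subst {
    my ($src) = @_;
    my @tokens;
    $src =~ s{\s*(?:(\d+)|([A-Za-z_]\w*)|(\S))}{
        push @tokens, defined($1) ? [ NUM   => $1 ]
                    : defined($2) ? [ IDENT => $2 ]
                    :               [ OP    => $3 ];
        '';
    }ge;
    return @tokens;
}
```

Both subs return the same list of [type, text] pairs for the same input, so they can be swapped in and out when profiling.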

also the /o is almost never needed anymore. perl will only recompile a
regex when it sees that an interpolated variable has changed since the
last time it was compiled.


  CO> Now that I've made my changes, the lexing phase is still the
  CO> largest chunk of time at around 26%.  This is expected, but it's
  CO> not anywhere near 70% that it was before.

try the File::Slurp module and you should get even more speedup.

uri


------------------------------

Date: Tue, 20 Apr 2004 14:34:30 -0500
From: Tad McClellan <tadmc@augustmail.com>
Subject: Re: XML::Xerces questions
Message-Id: <slrnc8auq6.9lr.tadmc@magna.augustmail.com>

Arvin Portlock <apollock11@hotmail.com> wrote:

> 1. The way to get validation errors seems incredibly odd
> to me:
> 
> eval {$parser->parse ($file)};
> print $@;
> 
> Is this the only way to get at error messages? Via $@?
> Does this wrapper provide a more direct method? Does this
> seem odd to anybody else in the perl community or is
> it just me?


   perldoc -f eval

      ...
      It is also Perl's exception trapping mechanism


"eval BLOCK" and "if $@" is Perl's "try" and "catch" mechanism.


-- 
    Tad McClellan                          SGML consulting
    tadmc@augustmail.com                   Perl programming
    Fort Worth, Texas


------------------------------

Date: Tue, 20 Apr 2004 20:34:44 +0100
From: pkent <pkent77tea@yahoo.com.tea>
Subject: Re: XML::Xerces questions
Message-Id: <pkent77tea-DCE35A.20344420042004@pth-usenet-02.plus.net>

In article <c63m7q$kt9$1@agate.berkeley.edu>,
 Arvin Portlock <apollock11@hotmail.com> wrote:

> 1. The way to get validation errors seems incredibly odd
> to me:
> 
> eval {$parser->parse ($file)};
> print $@;

It looks like parse() throws a fatal error, i.e. a die(), when it hits 
an error. A die() will basically exit the program unless you catch the 
exception in an eval block. And the thing that was caught in the eval 
block is held in the special variable $@.
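
A minimal sketch of that die/eval/$@ pattern follows. The validate sub is a hypothetical stand-in for $parser->parse, and its error text is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical validator that dies on bad input, standing in for parse().
sub validate {
    my ($doc) = @_;
    die "validation error: empty document\n" if $doc eq '';
    return 1;
}

# "try": run the call inside eval; "catch": inspect $@ afterwards.
sub try_validate {
    my ($doc) = @_;
    my $ok = eval { validate($doc); 1 };
    return $ok ? 'ok' : "caught: $@";
}
```

Testing the value returned by eval, rather than testing $@ alone, is a common precaution because $@ can be reset by code that runs while the eval block is being left.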

Now, sometimes the sensible thing to do when you encounter an 
unrecoverable error is to throw a fatal exception... sometimes it's 
sensible to return 'undef' and allow the caller to interrogate the 
object using a method such as lastError() or something... or maybe some 
other approach.

Sometimes the user and the module-writer have different ideas, and you 
end up thinking "this is a stupid way to detect an error in an XML 
document".

One underused (IME) feature of perl >5.005 is exception objects. This 
is where you call die() with an object, not a string. The object then 
ends up in $@ and you can call methods on it to examine the error. While 
this doesn't have Java's stricter exception model, it can still help 
out in cases like yours where currently you're just getting an error 
_string_ and you want to parse that string in some way or get other 
information.

Some discussion at
http://www.perl.com/pub/a/2002/11/14/exception.html
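
As a rough illustration of the idea, assuming a hand-rolled exception class rather than anything XML::Xerces provides: My::ParseError, parse_doc and check below are all hypothetical names.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

package My::ParseError;    # hypothetical exception class

sub new     { my ( $class, %args ) = @_; return bless {%args}, $class }
sub message { $_[0]{message} }
sub line    { $_[0]{line} }

package main;

# Hypothetical parser that reports failure by dying with an object.
sub parse_doc {
    my ($doc) = @_;
    if ( $doc !~ m{</root>} ) {
        die My::ParseError->new(
            message => 'unexpected end of input',
            line    => 3,
        );
    }
    return 1;
}

# Return a description of what went wrong, or 'ok'.
sub check {
    my ($doc) = @_;
    return 'ok' if eval { parse_doc($doc); 1 };
    my $err = $@;
    if ( blessed($err) && $err->isa('My::ParseError') ) {
        return sprintf 'line %d: %s', $err->line, $err->message;
    }
    return "unexpected error: $err";
}
```

The caller gets structured fields (line, message) instead of having to pick apart an error string, while plain string errors still fall through to the last branch.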

P

-- 
pkent 77 at yahoo dot, er... what's the last bit, oh yes, com
Remove the tea to reply


------------------------------

Date: Tue, 20 Apr 2004 12:36:21 -0700
From: Arvin Portlock <apollock11@hotmail.com>
Subject: Re: XML::Xerces questions
Message-Id: <c63u3n$o4b$1@agate.berkeley.edu>

Oh I know what eval {} and $@ are all about. I'm just used
to seeing it as a way to catch runtime errors, not as a built
in interface within a module to record messages. In fact
Xerces itself will experience runtime errors for certain
conditions. So it's still important to trap them in an eval.
But reporting validation errors doesn't seem the best use
for this, especially in a program which has as one of its
main functions the ability to validate a document. I was
hoping for something more along the lines of:

my $status = $parser->parse ($file);
if ($status->errors) {
    until ($status->errors->EOF) {
       print $status->errors->error;
       $status->errors->move_next();
    }
}

or something along those lines.

Tad McClellan wrote:

> Arvin Portlock  wrote:
>
>
> >1. The way to get validation errors seems incredibly odd
> >to me:
> >
> >eval {$parser->parse ($file)};
> >print $@;
> >
> >Is this the only way to get at error messages? Via $@?
> >Does this wrapper provide a more direct method? Does this
> >seem odd to anybody else in the perl community or is
> >it just me?
>
>
>
>    perldoc -f eval
>
>       ...
>       It is also Perl's exception trapping mechanism
>
>
> "eval BLOCK" and "if $@" is Perl's "try" and "catch" mechanism.
>
>





------------------------------

Date: Tue, 20 Apr 2004 13:57:48 -0700
From: Jim Gibson <jgibson@mail.arc.nasa.gov>
Subject: Re: XML::Xerces questions
Message-Id: <200420041357483877%jgibson@mail.arc.nasa.gov>

In article <c63m7q$kt9$1@agate.berkeley.edu>, Arvin Portlock
<apollock11@hotmail.com> wrote:

> I'm using the XML::Xerces module to validate batches of
> XML documents against a schema. 

[...]

> 
> 2. Is there any way to use local copies of the schemas
> rather than have Xerces fetch them from the web? In my
> XML documents the referenced schemas have the form:
> 
> xsi:schemaLocation="http://www.loc.gov/standards/mets/mets.xsd"
> 
> I.e., they are all URLs. 

Files can be URLs, too. Try "file:///path/to/file".


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc.  For subscription or unsubscription requests, send
#the single line:
#
#	subscribe perl-users
#or:
#	unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.  

NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice. 

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V10 Issue 6434
***************************************

