[19243] in Perl-Users-Digest


home	help	back	first	fref	pref	prev	next	nref	lref	last	post
Perl-Users Digest, Issue: 1438 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Fri Aug 3 18:10:34 2001

Date: Fri, 3 Aug 2001 15:10:13 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Message-Id: <996876612-v10-i1438@ruby.oce.orst.edu>
Content-Type: text

Perl-Users Digest           Fri, 3 Aug 2001     Volume: 10 Number: 1438

Today's topics:
        Perl dumps core :-(. <shah@typhoon.xnet.com>
    Re: Perl dumps core :-(. <shah@typhoon.xnet.com>
        Sending F Keys Through NET::Telnet (Chris Pickett)
    Re: Setting DOS Variables (Was: Matt Wrights formmail a <godzilla@stomp.stomp.tokyo>
    Re: Setting DOS Variables (Was: Matt Wrights formmail a <jcook@strobedata.com>
    Re: Setting DOS Variables (Was: Matt Wrights formmail a <godzilla@stomp.stomp.tokyo>
    Re: Splitting a text list into smaller lists? <krahnj@acm.org>
    Re: Strange behavior? (Anno Siegel)
    Re: Subject : randomizing a hash <superstu@stanford.edu>
    Re: Understanding parsers (Anno Siegel)
    Re: Understanding parsers (David Wall)
    Re: Understanding parsers (Eric Bohlman)
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Fri, 3 Aug 2001 20:02:06 +0000 (UTC)
From: Hemant Shah <shah@typhoon.xnet.com>
Subject: Perl dumps core :-(.
Message-Id: <9kevvu$65b$1@flood.xnet.com>



-- 
Hemant Shah                           /"\  ASCII ribbon campaign
E-mail: NoJunkMailshah@xnet.com       \ /  --------------------- 
                                       X     against HTML mail
TO REPLY, REMOVE NoJunkMail           / \      and postings      
FROM MY E-MAIL ADDRESS.           
-----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
I haven't lost my mind,                Above opinions are mine only.
it's backed up on tape somewhere.      Others can have their own.


------------------------------

Date: Fri, 3 Aug 2001 20:03:27 +0000 (UTC)
From: Hemant Shah <shah@typhoon.xnet.com>
Subject: Re: Perl dumps core :-(.
Message-Id: <9kf02f$65b$2@flood.xnet.com>


Sorry about my previous empty post. My news reader screwed up.

Folks,

   I have written a perl module that is used by several scripts to update our
   database. The scripts are called from a SUID program. Several of you in
   clpmisc, helped me in fixing the Insecure dependencies. Now I have a
   another problem. 

   Here is little background on what these scripts do. Each of the script
   scans different type of source code (COBOL, C, Java, etc) and build the
   dependency database (i.e. import statements in java, #include in C).

   The script that scans Java code dies with segmentation fault in some of the
   program. It fails on the different lines in Java programs, but at the same
   point in perl script. 

   I ran the perl script directly instead of calling from SUID program, and it
   did not fail, but if I add -T switch it fails again. I added -d switch to
   debug the code and it did not fail. How do I figure out what the problem
   is?

   Following is the fragment of my perl module where it fails.

$DotName = UnTaintVariable $ARGV[0];

sub UnTaintVariable
{

   my($Var) = @_;

   if ($Var  =~ /^([\/:\s\w.\,]+)$/) # Matches "/",":"," ",",",[A-Z,a-z,0-9]
   {
     $Var = $1;
     return ($Var);
   }
   return;  # Return undef
}


sub InsertDependencyData
{
   my($TLineNum,$TDependsOn,$TDepType) = @_;
   my($LineNum) = 0;
   my($DependsOn) = "";
   my($DepType) = "";

   $LineNum = UnTaintVariable $TLineNum;
   $DependsOn = UnTaintVariable $TDependsOn; # This is where it fails.
   $DepType = UnTaintVariable $TDepType;


   eval
   {
      $InsertStmt_Dependency_Hdl->execute($DotName, $DependsOn, $LineNum, $DepType);
      $InsertStmt_Dependency_Hdl->finish();
   };
   if ($@)
   {
      ExitErr("InsertDependencyData: $@");
   }
}



If I replace
$DependsOn = UnTaintVariable $TDependsOn;

with

   if ($TDependsOn =~ /^([\/:\s\w.\,]+)$/ )
   {
      $DependsOn = $1;
   }


which is exactly what UnTaintVariable does, the script does not fail.

When I execute the script with -wt, I get following error message.
Segmentation fault(coredump)



When I ran it through dbx I got following stack trace:

$ dbx /usr/local/bin/perl core
Type 'help' for help.
warning: The core file is truncated.  You may need to increasethe ulimit
for file and coredump, or free some space on the filesystem.
reading symbolic information ...warning: Unable to access address 0x20337ad0 from core
warning: Unable to access address 0x20337ad0 from core
warning: Unable to access address 0x20337ad4 from core
warning: Unable to access address 0x20337ad4 from core
warning: Unable to access address 0x20337ad8 from core
     .
     .
     .
     .
     .
     .
[using memory image in core]
warning: Unable to access address 0xf0043114 from core

Segmentation fault in free_y at 0xd016cf18
0xd016cf18 (free_y+0x16c) 8107000c        lwz   r8,0xc(r7)
(dbx) t
free_y(??, ??) at 0xd016cf18
free(??) at 0xd016b0a0
Perl_safefree(??) at 0x10023e64
Perl_mg_free(??) at 0x1000fb60
Perl_leave_scope(??) at 0x1000be8c
Perl_pop_scope() at 0x100094f0
Perl_pp_leavesub() at 0x10006d6c
Perl_runops_standard() at 0x100acb88
S_run_body(??) at 0x1002becc
perl_run(??) at 0x1002bb0c
main(??, ??, ??) at 0x100003e4
(dbx) 


Here is the version of perl I am using on AIX 4.3.3:
$ perl -v

This is perl, v5.6.0 built for aix



-- 
Hemant Shah                           /"\  ASCII ribbon campaign
E-mail: NoJunkMailshah@xnet.com       \ /  --------------------- 
                                       X     against HTML mail
TO REPLY, REMOVE NoJunkMail           / \      and postings      
FROM MY E-MAIL ADDRESS.           
-----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
I haven't lost my mind,                Above opinions are mine only.
it's backed up on tape somewhere.      Others can have their own.


------------------------------

Date: 3 Aug 2001 11:06:52 -0700
From: cpickett@c21rg.net (Chris Pickett)
Subject: Sending F Keys Through NET::Telnet
Message-Id: <a4183d67.0108031006.6e116e63@posting.google.com>

I am working on an interface script that will telnet to a server,
login, run a few commands and then capture screen output. But I need
to be able to send the return key, and f keys, specifically F9. I
tried just sending "F9", but that of course didn't work, does anyone
know how it is done?

Thanks,
Chris


------------------------------

Date: Fri, 03 Aug 2001 11:51:41 -0700
From: "Godzilla!" <godzilla@stomp.stomp.tokyo>
Subject: Re: Setting DOS Variables (Was: Matt Wrights formmail alert)
Message-Id: <3B6AF2BD.6C3FB99F@stomp.stomp.tokyo>

Jim Cook wrote:
 
> > > > The concept here, inspired by an article, is to set
> > > > a DOS environmental variable and have it "stick" even
> > > > after a Perl script closes. Otherwords, you could start
> > > > another script and access this variable; behavior just
> > > > like a proper .bat DOS file.

(snipped)
 
> The following snippet is something I used in DOS in 1992. This finds the
> primary command.com. The environment segment then has the stuff you can
> poke around in. I wrote my own "set" program so I could do stuff like
> "set a=1+1" to get the same as "set a=2".
 
> None of this works under NT, since the 16-bit app gets VDMmed and I only
> fiddle my own virtual space. Maybe this will give you something else to
> waste time with ... er ... play with.
 
> struct   mcb_struct {
>     byte end_flag;                              // M or Z


Thanks for this C code. I am quite talented at translating
C code into Perl code.

However, the true problem is a Perl script being a secondary
process, a child, unable to access the parent environment space.
It isn't a lack of environment space, it is a generic message
is generated, out of environment space, for instances of inability
to set an environment variable, regardless of reason for failure.
Clearly Bill Gates didn't visualize any need for more than 640K
of DOS memory, nor visual any need for scripting access to DOS
memory, however insufficient it is.

Could be there is a problem with DOS and Windows being blended
since more powerful and programmer friendly Win 3.x series.
For Win9.x and Win.me, DOS environment space is adjustable
automatically, given the right API style commands. Obviously,
typical Perl scripts lack this ability found in binary programs.

I visualize two possiblities. One possibility is to have a 
Perl script initiate and run under a second command process
and set its own environment variables, much like do many
binary executables. The other, develop a method for a child
process to access a parent's space. 

My kid is good at this; she is always in both my space and face.


Godzilla!   Daughter Of Her Daughter Mother.


"Mother, you CANNOT skinny dip while my friends are over."

"Why not? The boys quite apparently enjoy this."

"MOM! You know what I mean!"

"Ahh... but you don't know what I mean."

"That does it! You are grounded for a week, Mom. Hand over your 'Vette keys."


------------------------------

Date: 03 Aug 2001 19:40:49 GMT
From: Jim Cook <jcook@strobedata.com>
Subject: Re: Setting DOS Variables (Was: Matt Wrights formmail alert)
Message-Id: <3B6AFE41.588B3CC9@strobedata.com>

> However, the true problem is a Perl script being a secondary
> process, a child, unable to access the parent environment space.

That's what this code did. The program ran as a secondary and went
sniffing through memory looking for the parent memory control block.
That has the PSP and the environment segment address of the parent. So,
the snippet I sent does quite successfully modify the parent
environment. Or at least access it -- it's trivial to modify it once
you're there.

The find_mcb routine sets the global mcb variable to be the parent mcb.
The show_psp then accesses the psp off of that to get the environment
segment address.

--
jcook@strobedata.com  Live Honourably  4/1 - 4/3 + 4/5 - 4/7 + . . .
2001 Wed: Feb/last 4/4 6/6 8/8/ 10/10 12/12 9/5 5/9 7/11 11/7 3/14
Strobe Data Inc. home page   http://www.strobedata.com
My home page    O-           http://jcook.net


------------------------------

Date: Fri, 03 Aug 2001 13:50:48 -0700
From: "Godzilla!" <godzilla@stomp.stomp.tokyo>
Subject: Re: Setting DOS Variables (Was: Matt Wrights formmail alert)
Message-Id: <3B6B0EA8.956CE036@stomp.stomp.tokyo>

Jim Cook wrote:

(snipped)
 
> > However, the true problem is a Perl script being a secondary
> > process, a child, unable to access the parent environment space.
 
> That's what this code did. The program ran as a secondary and went
> sniffing through memory looking for the parent memory control block.

"does"

Thank you. I have not examined your code closely.
I simply copied and stored it for future reference
knowing you posted related code. I also made note
of your 16 bit advisory. Win9.x and Win.me both
run 16 bit helper applications. This should work.
However, knowing Bill Gates, it is a shaky should.


Godzilla!
-- 
$_="47З85Х585Б35ЗБ575Ч5Ч5ЕЗУ515БЕ0";
tr/ЕЗЙ873514УХЧЩЫБ/975318642abcdef/;
s/([0-9A-Fa-f]{2})/sprintf("%c",hex($1))/ge;
tr/УФХЦЧШЩЪЫЬБВГДЕЖЗИЙКЛМНОПР/ H-OV-ZP-UA-G/;
print$_=reverse$_;exit;


------------------------------

Date: Fri, 03 Aug 2001 20:26:17 GMT
From: "John W. Krahn" <krahnj@acm.org>
Subject: Re: Splitting a text list into smaller lists?
Message-Id: <3B6B08E8.6517C495@acm.org>

Yves Orton wrote:
> 
> Ok. Well, not to be argumentative :-) , but I dont see how.  If after
> reading the last line of the file @in<50 then the output while will
> never be entered and the output will fail. Consider the following code
> which is basically your routine but pared down (no real file stuff and
> 5 instead of 50)
> 
> #!/usr/bin/perl -w
> use strict;
> my $COUNT=5;
> my @in;
> while ( <DATA> ) {
>     local $, = ",";
>     local $\ = "\n";
>     chomp;
>     push @in, split /4/;
>     while ( @in >= $COUNT ) {           # change 1
>         print splice( @in, 0, $COUNT );
>         redo if eof(DATA) and @in;      # change 2
>     }
> }
> __DATA__
> Name4Name4Name4Name4Name4
> Name4Name4Name4Name4Name4
> NAME4NAME4
> --------------------------------OUTPUT-----------------------------
> Name,Name,Name,Name,Name
> Name,Name,Name,Name,Name
> --------------------------------END OUTPUT-------------------------
> And then if you change the two lines as so:
> change 1:while ( @in >= $COUNT || (my $flush=eof(DATA))) {
> change 2:        last if $flush;
> 
> I believe that you _should_  get the following (I do)
> --------------------------------OUTPUT-----------------------------
> Name,Name,Name,Name,Name
> Name,Name,Name,Name,Name
> NAME,NAME
> --------------------------------END OUTPUT-------------------------
> 
> So my first thought would be to check your test data and see if it
> actually _does_ test this particular problem. :-)

Well, I think the discussion is moot as the OP said that the file has
25000 users. But anyway, this will work no matter what input:


#!/usr/bin/perl -w
use strict;

my $file = 'userfile001';
my @in;
# Read the file of 25000 users from stdin or the command line:
while ( <> ) {
    local $, = local $\ = "\n";
    chomp;
    # Parse the line, using the numeral 4 as the delimiter.
    push @in, split /4/;

    while ( @in >= 50 ) {
OUT:    open OUT, "> $file" or die "Cannot write to $file: $!";
        print OUT splice( @in, 0, 50 );
        $file++;
        close OUT;
        }
    goto OUT if eof() and @in;
    }


:-)))

John
-- 
use Perl;
program
fulfillment


------------------------------

Date: 3 Aug 2001 18:15:18 GMT
From: anno4000@lublin.zrz.tu-berlin.de (Anno Siegel)
Subject: Re: Strange behavior?
Message-Id: <9kepnm$hj8$2@mamenchi.zrz.TU-Berlin.DE>

According to Yves Orton <demerphq@hotmail.com>:
> anno4000@lublin.zrz.tu-berlin.de (Anno Siegel) wrote in message
> news:<9ke97p$58c$3@mamenchi.zrz.TU-Berlin.DE>...
> > According to Roman Khutkyy  <sky@mail.lviv.ua>:
> > > Pardon...... $try{22}
> > 
> > Still nonsense.
> > 
> > Anno
> 
> Actually Anno its not nonsense.  For some reason all of the following
> are the same.  This is a semi-bug in perl apparently that probably
> wont be fixed (or so I have been told) as many scripts would break.
> 
> Try this:
> #!perl
> use strict;
> use warnings;
> my %hash=(key=>'value');
> my $r_hash=\%hash;
> 
> print $hash{key}    ."\n";     # standard 
> print $r_hash->{key}."\n";     # standard reference
> print %hash->{key}  ."\n";     # should not work but does
> __END__
> And you should get:
> ------OUTPUT--------
> value
> value
> value
> ------END OUTPUT----
> as output.

Damn!  I thought I'd tested that.

It's still nonsense :)

Anno


------------------------------

Date: Fri, 3 Aug 2001 14:16:41 -0700
From: Stuart Aaron White <superstu@stanford.edu>
Subject: Re: Subject : randomizing a hash
Message-Id: <Pine.GSO.4.31.0108031410390.6470-100000@epic26.Stanford.EDU>




On Fri, 3 Aug 2001, gnari wrote:

> In article <Pine.GSO.4.31.0108021702570.29193-100000@epic23.Stanford.EDU>,
> Stuart Aaron White  <superstu@stanford.edu> wrote:
> >----- Message Text -----
> >
> >I'm trying to randomize a hash using the rand() function.  I decided that
> >I can't do it that way, so I used the keys() function to get an array of
> >the keys, randomized the keys like rand(@keyarray), and now want to guess
> >the
> >value based on the key.
> >
>
> please show us actual code, not unclear descriptions like your
> 'randomized the keys like rand(@keyarray)'.
> if that meant that you did
>   rand(@keyarray);
> and expected the order of the elements in the array to be shuffled,
> then that is your problem. in that case you should look at
>   perldoc -f rand
>   perldoc -q shuffle
>
> gnari



Will do.  Here is the code:
There's a couple print lines near the bottom for testing, but that's all.
Like I said earlier, I'm attempting to randomize a list, preferably a
hash, so I can offer up a key and ask for the matching value.  Any hdlp
would be appreciated.
Oh, and will i find perldoc -f rand  and perldoc -q shuffle at perl.com?
Thanks, stu



#!/usr/bin/perl -w
srand;

$vocab{"one"} = "uno";
$vocab{"two"} = "dos";
$vocab{"three"} = "tres";
$vocab{"four"} = "cuatro";
$vocab{"five"} = "cinco";
$vocab{"six"} = "seis";
$vocab{"seven"} = "siete";
$vocab{"eight"} = "ocho";
$vocab{"nine"} = "nueve";
$vocab{"ten"} = "diez";

@native = keys(%vocab);         # create an array of native lang keys
$numNative = @native;           # find out the length of the array


print ("@native\n");
$randomKey = "$native[rand($native)]";
chomp ($randomKey);
print "here is a test of the random func.  $randomKey\n";


print "another test:  $randomKey\n";

$randomValue =  $vocab{"$randomKey"};
chomp ($randomValue);
print "test of value matching key: $randomValue\n";
print ("this is the number of keys in the vocab hash: $numNative\n");


print ("now we'll start the game.\n");

for ($i = 0; $i <= $numNative; $i++)
{
    print ("what is the spanish word for $randomKey?\n");
    $guess = <STDIN>;
    if ($guess eq $randomValue)
    {
        print ("You got it!\n");
    }
    else
    {
        print ("That is incorrect.  The correct answer is:
$randomValue\n");
    }
}




------------------------------

Date: 3 Aug 2001 18:47:07 GMT
From: anno4000@lublin.zrz.tu-berlin.de (Anno Siegel)
Subject: Re: Understanding parsers
Message-Id: <9kerjb$hj8$4@mamenchi.zrz.TU-Berlin.DE>

According to David Wall <darkon@one.net>:
> 
> I've been trying to learn how to use modules such as HTML::Parser and 
> Parse::RecDescent, but I think I lack the background to completely 
> understand the documentation.  Should I get a book or two from the 

Parse::RecDescent and HTML::Parser are very different beasts, and
understanding one won't help you much in understanding the other.

> comp.compilers FAQ, or are there other references I should read first?  
> URLs are ok, but I generally prefer books that I can read at leisure.

comp.compilers may offer adequate literature to learn about recursive
descent parsers.  Parse::RecDescent is a parser generator and you
might use it if you wanted to write your own HTML::Parser.

To understand HTML::Parser (as little I know about that module) you
would first have to understand HTML.  It does the parsing for you
(or some say it doesn't really parse HTML, I don't know), but the
results it offers are in terms of HTML.

Anno


------------------------------

Date: Fri, 03 Aug 2001 20:49:31 -0000
From: darkon@one.net (David Wall)
Subject: Re: Understanding parsers
Message-Id: <Xns90F2AADC841CEdarkononenet@207.126.101.97>

anno4000@lublin.zrz.tu-berlin.de (Anno Siegel) wrote on 03 Aug 2001:

> 
> To understand HTML::Parser (as little I know about that module) you
> would first have to understand HTML.  It does the parsing for you
> (or some say it doesn't really parse HTML, I don't know), but the
> results it offers are in terms of HTML.

Ok, I'll read the HTML::Parser docs again, more closely this time.  
Parse::RecDescent can wait a while.

Here's one reason I was asking:  I had an HTML document with a large number 
of nested lists to which I wanted to add named anchors;  it's a procedures 
manual with many internal cross-references.  Detecting the places where 
there was a reference seemed very difficult (maybe impossible), but I knew 
I could save much time by having a program add the named anchors for me, so 
I wrote a program to do that.  The resulting document I've temporarily 
placed at http://w3.one.net/~darkon/temp/hheman-may2001.html (about 280KB).  

The program I used is below.  Keep in mind that I'm not proud of this 
program, but it did perform well enough to save me some time and work, and 
that's really all I needed.  However, it's very ad hoc, I think it's ugly, 
and so would prefer to write something more elegant.  Hence my interest in 
parsers.

If nothing else, maybe M-J Dominus could use it as part of his "Red Flag" 
series on mistakes made by novice programmers.  :-)


#!/usr/bin/perl -w

# I used this program to add named anchors to the HHE procedures manual.
# To get it to work, I ran the HTML through Dave Raggett's HTML Tidy
# utility, which formats the HTML in a way that made it easy for me to
# process it.  In particular, I was able to make the following assumptions
# about the format of the document:
#
# * <ol>, </ol>, <ul>, </ul>, and <li> are always on lines by themselves.
# * The ordered lists I'm interested in always have a <p> in the <li>
# * Certain assumptions about level-two headers (<h2>)

use strict;
use warnings;

my @counters;
$counters[0]= [ 'A'..'Z' ];
$counters[1]= [ 1..100 ];
$counters[2]= [ 'a'..'z' ];
$counters[3]= [ 1..100 ];
$counters[4]= [ qw/i ii iii iv v vi vii viii ix x xi xii xiii xiv xv xvi
                xvii xviii xix xx xxi xxii xxiii xxiv xv/ ];
$counters[5]= [ 'a'..'z' ];
$counters[6]= [ qw/i ii iii iv v vi vii viii ix x xi xii xiii xiv xv xvi
                xvii xviii xix xx xxi xxii xxiii xxiv xv/ ];

my @ol;
my $level = -1;
my $sp = "   ";
my $section = "TESTING";

my $DEBUG=0;
my $TEST=0;

open FILE, "rev2.html" or die "$!";

INPUT: while (<FILE>) {

    if ($DEBUG) { printf "\t\t\t\t\t|%04d|%s", $., $_; }
    else { print; }

    if ( /<h2>Section ([VIX]+)\./ ) {
        $section = $1;
        next INPUT if ( $section eq 'I' || $section eq 'II' 
                        || $section eq 'III' );
        $level = -1;
        next INPUT;
    }

    next INPUT if ! $section; # skip until we get to Roman-numeral sections
    # nothing until we get to Section IV
    next INPUT if ( $section eq 'I' || $section eq 'II' 
                    || $section eq 'III' );
    
    if ( /<ol[^>]+>/ ) {
        finish_styled_ol(0);
        next INPUT;
    }

    if ( /<ul.*?>/ ) {
        # I should use a recursive function for more robustness,
        # but this works fine for this document
        while ( <FILE> ) {  
            if ($DEBUG) { printf "\t\t\t\t\t+%04d+%s", $., $_; }
            else { print; }
            last if /<\/ul>/;
        }
        next INPUT;
    }
    
    if ( /<ol>/ ) {
        $level++;
        if ($DEBUG) { print "Begin <ol> $level\n"; }
        push @ol, -1;
        next INPUT;
    }

    if ( /<li>/ && $level >= 0 ) {
        $ol[ $level ] += 1;
        my $bookmark = $section;
        for (my $i=0; $i <= $level; $i++) {
            $bookmark .= ".$counters[$i][ $ol[$i] ]";
        }
        if ($DEBUG) { print "${bookmark}\n"; }
        my $line = <FILE>;
        if ($TEST) { print "<p><b>$bookmark</b></p>\n"; }
        $line =~ s/<p>([^\s<]+)/<p><a name="$bookmark">$1<\/a>/;
        if ($DEBUG) { printf "\t\t\t\t\t|%04d|%s", $., $line; }
        else { print $line; }
        next INPUT;
    }

    if ( /<\/ol>/ ) {
        pop @ol;
        $level--;
        next INPUT;
    }
}

sub finish_styled_ol {
    my $slevel = shift @_;
    $slevel++;
    LINE: while ( <FILE> ) {
        if ($DEBUG) { printf "\t\t\t\t\t+%04d+%s", $., $_; }
        else { print; }
        if (/<ol>/) {
            finish_styled_ol($slevel);
            next LINE;
        }
        last if /<\/ol>/;
    }
    return;
}

-- 
David Wall
darkon@one.net


------------------------------

Date: 3 Aug 2001 21:52:22 GMT
From: ebohlman@omsdev.com (Eric Bohlman)
Subject: Re: Understanding parsers
Message-Id: <9kf6em$eg2$1@bob.news.rcn.net>

David Wall <darkon@one.net> wrote:
> The program I used is below.  Keep in mind that I'm not proud of this 
> program, but it did perform well enough to save me some time and work, and 
> that's really all I needed.  However, it's very ad hoc, I think it's ugly, 
> and so would prefer to write something more elegant.  Hence my interest in 
> parsers.

> If nothing else, maybe M-J Dominus could use it as part of his "Red Flag" 
> series on mistakes made by novice programmers.  :-)


> #!/usr/bin/perl -w

> # I used this program to add named anchors to the HHE procedures manual.
> # To get it to work, I ran the HTML through Dave Raggett's HTML Tidy
> # utility, which formats the HTML in a way that made it easy for me to
> # process it.  In particular, I was able to make the following assumptions
> # about the format of the document:
> #
> # * <ol>, </ol>, <ul>, </ul>, and <li> are always on lines by themselves.

Using a proper parser would free you of this assumption.

> # * The ordered lists I'm interested in always have a <p> in the <li>

This one, on the other hand, is pretty-much unavoidable; it's the pattern 
you're trying to recognize.

> # * Certain assumptions about level-two headers (<h2>)

> use strict;
> use warnings;

This is actually redundant, since it does the same thing as the -w flag on 
your shebang line.

> my @counters;
> $counters[0]= [ 'A'..'Z' ];
> $counters[1]= [ 1..100 ];
> $counters[2]= [ 'a'..'z' ];
> $counters[3]= [ 1..100 ];
> $counters[4]= [ qw/i ii iii iv v vi vii viii ix x xi xii xiii xiv xv xvi
>                 xvii xviii xix xx xxi xxii xxiii xxiv xv/ ];
> $counters[5]= [ 'a'..'z' ];
> $counters[6]= [ qw/i ii iii iv v vi vii viii ix x xi xii xiii xiv xv xvi
>                 xvii xviii xix xx xxi xxii xxiii xxiv xv/ ];

You could save yourself a bunch of typing (and a bunch of opportunties for
typos) by initializing the array from a list:

my @counters=
 ([ 'A'..'Z' ],
  [ 1..100 ],
  ...
  [ qw/i ii iii iv v vi vii viii ix x xi xii xiii xiv xv xvi
    xvii xviii xix xx xxi xxii xxiii xxiv xv/ ]
 );

> my @ol;
> my $level = -1;
> my $sp = "   ";
> my $section = "TESTING";

> my $DEBUG=0;
> my $TEST=0;

> open FILE, "rev2.html" or die "$!";

> INPUT: while (<FILE>) {

>     if ($DEBUG) { printf "\t\t\t\t\t|%04d|%s", $., $_; }
>     else { print; }

>     if ( /<h2>Section ([VIX]+)\./ ) {
>         $section = $1;
>         next INPUT if ( $section eq 'I' || $section eq 'II' 
>                         || $section eq 'III' );
>         $level = -1;

You could replace the above two lines with:

$level = -1 unless $section =~ /^I{1,3}$/;

>         next INPUT;

>     }

>     next INPUT if ! $section; # skip until we get to Roman-numeral sections

next INPUT unless $section;

would be more Perlish.

>     # nothing until we get to Section IV
>     next INPUT if ( $section eq 'I' || $section eq 'II' 
>                     || $section eq 'III' );

Again, you could use a regex here.

>     
>     if ( /<ol[^>]+>/ ) {
>         finish_styled_ol(0);
>         next INPUT;
>     }

>     if ( /<ul.*?>/ ) {
>         # I should use a recursive function for more robustness,
>         # but this works fine for this document
>         while ( <FILE> ) {  
>             if ($DEBUG) { printf "\t\t\t\t\t+%04d+%s", $., $_; }
>             else { print; }
>             last if /<\/ul>/;
>         }
>         next INPUT;
>     }

I think it would be clearer if you used an if {} elsif {} ... else {} 
structure here rather than a chain of ifs with a next at the end of each 
block.  It makes it immediately clear that only one of the alternatives 
applies each time around the loop.

>     
>     if ( /<ol>/ ) {
>         $level++;
>         if ($DEBUG) { print "Begin <ol> $level\n"; }
>         push @ol, -1;
>         next INPUT;
>     }

>     if ( /<li>/ && $level >= 0 ) {
>         $ol[ $level ] += 1;

++$ol[$level];

>         my $bookmark = $section;
>         for (my $i=0; $i <= $level; $i++) {

foreach my $i (0..$level) {

>             $bookmark .= ".$counters[$i][ $ol[$i] ]";
>         }
>         if ($DEBUG) { print "${bookmark}\n"; }
>         my $line = <FILE>;
>         if ($TEST) { print "<p><b>$bookmark</b></p>\n"; }
>         $line =~ s/<p>([^\s<]+)/<p><a name="$bookmark">$1<\/a>/;
>         if ($DEBUG) { printf "\t\t\t\t\t|%04d|%s", $., $line; }
>         else { print $line; }
>         next INPUT;
>     }

>     if ( /<\/ol>/ ) {
>         pop @ol;
>         $level--;
>         next INPUT;
>     }
> }

> sub finish_styled_ol {
>     my $slevel = shift @_;
>     $slevel++;
>     LINE: while ( <FILE> ) {
>         if ($DEBUG) { printf "\t\t\t\t\t+%04d+%s", $., $_; }
>         else { print; }
>         if (/<ol>/) {
>             finish_styled_ol($slevel);
>             next LINE;
>         }
>         last if /<\/ol>/;
>     }
>     return;
> }



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc.  For subscription or unsubscription requests, send
the single line:

	subscribe perl-users
or:
	unsubscribe perl-users

to almanac@ruby.oce.orst.edu.  

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.

For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V10 Issue 1438
***************************************

home	help	back	first	fref	pref	prev	next	nref	lref	last	post
[19243] in Perl-Users-Digest

Perl-Users Digest, Issue: 1438 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)Fri Aug 3 18:10:34 2001

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Fri Aug 3 18:10:34 2001