[31557] in Perl-Users-Digest

home help back first fref pref prev next nref lref last post

Perl-Users Digest, Issue: 2816 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Feb 13 14:09:24 2010

Date: Sat, 13 Feb 2010 11:09:07 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Sat, 13 Feb 2010     Volume: 11 Number: 2816

Today's topics:
    Re: a defense of ad hoc software development (Robert Maas, http://tinyurl.com/uh3t)
    Re: Is a merge interval function available? sln@netherlands.com
        know-how(-not) about regular expressions <hhr-m@web.de>
    Re: know-how(-not) about regular expressions <peter@makholm.net>
    Re: know-how(-not) about regular expressions <jurgenex@hotmail.com>
    Re: know-how(-not) about regular expressions <hhr-m@web.de>
    Re: know-how(-not) about regular expressions <rvtol+usenet@xs4all.nl>
    Re: know-how(-not) about regular expressions sln@netherlands.com
    Re: know-how(-not) about regular expressions <hhr-m@web.de>
    Re: know-how(-not) about regular expressions sln@netherlands.com
        Login using WWW::Mechanize pastrufazio@gmail.com
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Fri, 12 Feb 2010 23:37:12 -0800
From: seeWebInstead@rem.intarweb.org (Robert Maas, http://tinyurl.com/uh3t)
Subject: Re: a defense of ad hoc software development
Message-Id: <REM-2010feb12-001@Yahoo.Com>

> >> No matter how precise the specifications, programming always involves
> >> a certain amount of exploration -- usually a lot more than anyone had
> >> anticipated.
> > There is one exception where very little experimentation may be
> > needed, namely implementing formal protocol specifications, such as
> > RFC822.
> From: p...@informatimago.com (Pascal J. Bourguignon)
> In theory.  But in practice, you still need to bump against other
> implementations to learn how to treat the cases that are not covered
> by the protocol.

Yeah. I was thinking more of the well established RFCs that have
become de facto technical standards, not poorly written (ambiguous)
early drafts. Indeed RFC means Request For Comments, and with an
early ambiguous draft that should be taken literally, i.e. comments
*should* be offered. (-: "should" in the RFC sense of course. :-)

And just a clarification: I said "very little" rather than "no"
experimentation. With well established de-facto-standard RFCs, I
emphasize "very". With not so established RFCs, omit "very", and
with poorly written RFCs your point wins.

> There are RFC that are specifically describing the
> "best practices", which means that there are a lot of
> implementations implementing "worst practices" or "good-enough
> practices".

I was thinking of the tech-spec RFCs, which say things like return
code 550 means refused for policy reasons, which IMO really ought
to be respected as best as is possible. There's a little bit of
judgement if more than one return code could possibly apply, but
the basic protocol ought to be very rigid regarding syntax of
client requests and syntax of server responses.


------------------------------

Date: Fri, 12 Feb 2010 03:39:06 -0800
From: sln@netherlands.com
Subject: Re: Is a merge interval function available?
Message-Id: <88fan5t2pe3epif0m4ke6aktoj9uccu2s5@4ax.com>

On Wed, 10 Feb 2010 15:16:40 -0800 (PST), Peng Yu <pengyu.ut@gmail.com> wrote:

>I'm wondering there is already a function in perl library that can
>merge intervals. For example, if I have the following intervals ('['
>and ']' means closed interval as in
>http://en.wikipedia.org/wiki/Interval_(mathematics)#Excluding_the_endpoints)
>
>[1, 3]
>[2, 9]
>[10,13]
>[11,12]
>
>I want to get the following merged intervals.
>
>[1,9]
>[10,13]
>
>Could somebody let me know if there is a function in the perl library?

Should be non-recursive, my bad.

-sln

-------------
use strict;
use warnings;

my @sets = (
  [20, 40],
  [1, 3],
  [2, 9],
  [10,13],
  [11,13],
  [1, 2],
  [11,12],
  [1, 5],
  [7, 15],
);

my @merged_sets = mrgInterval(\@sets);

print "\nOriginal sets:\n";
for my $st (@sets) {
    if (@{$st}) {
        printf "  %2s, %2s\n", @{$st};
    }
}
print "\nMerged sets:\n";
for my $st (@merged_sets) {
    printf "  %2s, %2s\n", @{$st};
}
exit (0);    

##
sub mrgInterval
{
    my ($sref,$start) = @_;
    return if (ref($sref) ne 'ARRAY');

    if (!defined $start) {
	if (wantarray) {
            my @tmpsets = map {[@{$_}]} @{$sref};
            $sref = \@tmpsets;
        }
        @{$sref} = sort {$a->[0]<=>$b->[0] || $a->[1]<=>$b->[1]} @{$sref};
        $start = 0;
    }

    while ($start < @{$sref})
    {
        my $last = $sref->[$start];
        ++$start;

        if (@{$last}) {
            for my $ndx ($start .. @{$sref}-1)
            {
                my $cur = $sref->[$ndx];
                next if (!@{$cur});

                if ($cur->[0] >= $last->[0] && $cur->[0] <= $last->[1] )
                {
                    if ($cur->[1] > $last->[1]) {
                        $last->[1] = $cur->[1];
                    }
                    @{$cur} = ();
                }
                else {
                    last;
                }
            }
        }
    }
    if (wantarray) {
        return sort {$a->[0] <=> $b->[0]} map {@{$_} ? $_ : () } @{$sref};
    }
}

__END__

Original sets:
  20, 40
   1,  3
   2,  9
  10, 13
  11, 13
   1,  2
  11, 12
   1,  5
   7, 15

Merged sets:
   1, 15
  20, 40



------------------------------

Date: Fri, 12 Feb 2010 12:40:14 +0100
From: Helmut Richter <hhr-m@web.de>
Subject: know-how(-not) about regular expressions
Message-Id: <Pine.LNX.4.64.1002121213440.4425@lxhri01.lrz.lrz-muenchen.de>

For a seemingly simple problem with regular expressions I tried out several
solutions. One of them seems to be working now, but I would like to learn why
the solutions behave differently. Perl is 5.8.8 on Linux.

The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;
but not within markup. The following code reads and consumes a variable
$inbuf0 and builds up a variable $inbuf with the result.

Solution 1:

while ($inbuf0) {
  $inbuf0 =~ /^(?: # skip initial sequences of
      [^<\&#\$\\]+   # harmless characters
    | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>  # start tags
    | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>  # end tags
    | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));  # entity or character references
    | <!--(?:.|\n)*?-->  # comments
    | <[?](?:.|\n)*?[?]> #  processing instructions, etc.
    )*/x;
  $inbuf .= $&;
  $inbuf0 = $';
  if ($inbuf0) {
    $inbuf .= '&#' . ord($inbuf0) . ';';
    substr ($inbuf0, 0, 1) = '';
    $replaced = 1;
  };
}; 

Here the regexp eats up the maximal initial string (note the * at the end of
the regexp) that needs not be processed and then processes the first character
of the remainder.

In this version, it sometimes works and sometimes blows up with segmentation
fault.

Another version has * instead of + at the "harmless characters". That one does
not try all alternatives as the first one matches always, that is, the * at
the end of the regexp is not used in this case.

Yet another version has nothing instead of + at the "harmless characters";
thus eating zero or one character per iteration of the final *. This should
have the same net effect, but it always blows up with segmentation fault.


Solution 2:

while ($inbuf0) {
  if ($inbuf0 =~ /^# skip initial 
      [^<\&#\$\\]+   # harmless characters
    | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>  # start tags
    | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>  # end tags
    | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));  # entity or character references
    | <!--(?:.|\n)*?-->  # comments
    | <[?](?:.|\n)*?[?]> #  processing instructions, etc.
    /x) {
    $inbuf .= $&;
    $inbuf0 = $';
  } else {
    $inbuf .= '&#' . ord($inbuf0) . ';';
    substr ($inbuf0, 0, 1) = '';
    $replaced = 1;
  };
};

Here the regexp eats up an initial string, typically not maximal (note the
absence of * at the end of the regexp), that needs not be processed and, if
nothing has been found, processes the first character of the input.
 
This version runs considerably slower, by a factor of three, but has so far
not yielded segmentation faults. I am using it now.

I am sure there are lots of other ways to do it. With which knowledge 
could I have saved the time of the numerous trial-and-error cycles and 
done it alright from the beginning?

-- 
Helmut Richter


------------------------------

Date: Fri, 12 Feb 2010 13:09:47 +0100
From: Peter Makholm <peter@makholm.net>
Subject: Re: know-how(-not) about regular expressions
Message-Id: <87fx56ac50.fsf@vps1.hacking.dk>

Helmut Richter <hhr-m@web.de> writes:

> For a seemingly simple problem with regular expressions I tried out several
> solutions. One of them seems to be working now, but I would like to learn why
> the solutions behave differently. Perl is 5.8.8 on Linux.

The regexp engine in perl 5.8.8 is implemented by recursion. This is
known to cause segmentation faults on some occasions. See
http://www.nntp.perl.org/group/perl.perl5.porters/2006/05/msg113036.html

Upgrading to perl 5.10 solves this issue by making the regexp engine
iterative instead.

> The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;
> but not within markup. The following code reads and consumes a variable
> $inbuf0 and builds up a variable $inbuf with the result.

Trying to handle XML and HTML correctly by parsing it with regular
expressions isn't recommended at all. I would use some XML parser and
walk through the DOM and change the content of text nodes with the
trivial substitution on each text node.

//Makholm


------------------------------

Date: Fri, 12 Feb 2010 08:03:30 -0800
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: know-how(-not) about regular expressions
Message-Id: <8buan5pv7ucut92dvqro5055uq6qics8b8@4ax.com>

Helmut Richter <hhr-m@web.de> wrote:
>For a seemingly simple problem with regular expressions I tried out several
>solutions. One of them seems to be working now, but I would like to learn why
>the solutions behave differently. Perl is 5.8.8 on Linux.
>
>The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;
>but not within markup. 
[...]

You may want to read up on Chomsky hierarchy. HTML is a not a a regular
language but a context-free language. Therefore it cannot be parsed by a
regular engine.

Granted, Perl's Regular Expressions have extensions that make them
significantly more powerful than a formal regular engine, but they are
still the wrong tool for the job.  Use any standard HTML parser to
dissect your file into its components and then apply your substitution
to those components where you want them applied.

jue


------------------------------

Date: Fri, 12 Feb 2010 17:52:29 +0100
From: Helmut Richter <hhr-m@web.de>
Subject: Re: know-how(-not) about regular expressions
Message-Id: <Pine.LNX.4.64.1002121745000.4425@lxhri01.lrz.lrz-muenchen.de>

On Fri, 12 Feb 2010,  wrote:

> You may want to read up on Chomsky hierarchy. HTML is a not a a regular
> language but a context-free language. Therefore it cannot be parsed by a
> regular engine.

But the distinction of markup and non-markup is. The only parenthesis-like 
structure I have so far found is the nesting of brackets in <!CDATA[ ... ]]>
but this is also regular, as ]]> cannot occur inside.

*If* I were interested in the semantics of the tags, I would probably 
follow the advice given here to use an XML analyser, provided I keep the 
control of what to do when the input is not well-formed XML. Just being 
told "your data is not okay, so cannot do anything with it" would not 
suffice: Even in an environment where the end-user has full control of 
everything, it is not always the best idea to have him fix every error 
before proceeding; sometimes it is better to let errors in the input and 
fix them at a later step.

-- 
Helmut Richter


------------------------------

Date: Fri, 12 Feb 2010 19:41:58 +0100
From: "Dr.Ruud" <rvtol+usenet@xs4all.nl>
Subject: Re: know-how(-not) about regular expressions
Message-Id: <4b75a0f6$0$22933$e4fe514c@news.xs4all.nl>

Helmut Richter wrote:

 > [again parsing the wrong way]

Is there a newsgroup or mailing list that we can refer "them" to?
I am sure that we are well past our monthly share already.

-- 
Ruud


------------------------------

Date: Fri, 12 Feb 2010 15:22:20 -0800
From: sln@netherlands.com
Subject: Re: know-how(-not) about regular expressions
Message-Id: <5embn59c21iemrk3jatc060edbkamimiaq@4ax.com>

On Fri, 12 Feb 2010 12:40:14 +0100, Helmut Richter <hhr-m@web.de> wrote:

>For a seemingly simple problem with regular expressions I tried out several
>solutions. One of them seems to be working now, but I would like to learn why
>the solutions behave differently. Perl is 5.8.8 on Linux.
>
>The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;
>but not within markup.
                 ^^^
I find that odd but I guess you would have to parse out att-val's to do
that.

> The following code reads and consumes a variable
>$inbuf0 and builds up a variable $inbuf with the result.
>
>Solution 1:
>
>while ($inbuf0) {
>  $inbuf0 =~ /^(?: # skip initial sequences of
>      [^<\&#\$\\]+   # harmless characters
>    | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>  # start tags
>    | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>  # end tags
>    | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));  # entity or character references
>    | <!--(?:.|\n)*?-->  # comments
>    | <[?](?:.|\n)*?[?]> #  processing instructions, etc.
>    )*/x;
                 ^^^
This is good, your trying to filter out not only all the markup,
but references as well.
However, some forms are omitted and the ones there are partially in error.
This is no big deal, but the stream has to be partitioned precisely to
extract segments with %100 certainty. This means a little more robust
structure to allow stream realignment in the case of a bad markup.
This is because validation is missing, but you don't care about that, you
just don't want to stop in that case.

But your expression will get you close.

>  $inbuf .= $&;
>  $inbuf0 = $';
>  if ($inbuf0) {
>    $inbuf .= '&#' . ord($inbuf0) . ';';
>    substr ($inbuf0, 0, 1) = '';
>    $replaced = 1;
>  };
>}; 
>
>Here the regexp eats up the maximal initial string (note the * at the end of
>the regexp) that needs not be processed and then processes the first character
>of the remainder.
>
>In this version, it sometimes works and sometimes blows up with segmentation
>fault.

The code above is wrong, you don't check for a sucessful match, *substr* is
going GPF on your ass! (mmm substr(), gpf paradise)

>
>Another version has * instead of + at the "harmless characters". That one does
>not try all alternatives as the first one matches always, that is, the * at
>the end of the regexp is not used in this case.
>
>Yet another version has nothing instead of + at the "harmless characters";
>thus eating zero or one character per iteration of the final *. This should
>have the same net effect, but it always blows up with segmentation fault.
>
>
>Solution 2:
>
>while ($inbuf0) {
>  if ($inbuf0 =~ /^# skip initial 
>      [^<\&#\$\\]+   # harmless characters
>    | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>  # start tags
>    | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>  # end tags
>    | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));  # entity or character references
>    | <!--(?:.|\n)*?-->  # comments
>    | <[?](?:.|\n)*?[?]> #  processing instructions, etc.
>    /x) {
>    $inbuf .= $&;
>    $inbuf0 = $';
>  } else {
>    $inbuf .= '&#' . ord($inbuf0) . ';';
>    substr ($inbuf0, 0, 1) = '';
>    $replaced = 1;
>  };
>};
>

Yea, this is better. Slow but maybe try to reduce copying with a while(/.../g) type
of thing. I'm not sure but it looks like the regex is being initialized every time
through the loop.
Also, a while(length $inbuf0) might be more readable.

>Here the regexp eats up an initial string, typically not maximal (note the
>absence of * at the end of the regexp), that needs not be processed and, if
>nothing has been found, processes the first character of the input.
> 
>This version runs considerably slower, by a factor of three, but has so far
>not yielded segmentation faults. I am using it now.
>
>I am sure there are lots of other ways to do it. With which knowledge 
>could I have saved the time of the numerous trial-and-error cycles and 
>done it alright from the beginning?

I've pieced some code together that may help you on this.
The ordering (alternation) of the markup forms are very important,
take note of them.
Especially - CDATA before comments and finally *content* must always,
always be last.

The ordering shown below is absolutely crucial to
correctly partition markup!
The biggest mistake people make is trying to parse out a sub-form.
It just can't be done. The entire set of forms (and in order)
are necessary to get even one little piece of encapsulated data.

I didn't bench the code, its probably fairly quick.
One thing I can say is that it will work on any markup given
the included forms. Remember its not validating and quitting on
error, but it does re-align the stream and continue.

Let me know how it works in your case (errors, inconsistent, etcc).
Good luck!

-sln

# add_refs_to_content.pl
# - sln, 2/2010
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# - Util to create general reference's 
# from a character class. Does content only.
# Can do attribute values with a little more
# cut and paste .. not needed.
# -------------------------------------------
use strict;
use warnings;

my (
  $Name,
  $Rxmarkup,
  $Rxent
);
Initregex();

##
my $html = join '', <DATA>;

my $newhtml = ParseAndMakeEnt(\$html);
if ($$newhtml ne $html) {
	print "\nFixed markup:\n",$$newhtml,"\n";
}
else {
	print "\nNothing to fix!\n";
}
exit (0);


##
sub ParseAndMakeEnt
{
	my ($markup) = @_;
	my (
	  $MarkupNew,
	  $content,
	  $lcbpos,
	  $last_content_pos,
	  $begin_pos
	) = ('','',0,0,0);

	## parse loop
	while ($$markup =~ /$Rxmarkup/g)
	{
		## handle content buffer
		  if (defined $1) {
			## speed it up
			$content .= $1;
			if (length $2)
			{
				if ($lcbpos == pos($$markup))	{
					$content .= $2;
				} else {
					$lcbpos = pos($$markup);
					pos($$markup) = $lcbpos - 1;
				}
			}
			$last_content_pos = pos($$markup);
			next;
		  }
		## content here ... take it off
		  if (length $content)
		  {
			$begin_pos = $last_content_pos;
			## check '<'
			if ($content =~ /</) {
				## markup  in content
				print "Markup '<' in content, da stuff is crap!\n";
			}
			$MarkupNew .= ${_entconv(\$content)};
			$content = '';
		  }
		## markup here ... take it off 
		  $MarkupNew .= substr ($$markup, $begin_pos, pos($$markup)-$begin_pos);
		$begin_pos = pos($$markup);

	} ## end parse loop

	## check for leftover content
	if (length $content)
	{
		## check '<'
		if ($content =~ /</) {
			## markup  in content
			print "Markup '<' in left over content, da stuff is crap!\n";
		}
		$MarkupNew .= ${_entconv(\$content)};
	}
	return \$MarkupNew;
}

sub _entconv
{
	my ($strref) = @_;
	my ($buf,$lbufpos) = ('',0);

	while ($$strref =~ /$Rxent/g) {
		if (defined $3) {
			$buf .= $3;
			if (length $4) {
				if ($lbufpos == pos($$strref)) {
					$buf .= $4;
				} else {
					$lbufpos = pos($$strref);
					pos($$strref) = $lbufpos - 1;
				}
			}
			next;
		}
		if (defined $2) {
			$buf .= '&#'.ord($2).';';
		}
		if (defined $1) {
			$buf .= $1;
		}
	}
	return \$buf;
}

sub Initregex
{
  my @UC_Nstart = (
    "\\x{C0}-\\x{D6}",
    "\\x{D8}-\\x{F6}",
    "\\x{F8}-\\x{2FF}",
    "\\x{370}-\\x{37D}",
    "\\x{37F}-\\x{1FFF}",
    "\\x{200C}-\\x{200D}",
    "\\x{2070}-\\x{218F}",
    "\\x{2C00}-\\x{2FEF}",
    "\\x{3001}-\\x{D7FF}",
    "\\x{F900}-\\x{FDCF}",
    "\\x{FDF0}-\\x{FFFD}",
    "\\x{10000}-\\x{EFFFF}",
  ); 
  my @UC_Nchar = (
    "\\x{B7}",
    "\\x{0300}-\\x{036F}",
    "\\x{203F}-\\x{2040}",
  );
  my $Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
  my $Nchar = "[\\w:.".join ('',@UC_Nchar).join ('',@UC_Nstart)."-]";
  $Name  = "(?:$Nstrt$Nchar*)";

  $Rxmarkup = qr/
  (?:
    <
    (?:
        (?: \/* $Name \s* \/*)
       |(?: $Name (?:\s+(?:".*?"|'.*?'|[^>]*?)+) \s* \/?)
       |(?: \?.*?\?)
       |(?:
          !
          (?:
              (?: DOCTYPE.*?)
             |(?: \[CDATA\[.*?\]\])
             |(?: --.*?--)
             |(?: \[[A-Z][A-Z\ ]*\[.*?\]\]) # who knows?
             |(?: ATTLIST.*?)
             |(?: ENTITY.*?)
             |(?: ELEMENT.*?)
               # add more if necessary
          )
       )
    )
    >
  ) | ([^<]*)(<?)/xs;

  my $Refchars = quotemeta('#$\\');  # These are the char's to make references from
  $Rxent = qr/
     ([&%](?:$Name|\#(?:[0-9]+|x[0-9a-fA-F]+));)
    |([$Refchars])
    |([^&%$Refchars]*)([&%]?)
  /x;
}

__DATA__

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 # $ \ Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Diso-8859-1">
<META content=3D "MSHTML 6.00.2900.3395" name=3DGENERATOR>

<STYLE></STYLE>
<test name = " thi<s # $ \ is a " test>
</HEAD>
<BODY bgColor=3D#ffffff>

  should fix these: # $ \
  but not these:    &#21; &#xAF;
  fix some here:    &&%#$  &as; &&#a0

  <IMG SRC = "foo.gif" ALT = "A > B">
  <IMG SRC = "foo.gif"
       ALT = "A > # $ \ B">
  <!-- <A comment # $ \ > -->
  <NN & a # $ \>
  <AA &  # $ \>

  <# Just data #>

  <![INCLUDE CDATA [ >>>>>\\ # $ \ >>>>>>> ]]>

  <!-- This section commented out.
     <B>You can't # $ \ see me!</B>
   -->

at root # $ \ > # $ \ level




------------------------------

Date: Sat, 13 Feb 2010 09:58:20 +0100
From: Helmut Richter <hhr-m@web.de>
Subject: Re: know-how(-not) about regular expressions
Message-Id: <Pine.LNX.4.64.1002130958060.4135@lxhri01.lrz.lrz-muenchen.de>

On Fri, 12 Feb 2010, Dr.Ruud wrote:

> Date: Fri, 12 Feb 2010 19:41:58 +0100
> From: Dr.Ruud <rvtol+usenet@xs4all.nl>
> Newsgroups: comp.lang.perl.misc
> Subject: Re: know-how(-not) about regular expressions
> 
> Helmut Richter wrote:
> 
> > [again parsing the wrong way]
> 
> Is there a newsgroup or mailing list that we can refer "them" to?
> I am sure that we are well past our monthly share already.
> 
> -- 
> Ruud
> 

-- 
Helmut Richter


------------------------------

Date: Sat, 13 Feb 2010 09:15:02 -0800
From: sln@netherlands.com
Subject: Re: know-how(-not) about regular expressions
Message-Id: <uomdn5dfui7bqur62tlmucsdrhitj4ml5e@4ax.com>

On Fri, 12 Feb 2010 15:22:20 -0800, sln@netherlands.com wrote:

>On Fri, 12 Feb 2010 12:40:14 +0100, Helmut Richter <hhr-m@web.de> wrote:
>
[snip]

>Yea, this is better. Slow but maybe try to reduce copying with a while(/.../g) type
>of thing.

[snip]

>>This version runs considerably slower, by a factor of three

[snip]

>I didn't bench the code, its probably fairly quick.

[snip]

I did bench the code on a 7 mbyte file  'mscore.xml'.
What really makes it slow on large files is the constant
"appending" to a variable. Its roughly 2 times + slower doing
it this way.

The fastest way to do it, is to write it to the disk as you
get it. Pass in a filehandle, or some other method.

Perl  would have to spend all its time on realloc() because
of all the appending.

-sln

>
># add_refs_to_content.pl
># - sln, 2/2010
># $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
># - Util to create general reference's 
># from a character class. Does content only.
># Can do attribute values with a little more
># cut and paste .. not needed.
># -------------------------------------------
>use strict;
>use warnings;
>
>my (
>  $Name,
>  $Rxmarkup,
>  $Rxent
>);
>Initregex();
>

 ...
>sub ParseAndMakeEnt
>{
  ...
>	while ($$markup =~ /$Rxmarkup/g)
>	{
>		## handle content buffer
>		  if (defined $1) {
>			## speed it up
>			$content .= $1;
>			if (length $2)
>			{
>				if ($lcbpos == pos($$markup))	{
>					$content .= $2;
>				} else {
>					$lcbpos = pos($$markup);
>					pos($$markup) = $lcbpos - 1;
>				}
>			}
>			$last_content_pos = pos($$markup);
>			next;
>		  }
>		## content here ... take it off
>		  if (length $content)
>		  {
>			$begin_pos = $last_content_pos;
>			## check '<'
>			if ($content =~ /</) {
>				## markup  in content
>				print "Markup '<' in content, da stuff is crap!\n";
>			}
>			$MarkupNew .= ${_entconv(\$content)};
                           ^^^^^^^^^^^^^^^^^^^^^^^^
 ->>   do this instead:     print $fh ${_entconv(\$content)};

>			$content = '';
>		  }
>		## markup here ... take it off 
>		  $MarkupNew .= substr ($$markup, $begin_pos, pos($$markup)-$begin_pos);
                           ^^^^^^^^^^^^^^^^^^^^^^^^
->>                  print $fh substr ($$markup, $begin_pos, pos($$markup)-$begin_pos);

>		$begin_pos = pos($$markup);
>
>	} ## end parse loop
>
>	## check for leftover content
>	if (length $content)
>	{
>		## check '<'
>		if ($content =~ /</) {
>			## markup  in content
>			print "Markup '<' in left over content, da stuff is crap!\n";
>		}
>		$MarkupNew .= ${_entconv(\$content)};
                     ^^^^^^^^^^^^^^^^^^^^^^^^
->>                 print $fh ${_entconv(\$content)};



------------------------------

Date: 13 Feb 2010 19:08:24 GMT
From: pastrufazio@gmail.com
Subject: Login using WWW::Mechanize
Message-Id: <4b76f8a8$0$1125$4fafbaef@reader3.news.tin.it>

Hi all,

I'm pretty new to the perl language, so please excuse me in advance for 
my lacking skill!

SCENARIO: I want to download some pdf from a website using my account 
using login/password. I found WWW::Mechanize that apparently is exactly 
what I need.

The usage example is clear:

    use WWW::Mechanize;
    use CGI;

    my $cgi = CGI->new();
    my $form = $cgi->Vars;

    my $agent = WWW::Mechanize->new();

    $agent->get('http://www.livejournal.com/login.bml');
    $agent->form_number('3');
    $agent->field('user',$form->{user});
    $agent->field('password',$form->{password});
    $agent->submit();



How have I to set username and password in the code above if the source 
of my web page like the following?

Thank you very much in advance.


  
<form name="form1" method="post" action="login.aspx" 
onsubmit="javascript:return WebForm_OnSubmit();" id="form1">

[cut]

<tr>
<td align="right"><label for="Login1_UserName">Email:</label></
td><td><input name="Login1$UserName" type="text" id="Login1_UserName" /
><span id="Login1_UserNameRequired" title="è richiesto uno username" 
style="color:Red;visibility:hidden;">*</span>
</td>
</tr>

<tr>
<td align="right"><label for="Login1_Password">Password:</label></
td><td><input name="Login1$Password" type="password" 
id="Login1_Password" /><span id="Login1_PasswordRequired" title="A 
password is needed" style="color:Red;visibility:hidden;">*</span>
</td>
</tr>

<tr>
<td align="right" colspan="2"><input type="submit" name="Login1
$LoginButton" value="Enter" 
onclick="javascript:WebForm_DoPostBackWithOptions(new 
WebForm_PostBackOptions(&quot;Login1$LoginButton&quot;, &quot;&quot;, 
true, &quot;Login1&quot;, &quot;&quot;, false, false))" 
id="Login1_LoginButton" />
</td>
</tr>


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 2816
***************************************


home help back first fref pref prev next nref lref last post