[28321] in Perl-Users-Digest
Perl-Users Digest, Issue: 9685 Volume: 10
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Sep 5 03:10:20 2006
Date: Tue, 5 Sep 2006 00:10:08 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Tue, 5 Sep 2006 Volume: 10 Number: 9685
Today's topics:
Re: Regexp slowdown <someone@example.com>
Re: Regexp slowdown chris.ritchie@gmail.com
Re: Regexp slowdown <1usa@llenroc.ude.invalid>
Re: Regexp slowdown chris.ritchie@gmail.com
Re: Regexp slowdown <uri@stemsystems.com>
Re: Regexp slowdown chris.ritchie@gmail.com
Re: Regexp slowdown <uri@stemsystems.com>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Tue, 05 Sep 2006 01:21:13 GMT
From: "John W. Krahn" <someone@example.com>
Subject: Re: Regexp slowdown
Message-Id: <di4Lg.3806$0k7.3413@clgrps13>
chris.ritchie@gmail.com wrote:
> I have a subroutine that does a rather extensive expression match (the
> expression is ~80 characters long). It's looking for this expression
> in strings limited to 500 characters.
>
> The script runs just fine on many input strings, then slows
> considerably without any discernable cause. My debug output looks
> like:
>
> Starting regexp match (string size 380)...
> <~5 seconds>
> Done.
>
> Starting regexp match (string size 381)...
> <~9 seconds>
> Done.
>
> Starting regexp match (string size 382)...
> <~14 seconds>
> Done.
>
> I might think that the string size is too large for the expression
> (though there's no reason it should be), but it matches previous
> strings of size close to 500 with ease. Also, when I limit the size to
> 100, it will have the same slow down around 70 or 80.
Without seeing the actual regular expression it is pretty hard to diagnose the
problem but Perl's regular expressions use backtracking which can cause loops
that take a very long time to run. See the "Backtracking" section of perlre:
perldoc perlre
particularly the WARNING at the end of the section.
> I would wonder about memory management because the entire script is a
> few thousand lines long, but there aren't many global variables. And
> it does not slow down elsewhere. The code is almost all subroutines
> with 'my' variables. I assume these variables are deallocated when the
> subroutine is done, please correct me if I'm wrong.
>
> What else could it be?
This page may provide some hints: http://www.ccl4.org/~nick/P/Fast_Enough/
Also the book "Mastering Regular Expressions"
http://www.oreilly.com/catalog/regex3/ explains in detail how to optimize
regular expressions.
John
--
use Perl;
program
fulfillment
------------------------------
Date: 4 Sep 2006 18:40:03 -0700
From: chris.ritchie@gmail.com
Subject: Re: Regexp slowdown
Message-Id: <1157420403.105549.257230@i42g2000cwa.googlegroups.com>
Thanks, I'll check out those docs. I may have narrowed it down, though
I don't know how to solve it.
Calling `ps -ly` tells me that the script never exceeds 6 megs of
memory. But the cpu/cpu% increases in step with the slowdown (from
below 10 to the mid-20s). Scheduling priority is at or around 99 for
most of the runtime.
Is this that the expression match becomes more intensive or the OS
stops caring about my process? I wonder if it's the latter because
like I said, when I limit the string to 100 characters the same
slowdown occurs.
The only way I would think it's the complexity of the expression is if
there's some suboptimal backtracking going on. I'll look into this...
John W. Krahn wrote:
> chris.ritchie@gmail.com wrote:
> > I have a subroutine that does a rather extensive expression match (the
> > expression is ~80 characters long). It's looking for this expression
> > in strings limited to 500 characters.
> >
> > The script runs just fine on many input strings, then slows
> > considerably without any discernable cause. My debug output looks
> > like:
> >
> > Starting regexp match (string size 380)...
> > <~5 seconds>
> > Done.
> >
> > Starting regexp match (string size 381)...
> > <~9 seconds>
> > Done.
> >
> > Starting regexp match (string size 382)...
> > <~14 seconds>
> > Done.
> >
> > I might think that the string size is too large for the expression
> > (though there's no reason it should be), but it matches previous
> > strings of size close to 500 with ease. Also, when I limit the size to
> > 100, it will have the same slow down around 70 or 80.
>
> Without seeing the actual regular expression it is pretty hard to diagnose the
> problem but Perl's regular expressions use backtracking which can cause loops
> that take a very long time to run. See the "Backtracking" section of perlre:
>
> perldoc perlre
>
> particularly the WARNING at the end of the section.
>
>
> > I would wonder about memory management because the entire script is a
> > few thousand lines long, but there aren't many global variables. And
> > it does not slow down elsewhere. The code is almost all subroutines
> > with 'my' variables. I assume these variables are deallocated when the
> > subroutine is done, please correct me if I'm wrong.
> >
> > What else could it be?
>
> This page may provide some hints: http://www.ccl4.org/~nick/P/Fast_Enough/
>
> Also the book "Mastering Regular Expressions"
> http://www.oreilly.com/catalog/regex3/ explains in detail how to optimize
> regular expressions.
>
>
>
> John
> --
> use Perl;
> program
> fulfillment
------------------------------
Date: Tue, 05 Sep 2006 02:32:44 GMT
From: "A. Sinan Unur" <1usa@llenroc.ude.invalid>
Subject: Re: Regexp slowdown
Message-Id: <Xns9834E5597AC35asu1cornelledu@127.0.0.1>
chris.ritchie@gmail.com wrote in
news:1157420403.105549.257230@i42g2000cwa.googlegroups.com:
[ Please do not top-post ]
> Thanks, I'll check out those docs. I may have narrowed it down,
> though I don't know how to solve it.
Well, neither do we, as you refuse to show this regex. Is it top secret
or something.
Please read the posting guidelines for this group to learn how you can
help yourself and help others help you.
> Is this that the expression match becomes more intensive or the OS
> stops caring about my process?
It is easy to write a stupid regex that will just waste time doing
unnecessary things.
From perldoc perlretut:
Most of the time, all this moving forward and backtracking happens
quickly and searching is fast. There are some pathological regexps,
however, whose execution time exponentially grows with the size of the
string. A typical structure that blows up in your face is of the form
/(a|b+)*/;
The problem is the nested indeterminate quantifiers. There are many
different ways of partitioning a string of length n between the "+"
and "*": one repetition with "b+" of length n, two repetitions with
the first "b+" length k and the second with length n-k, m repetitions
whose ...
[ TOFU snipped ]
Sinan
--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
------------------------------
Date: 4 Sep 2006 20:30:53 -0700
From: chris.ritchie@gmail.com
Subject: Re: Regexp slowdown
Message-Id: <1157427053.892057.81570@74g2000cwt.googlegroups.com>
> [ Please do not top-post ]
As far as I know, this only applies when addressing parts of someone's
post. Aside from the 'thanks I'll check it out', my last post was an
addendum to the first post. Another piece of etiquette is reading all
relevant material before making a comment/criticism :-\
> Well, neither do we, as you refuse to show this regex. Is it top secret
> or something.
Unfortunately I cannot. The interpretation of regexes is pretty
straightforward. I have no problem analyzing the expression for
correctness and efficiency.
Though I was alarmed by John's mention of backtracking- I was afraid he
meant backtracking across previous expression matches or some
memory-intensive attempt at interpreter optimization by the perl folks.
After reading his suggested information, I found this is not the case.
What I am unfamiliar with is Perl memory management and Unix scheduling
- in practice. That was the direction of my second post.
> Please read the posting guidelines for this group to learn how you can
> help yourself and help others help you.
John seemed to understand pretty well. Sadly, you made no attempt to
answer any of my questions. You wrongly complained about my etiquette
and suggested I can't write regular expressions. Now who's wasting
peoples' time?
> It is easy to write a stupid regex that will just waste time doing
> unnecessary things.
If this were the case, it wouldn't have quickly matched to the hundreds
of string inputs I fed it before it started slowing down. Granted I
could have hit a special case that fools the regexp, this is unlikely
considering the number of inputs it works perfectly on.
------------------------------
Date: Mon, 04 Sep 2006 23:53:09 -0400
From: Uri Guttman <uri@stemsystems.com>
Subject: Re: Regexp slowdown
Message-Id: <x7y7sz6kbe.fsf@mail.sysarch.com>
>>>>> "cr" == chris ritchie <chris.ritchie@gmail.com> writes:
>> [ Please do not top-post ]
cr> As far as I know, this only applies when addressing parts of someone's
cr> post. Aside from the 'thanks I'll check it out', my last post was an
cr> addendum to the first post. Another piece of etiquette is reading all
cr> relevant material before making a comment/criticism :-\
no, it is part of this group's (and the history of usenet's)
etiquette. and the groups rules override any of yours when you come
here. top posting in any form is considered poor netiquette.
>> Well, neither do we, as you refuse to show this regex. Is it top secret
>> or something.
cr> Unfortunately I cannot. The interpretation of regexes is pretty
cr> straightforward. I have no problem analyzing the expression for
cr> correctness and efficiency.
then why are you asking for help on your regex? you probably should
write a book on them given your skills. oops, MRE (3rd ed!) is already
written.
cr> Though I was alarmed by John's mention of backtracking- I was afraid he
cr> meant backtracking across previous expression matches or some
cr> memory-intensive attempt at interpreter optimization by the perl folks.
cr> After reading his suggested information, I found this is not the case.
and your proof of that is?
cr> What I am unfamiliar with is Perl memory management and Unix scheduling
cr> - in practice. That was the direction of my second post.
why would unix scheduling have anything to do with how fast a regex
runs? if you think they are related you definitely need more help than
you can get here.
>> Please read the posting guidelines for this group to learn how you can
>> help yourself and help others help you.
cr> John seemed to understand pretty well. Sadly, you made no attempt to
cr> answer any of my questions. You wrongly complained about my etiquette
cr> and suggested I can't write regular expressions. Now who's wasting
cr> peoples' time?
given your original query with no regex posted, that is a reasonable
assumption. you are to blame for not posting real code and then being
asked for it. but i will predict you will turn on me too. such is the
way of those who wish answers but not to learn.
>> It is easy to write a stupid regex that will just waste time doing
>> unnecessary things.
cr> If this were the case, it wouldn't have quickly matched to the hundreds
cr> of string inputs I fed it before it started slowing down. Granted I
cr> could have hit a special case that fools the regexp, this is unlikely
cr> considering the number of inputs it works perfectly on.
unlikely but possible. and of course without seeing the reges nor the
inputs, no one but you can tell why it slows down. i suspect it is on
line 42 of the regex. remove the + (or is it a []?)
have a fine day,
uri
--
Uri Guttman ------ uri@stemsystems.com -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
------------------------------
Date: 4 Sep 2006 21:25:12 -0700
From: chris.ritchie@gmail.com
Subject: Re: Regexp slowdown
Message-Id: <1157430312.035324.155150@m73g2000cwd.googlegroups.com>
> no, it is part of this group's (and the history of usenet's)
> etiquette. and the groups rules override any of yours when you come
> here. top posting in any form is considered poor netiquette.
Where was I supposed to put my comments? There's not TOFU because I
wasn't replying to anything in his post. Granted, I could have deleted
the cc of the previous posts, but that's not something to get
up-in-arms about. There was no non-linear scanning required.
"Unappreciated followup styles are referred to as ``top-posting'',
``Jeopardy'' (because the answer comes before the question)." That's
the group rule. My post followed that rule as I had no response to
intersperse. I'm sorry I did not delete the cc. Happy?
> then why are you asking for help on your regex?
My last paragraph:
I would wonder about memory management because the entire script is a
few thousand lines long, but there aren't many global variables. And
it does not slow down elsewhere. The code is almost all subroutines
with 'my' variables. I assume these variables are deallocated when the
subroutine is done, please correct me if I'm wrong.
What else could it be?
What of that makes you think I'm worried about the syntax of my regex?
> you probably should
> write a book on them given your skills. oops, MRE (3rd ed!) is already
> written.
Okay.
> cr> Though I was alarmed by John's mention of backtracking- I was afraid he
> cr> meant backtracking across previous expression matches or some
> cr> memory-intensive attempt at interpreter optimization by the perl folks.
> cr> After reading his suggested information, I found this is not the case.
> and your proof of that is?
The 'backtracking' section of the perldoc on regex.
> why would unix scheduling have anything to do with how fast a regex
> runs? if you think they are related you definitely need more help than
> you can get here.
Oh lets see, if my process's quantum is small and priority is low then
it won't get completed very fast, now will it? If you don't understand
the relationship between a process and the os running it you probably
shouldn't be posting on this thread.
> given your original query with no regex posted, that is a reasonable
> assumption. you are to blame for not posting real code and then being
> asked for it.
I DON'T NEED HELP WITH THE SYNTAX OF MY REGEX. I assume shouting is
warranted when you've repeated yourself so many times. I appreciate
John mentioning a possible pitfall based on the particulars of Perl's
matching implementation. But everything I've directly asked has dealt
with the interpreter and/or how it interfaces with UNIX.
> but i will predict you will turn on me too. such is the
> way of those who wish answers but not to learn.
I'm glad you can be so sage and philosophical but you haven't given me
anything to learn. You've just complained that I haven't supplied
information that you don't need. I appreciate the fact that you are
willing to run my code and input rather than give nebulous answers
about the way Perl works. But it's the nebulous answers that I need.
> unlikely but possible. and of course without seeing the reges nor the
> inputs, no one but you can tell why it slows down. i suspect it is on
> line 42 of the regex. remove the + (or is it a []?)
Yes you can. Runtime isn't just a function of the expression. If I
had a mile-long stack frame I would see a performance hit. If Perl had
an ugly garbage collector and all of my previous subroutine variables
hadn't been deallocated I would see a performance hit. If Perl
parallelized regexes and forked many, many times I would see a
performance hit. If any generally intensive script required that the
coder set a lower 'nice' value I would see a performance hit.
This is what I asked for in the beginning, specified in all of my
subsequent posts, and keeps being ignored.
> --Perl Consulting,
Oh! now I understand.
------------------------------
Date: Tue, 05 Sep 2006 01:08:35 -0400
From: Uri Guttman <uri@stemsystems.com>
Subject: Re: Regexp slowdown
Message-Id: <x7u03m7ve4.fsf@mail.sysarch.com>
>>>>> "cr" == chris ritchie <chris.ritchie@gmail.com> writes:
>> no, it is part of this group's (and the history of usenet's)
>> etiquette. and the groups rules override any of yours when you come
>> here. top posting in any form is considered poor netiquette.
cr> Where was I supposed to put my comments? There's not TOFU because I
cr> wasn't replying to anything in his post. Granted, I could have deleted
cr> the cc of the previous posts, but that's not something to get
cr> up-in-arms about. There was no non-linear scanning required.
it doesn't matter. your choice was wrong for this group. you defend that
with the same weak logic that you use for the regex stuff. keep trying.
>> then why are you asking for help on your regex?
cr> My last paragraph:
cr> I would wonder about memory management because the entire script is a
cr> few thousand lines long, but there aren't many global variables. And
cr> it does not slow down elsewhere. The code is almost all subroutines
cr> with 'my' variables. I assume these variables are deallocated when the
cr> subroutine is done, please correct me if I'm wrong.
cr> What else could it be?
a faulty regex? a bug in line 42? shall i whip out the venerable
PSI::ESP module and divine the answer?
cr> What of that makes you think I'm worried about the syntax of my regex?
the syntax may be fine as it compiles (or so you claim). but if you
don't know how to spot the possible issues with a regex you won't
realize that your test data isn't comprehensive enough.
>> you probably should
>> write a book on them given your skills. oops, MRE (3rd ed!) is already
>> written.
cr> Okay.
cr> Though I was alarmed by John's mention of backtracking- I was afraid he
cr> meant backtracking across previous expression matches or some
cr> memory-intensive attempt at interpreter optimization by the perl folks.
cr> After reading his suggested information, I found this is not the case.
>> and your proof of that is?
cr> The 'backtracking' section of the perldoc on regex.
but without showing the regex and the slow data, you haven't proven
anything yet. you seem to keep missing this point. show the code and
data. otherwise you are really wasting all of our time. but of course
you don't mind wasting your own.
>> why would unix scheduling have anything to do with how fast a regex
>> runs? if you think they are related you definitely need more help than
>> you can get here.
cr> Oh lets see, if my process's quantum is small and priority is low then
cr> it won't get completed very fast, now will it? If you don't understand
cr> the relationship between a process and the os running it you probably
cr> shouldn't be posting on this thread.
wow. oh well, i will have to leave you now. this was the stupidest thing
you have said so far. regexes don't massively slow down just because of
some scheduling issues. they are compute bound in almost all cases
unless they have heat death regexes where they can suck the universe
dry. but as usual without any code we will just have to take your word
that all is fine.
>> given your original query with no regex posted, that is a reasonable
>> assumption. you are to blame for not posting real code and then being
>> asked for it.
cr> I DON'T NEED HELP WITH THE SYNTAX OF MY REGEX. I assume shouting is
cr> warranted when you've repeated yourself so many times. I appreciate
cr> John mentioning a possible pitfall based on the particulars of Perl's
cr> matching implementation. But everything I've directly asked has dealt
cr> with the interpreter and/or how it interfaces with UNIX.
well, you know best. track all that down. see if you can figure out why
a data dependent behavior would cause major operating system changes but
not be a case of just a poorly written regex.
and you keep confusing syntax and semantics. your regex syntax is
probably correct as you claim it compiles. (hey i said that
already. maybe you will learn it this time?) on the other hand, the
semantics are surely wrong since they will slow down based on the
data.
>> but i will predict you will turn on me too. such is the
>> way of those who wish answers but not to learn.
cr> I'm glad you can be so sage and philosophical but you haven't given me
cr> anything to learn. You've just complained that I haven't supplied
cr> information that you don't need. I appreciate the fact that you are
cr> willing to run my code and input rather than give nebulous answers
cr> about the way Perl works. But it's the nebulous answers that I need.
info i don't need? the code and the data? what are you babbling about??
nebulous answers are for nebulous questions and neither have a place in
this group. if you think the problem is with unix, then go to a unix
group. admit it, you know more perl than anyone here.
>> unlikely but possible. and of course without seeing the reges nor the
>> inputs, no one but you can tell why it slows down. i suspect it is on
>> line 42 of the regex. remove the + (or is it a []?)
cr> Yes you can. Runtime isn't just a function of the expression. If I
cr> had a mile-long stack frame I would see a performance hit. If Perl had
cr> an ugly garbage collector and all of my previous subroutine variables
cr> hadn't been deallocated I would see a performance hit. If Perl
cr> parallelized regexes and forked many, many times I would see a
cr> performance hit. If any generally intensive script required that the
cr> coder set a lower 'nice' value I would see a performance hit.
wow x 10.
cr> This is what I asked for in the beginning, specified in all of my
cr> subsequent posts, and keeps being ignored.
>> --Perl Consulting,
cr> Oh! now I understand.
no you don't. you won't. try banging your head against your
keyboard. you will be nebulous soon.
too bad i won't be consulting for you in the future. but you can't
handle the truth. :)
bye bye.
uri
--
Uri Guttman ------ uri@stemsystems.com -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc. For subscription or unsubscription requests, send
#the single line:
#
# subscribe perl-users
#or:
# unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.
NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V10 Issue 9685
***************************************