[16594] in Perl-Users-Digest
Perl-Users Digest, Issue: 4006 Volume: 9
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Mon Aug 14 03:05:31 2000
Date: Mon, 14 Aug 2000 00:05:15 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Message-Id: <966236714-v9-i4006@ruby.oce.orst.edu>
Content-Type: text
Perl-Users Digest Mon, 14 Aug 2000 Volume: 9 Number: 4006
Today's topics:
Re: CHOMP not working <bean@agentkhaki.com>
Re: DBM Files Going Haywire!!! HELP!!! <bwalton@rochester.rr.com>
dr watson <haggi@tappe.net>
Re: GET HEALTHY 3249 <x@x.com>
grep() optimisation (fvw)
Re: grep() optimisation (Thorfinn)
Re: grep() optimisation (fvw)
Re: grep() optimisation (Logan Shaw)
Re: grep() optimisation (Logan Shaw)
Re: grep() optimisation (Andrew Johnson)
Re: grep() optimisation (fvw)
Re: grep() optimisation (fvw)
Re: grep() optimisation <uri@sysarch.com>
Re: matching starting html code <godzilla@stomp.stomp.tokyo>
Re: Negativity in Newsgroup -- Solution (Steve Leibel)
Re: Pattern Matching Question <bwalton@rochester.rr.com>
Re: Pattern Matching Question (Abigail)
Pb with regular expression plz help ghorghor@my-deja.com
Perl for Palm? <mnysurf@home.comREMOVE>
Re: Perl script? (Abigail)
Re: pipes <jhijas@yahoo.es>
Re: pipes <jhijas@yahoo.es>
Re: Procmail vs Perl. <juex@my-deja.com>
Re: rmdir works with Xitami/w95 but not with MS-IIS/NT <x@x.com>
Re: Searching for errant modules (Abigail)
Re: Setting file user and group ids (Abigail)
Re: tag parsing. johnvert@my-deja.com
Re: utime Function Not Working <lr@hpl.hp.com>
Digest Administrivia (Last modified: 16 Sep 99) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: 14 Aug 2000 04:24:17 GMT
From: bean <bean@agentkhaki.com>
Subject: Re: CHOMP not working
Message-Id: <MPG.14013f3bb6b48ea6989696@news.concentric.net>
In article <399761FE.DACA6557@mail.com>, vrillusions@mail.com says...
> Here is the source in question:
Maybe I missed something, but can you post your Perl source and not what
it outputs?
bean
------------------------------
Date: Mon, 14 Aug 2000 04:21:42 GMT
From: Bob Walton <bwalton@rochester.rr.com>
Subject: Re: DBM Files Going Haywire!!! HELP!!!
Message-Id: <39977405.77293129@rochester.rr.com>
mlmguy@my-deja.com wrote:
...
> I am new at Perl. I have created an FFA Links Page with 50 Links
> totally. I have tried storing the data in a text file but I dont like
> the results. I then tried to use DBM files using the following code. If
> you notice I have used a 'different' way of locking the process. The
> reason being I dont know any other way of locking. Anyway, now the
> logic of the new link being added on the top and the bottom one being
> deleted is working perfectly. However, strangely, my DBM data file
> keeps growing. It grew to a massive 4.5 MB in one day. But if I try to
> display the total number of records in it, it still shows 50. So what
> is the extra space holding??? Please HELP someone out there. Thanks in
> advance. Here is the code which I am using.
>
> #Start of Code
>
> $linkfilelock = "+< linkfilelock.cgi";
>
> open LINKLOCK, $linkfilelock;
You forgot the "or die..." on the above open.
> unless (flock (LINKLOCK, 2)) {
> print "Can't Lock! the linkfilelock.cgi\n";
> die "I am Dead!";
> }
>
> $link_database = "linkfile.dbm";
> %link;
>
> dbmopen (%link, $link_database, 0700) || die "cant open DBM";
>
> #tie %link, "Tie::IxHash";
>
> @values;
>
> for ($i=0;$i<50;$i++) {
> $values[$i] = $link{$i+1};
> }
>
> $newlink = "$title;$url;$email;$remote_addr";
>
> unshift @values, $newlink;
> $oldlink = pop @values;
>
> for ($i=0;$i<50;$i++) {
> $link{$i+1} = $values[$i];
> }
>
> dbmclose %link;
>
> close LINKLOCK;
>
> #End of Code
>
> Thanks again,
> mlmguy
...
Hmmmm...I don't see anything drastically wrong. You are not taking
advantage of the tied hash as well as you could for what you are doing,
but the above doesn't look like it should cause difficulties. You don't
say what your platform is, or which DBM-style database your copy of Perl
was compiled for as default (that is, which DBM-style database will be
used by dbmopen). I have used tied hashes extensively with NDBM on
Solaris, SDBM on NT with NTFS file system (note that SDBM doesn't work
on FAT or FAT32 file systems) and DB_File on Windoze 9x and NT, all with
zero problems, while using schemes that make what you are doing look
like nothing. You should check to be sure you aren't violating the
constraints of your DBM-style file (some DBM-style implementations have
length limitations on keys and values, for example -- although it looks
like it is unlikely you are reaching those limits). If the data you are
storing is coming from users, bear in mind that without warning, you
could receive humongo URLs etc.
You are basically keeping a set of 50 fixed keys, and overwriting the
values of all 50 keys many many times. While that is not what DBM-style
files are really all about, it shouldn't be a problem.
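For reference, here is a minimal sketch of the same "newest link on
top, oldest falls off" scheme with the open checked, flock taken with
the Fcntl constants, and tie used in place of dbmopen. It is an
illustration only: SDBM_File is an assumption (picked because it ships
with Perl; it has a limit of roughly 1 KB per key/value pair), and the
add_link name, the file names and the temporary directory are all
hypothetical. It does not by itself explain the file growth.

```perl
use strict;
use warnings;
use Fcntl qw(:flock O_RDWR O_CREAT);
use SDBM_File;
use File::Temp qw(tempdir);

# Hypothetical scratch directory so the sketch does not touch real data.
my $dir = tempdir(CLEANUP => 1);

# add_link is a hypothetical helper: put $newlink in slot 1, shift the
# rest down, and keep at most $max entries.
sub add_link {
    my ($dir, $newlink, $max) = @_;

    # Check the open, then take an exclusive lock for the whole update.
    open(my $lock, '>>', "$dir/linkfile.lock")
        or die "Can't open lock file: $!";
    flock($lock, LOCK_EX) or die "Can't lock: $!";

    tie(my %link, 'SDBM_File', "$dir/linkfile", O_RDWR | O_CREAT, 0644)
        or die "Can't tie DBM file: $!";

    # Read the existing entries (slots 1..$max), skipping empty slots.
    my @values = grep { defined } map { $link{$_} } 1 .. $max;
    unshift @values, $newlink;
    splice(@values, $max) if @values > $max;    # drop the oldest

    $link{ $_ + 1 } = $values[$_] for 0 .. $#values;

    untie %link;
    close($lock) or die "Can't close lock file: $!";
    return scalar @values;
}

my $count = add_link($dir, 'Example;http://example.com/;me@example.com;127.0.0.1', 50);
print "entries stored: $count\n";
```

The lock is held from before the tie until after the untie, so two
CGI processes cannot interleave their read and write loops.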
I wrote a quickie that wrote random strings from 1 to 200 characters
long into 50 integer keys (0..49) in a DB_File tied hash. After running
this program 10001 times, the file size was 40960 bytes. OS: Windoze
98 SE, filesystem FAT32, Perl 5.6, ActiveState build 616. You might try
a quickie like that and see if you can generate the humongo file problem
that way. My quickie is:
dbmopen %h,"junk183.db",0666 or die "Couldn't open junk183.db, $!\n";
for(0..49){
$h{$_}=&randstr;
}
dbmclose %h;
sub randstr{
my $n;
my $s;
$n=int(rand(200));
$s='';
for my $i (0..$n){
$s.=chr(int(rand(95))+32);
}
$s;
}
I ran it repeatedly with:
for(0..10000){`perl junk183.pl`}
I looked at the data file with:
dbmopen %h,"junk183.db",0666 or die "Couldn't open junk183.db, $!\n";
for(sort keys %h){
print "h{$_}=$h{$_}\n";
}
dbmclose %h;
Hope that helps.
--
Bob Walton
------------------------------
Date: Mon, 14 Aug 2000 08:52:09 +0200
From: "haggi@work" <haggi@tappe.net>
Subject: dr watson
Message-Id: <39979719.C9C7BDF3@tappe.net>
hi,
I start a shell command with Perl on WinNT (SP5) and check the output,
but sometimes this command produces an error with the "Dr. Watson" window.
Does anyone know a way to catch this window or to prevent NT from popping
it up? It seems that the application is prevented from stopping until someone clicks
the "OK" button.
It would be great if my shell command would simply die or create a core dump (Unix-like).
haggi
--
---------------------------------------------------------
haggi
www.haggi.de
haggi@haggi.de
haggi`s visual effects & animation
---------------------------------------------------------
------------------------------
Date: Mon, 14 Aug 2000 04:19:41 GMT
From: x <x@x.com>
Subject: Re: GET HEALTHY 3249
Message-Id: <=aHLNiTHE5KrXfK5uO9NBjE=4hj6@4ax.com>
Your name is not Jennifer, you are a professional (male) spammer
named John The Ripper trying to promote that useless weight-loss site.
On Sun, 13 Aug 2000 08:54:39 GMT, ggdbfm@yahoo.com wrote:
>I lost a tone of weight using his personalized weight loss program and services directly over the internet. If you need to lose weight quickly and safely check him out. I am so excited about the new me I had to brag on his services.
>Sorry if I posted in the wrong area, I am new at this. If I did it wont happen again
>http://www.onlinefitnes.com
>jenniferpena24@hotmail.com
>
>dkpzipemwljwtpphskhgwdtyjxzrkqsdobxjjimzwrxkwlrgbcnllxcelmsfp
------------------------------
Date: Mon, 14 Aug 2000 05:35:23 GMT
From: fvw+usenet@var.cx (fvw)
Subject: grep() optimisation
Message-Id: <966231197VVC.fvw@var.cx>
According to the FAQ, the best way to make your code run faster
is to optimize the algorithm. However, I can't find any way to
make it run any faster, so I'd appreciate any suggestions the group
might have:
I have an array @a, containing ~70 lines. I also have an array @b,
containing ~50 strings. I want to get an array (@c) containing the
lines of @a that match 6 colons (':'), followed by a string from @b
(case insensitive). I'm currently doing:
$b=join('|', @b);
@c=grep(/([^:]*:){6}.*($b)/i, @a);
but since I have to do this several times, it takes quite some time.
Anybody got any suggestions? TIA.
--
Frank v Waveren
fvw@var.cx
ICQ# 10074100
------------------------------
Date: 14 Aug 2000 05:50:49 GMT
From: thorfinn@netizen.com.au (Thorfinn)
Subject: Re: grep() optimisation
Message-Id: <slrn8pf25p.fro.thorfinn@netizen.com.au>
In comp.lang.perl.misc, on Mon, 14 Aug 2000 05:35:23 GMT
fvw <fvw+usenet@var.cx> wrote:
> According to the FAQ, the best way to make your code run faster
> is optimize the algorithm, However I can't find any way to
> make it run any faster, I'd appreciate any suggestions the group
> might have:
> I have an array @a, containing ~70 lines. I also have a string @b,
> containing ~50 strings. I want to get an array (@c) containing the
> lines of @a that match 6 colons (':'), followed by a string from @b
> (case insensitive). I'm currently doing:
> $b=join('|', @b);
> @c=grep(/([^:]*:){6}.*($b)/i, @a);
Urm... that doesn't match just six colons... but I assume you meant to
say "matches six colons, possibly with other characters in front of
each colon, followed by...".
> but since I have to do this several times, it takes quite some time.
> Anybody got any suggestions? TIA.
If the contents of @b never change, then you should probably put /o on
the end of your RE, so that it compiles it once and remembers it, and
also do the $b assignment somewhere outside of any loop you've got.
If @b *does* change, then there isn't much else you can do.
Oh, and just as an aside, you need to be sure that the strings in @b
don't contain RE metacharacters.
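If the strings in @b are meant to be taken literally, one way to be
sure is to run each through quotemeta before joining. This is only a
sketch: the sample data is made up, and it does not apply if some of
the metacharacters are there on purpose.

```perl
use strict;
use warnings;

# Made-up sample data; '.' and '+' would otherwise be metacharacters.
my @b = ('foo.bar', 'a+b', 'plain');
my @a = (
    '1:2:3:4:5:6:has foo.bar here',
    '1:2:3:4:5:6:has fooXbar here',
    '1:2:3:4:5:6:A+B',
);

# Escape every string, then build the alternation once.
my $alt = join '|', map { quotemeta($_) } @b;
my $re  = qr/^(?:[^:]*:){6}.*(?:$alt)/i;

my @c = grep { /$re/ } @a;
print "$_\n" for @c;
```

Without the quotemeta, the second line would match as well, because
an unescaped "." matches any character.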
Ook,
Thorf
--
David Goh <thorfinn@netizen.com.au> --- http://netizen.com.au/
Internet and Open Source Development, Consulting and Training.
Netizen Pty Ltd, GPO Box 2265U, Melbourne VIC 3000, Australia.
Tel: +61 3 9614 0949 Mob: +61 411 692 516 Fax: +61 3 9614 0948
------------------------------
Date: Mon, 14 Aug 2000 06:15:34 GMT
From: fvw+usenet@var.cx (fvw)
Subject: Re: grep() optimisation
Message-Id: <966233370GGX.fvw@var.cx>
<slrn8pf25p.fro.thorfinn@netizen.com.au> (thorfinn@netizen.com.au):
>> @c=grep(/([^:]*:){6}.*($b)/i, @a);
>
>Urm... that doesn't match just six colons... but I assume you meant to
>say "matches six colons, possibly with other characters in front of
>each colon, followed by...".
Yup, that's what I meant, my bad.
>If the contents of @b never change, then you should probably put /o on
>the end of your RE, so that it compiles it once and remembers it.
Yay, 1 second off, 15 to go. (Don't we all want computation in 0 secs?
:-) ).
>and
>also do the $b assignment somewhere outside of any loop you've got.
Check, I was doing that.
>Oh, and just as an aside, you need to be sure that the strings in @b
>don't contain RE metacharacters.
None that weren't intended.
Thanks!
--
Frank v Waveren
fvw@var.cx
ICQ# 10074100
------------------------------
Date: 14 Aug 2000 01:25:37 -0500
From: logan@cs.utexas.edu (Logan Shaw)
Subject: Re: grep() optimisation
Message-Id: <8n83d1$18p$1@provolone.cs.utexas.edu>
In article <966231197VVC.fvw@var.cx>, fvw <fvw+usenet@var.cx> wrote:
>I have an array @a, containing ~70 lines. I also have a string @b,
>containing ~50 strings. I want to get an array (@c) containing the
>lines of @a that match 6 colons (':'), followed by a string from @b
>(case insensitive). I'm currently doing:
>
>$b=join('|', @b);
>@c=grep(/([^:]*:){6}.*($b)/i, @a);
>
>but since I have to do this several times, it takes quite some time.
If you're doing that whole segment of code several times, then
you'd be better off creating a pre-parsed regular expression so
that the regular expression doesn't have to be parsed every time:
$b = join ('|', @b);
$regex = qr/([^:]*:){6}.*($b)/i;
@c = grep (/$regex/, @a);
However, you're still matching every pattern against every string. If
you have n patterns and m strings, then your execution time will
essentially be proportional to n times m, and that's bad.
Your problem could probably be solved more efficiently by some sort of
text searching algorithm. For example, you might make a list of the number
of occurrences of each ASCII character in each string and pattern. If a
pattern contains more of a certain character than a string does, then
that pattern can't match. Actually, there are many more sophisticated
algorithms than that. One web page that I just ran across seems
to have some good examples of text searching algorithms that might
get the mental juices flowing:
http://www.cs.fit.edu/wds/classes/algorithms/Text/text/text.html
Most text searching algorithms are going to deal with fixed patterns,
so they're not even going to touch on how regular expressions might be
slowing down your code. Your regular expression contains a ".*" right
before a big list of alternatives (separated by pipe symbols); the
regular expression matcher is going to have to try zero of any
character for the ".*" and then try every one of your patterns until it
determines they all fail at that point. Then, it will let the ".*"
match one character and see if that works. If that doesn't do it,
it'll try two, and so on until it runs out of string to match against.
You can see that that will be inefficient since it is trying the entire
list of alternatives at each starting position after every possible
match of the ".*". Actually, this description isn't quite right since
the "*" is greedy -- it'll try the longest string of any characters
first, then successively shorter ones until it gets to an empty string. So,
the order is different in real-world perl, but the concept is the
same.
So, having said that, if there is a way you could try to match the
fixed patterns first and only run the regular expression if the fixed
pattern matches (since the regular expression can't match if a
non-optional fixed subexpression doesn't match), you'd probably avoid
lots of this regular expression inefficiency.
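A rough sketch of that idea, assuming the strings in @b are plain
fixed strings (no intended metacharacters) and using made-up sample
data: a cheap index() check decides whether a line is worth handing
to the regular expression at all.

```perl
use strict;
use warnings;

# Made-up sample data.
my @a = (
    'u1:x:1:1:n:/h:Foo Shell',
    'u2:x:2:2:n:/h:nothing here',
    'bar:x:3:3:n:/h:plain',
);
my @b = qw(foo bar);

my @needles = map { lc($_) } @b;                  # for the cheap check
my $alt     = join '|', map { quotemeta($_) } @b;
my $re      = qr/^(?:[^:]*:){6}.*(?:$alt)/i;      # the full check

my @c = grep {
    my $line      = lc $_;
    my $candidate = 0;
    for my $needle (@needles) {
        # index() is a plain substring search, no regex engine involved.
        if (index($line, $needle) >= 0) { $candidate = 1; last }
    }
    # Only lines containing some fixed string get the regex.
    $candidate && /$re/;
} @a;

print "$_\n" for @c;
```

The third line passes the cheap check ("bar" appears in it) but is
still rejected by the regex, since "bar" comes before the six colons.
Whether this is actually faster depends on the data, so it is worth
timing before adopting.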
Bringing this back to the practical, you might search CPAN to see if
there are any modules that implement fast text searching. If there
are, use one of them to eliminate certain patterns from consideration
as possible matches against each string. Once you have a list of which
patterns cannot match each string, you can run that regular expression
for the patterns that remain for each string, and that should be a lot
less work. (At least, it will be less work unless your patterns are
really short and the strings only have a few characters after the six
colons. In that case, the regular expressions won't be very
inefficient because there aren't very many possible matches for ".*".)
If you don't want to go to all that trouble, consider making your
regular expression more specific if it can. Doing that will actually
make it run faster since the regular expression matcher will have more
information about what won't match and will therefore spend less time
pursuing dead-end possibilities.
Of course, the ultimate would be to extract the exact string that might
or might not match the patterns and then turn the list of patterns into
keys of a hash so that you can just look up potential matches in the
hash instead of doing pattern matching. But that may not be possible
in your case.
Hope that helps.
- Logan
------------------------------
Date: 14 Aug 2000 01:40:50 -0500
From: logan@cs.utexas.edu (Logan Shaw)
Subject: Re: grep() optimisation
Message-Id: <8n849i$1eb$1@provolone.cs.utexas.edu>
In article <966233370GGX.fvw@var.cx>, fvw <fvw+usenet@var.cx> wrote:
>Yay, 1 second off, 15 to go. (Don't we all want computation in 0 secs?
>:-) ).
That's why I've written a special version of sleep()
that accepts negative arguments. Just pass it a -10
and it returns 10 seconds before you called it.
Of course, nobody will ever use it, since temporal anomalies
will lead to buggy programs -- if the function returns 10
seconds before you called it, then the memory you allocated 5
seconds ago to store the result of the function call hasn't
been allocated yet, and that's sure to cause problems.
>>Oh, and just as an aside, you need to be sure that the strings in @b
>>don't contain RE metacharacters.
>None that weren't intended.
If those patterns contain metacharacters, this can make the
regular expression even slower than it might already be. You
might want to have a look at some descriptions of regular
expressions and the non-deterministic finite state automata that
are used to implement them. A really good intro to this topic is
available at http://www.plover.com/~mjd/perl/Regex/article.html .
Once you understand how regular expressions do what they do, it's
much easier to understand why. In particular, I think (although
I'm not 100% sure) that your situation is similar to the one
described near the end of the "Lies" section, i.e. the part where
it describes how one kind of regex matcher can take hours where
another one might only take seconds in certain situations.
- Logan
------------------------------
Date: Mon, 14 Aug 2000 06:43:12 GMT
From: andrew-johnson@home.com (Andrew Johnson)
Subject: Re: grep() optimisation
Message-Id: <4yMl5.23969$k5.239494@news1.rdc1.mb.home.com>
In article <966233370GGX.fvw@var.cx>,
fvw <fvw+usenet@var.cx> wrote:
> <slrn8pf25p.fro.thorfinn@netizen.com.au> (thorfinn@netizen.com.au):
> >> @c=grep(/([^:]*:){6}.*($b)/i, @a);
> >
> >Urm... that doesn't match just six colons... but I assume you meant to
> >say "matches six colons, possibly with other characters in front of
> >each colon, followed by...".
>
> Yup, that's what I meant, my bad.
>
> >If the contents of @b never change, then you should probably put /o on
> >the end of your RE, so that it compiles it once and remembers it.
> Yay, 1 second off, 15 to go. (Don't we all want computation in 0 secs?
> :-) ).
anchoring your regex to the start of the string should certainly help:
@c=grep(/^([^:]*:){6}.*($b)/io, @a)
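The reason is that an unanchored pattern that fails at the first
position gets retried from later starting positions in the string,
while the anchored one can give up after the first attempt. A small
sketch for comparing the two on your own machine, using the core
Benchmark module; the data below is invented, so the numbers will
not match the real case:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Invented data: 70 long lines that do not match, plus one that does.
my @a = map { 'f1:f2:f3:f4:f5:f6:' . ('x' x 100) . " line $_" } 1 .. 70;
push @a, 'f1:f2:f3:f4:f5:f6:has NEEDLE7 in it';
my @b = map { "needle$_" } 1 .. 50;

my $alt      = join '|', @b;
my $loose    = qr/(?:[^:]*:){6}.*(?:$alt)/i;
my $anchored = qr/^(?:[^:]*:){6}.*(?:$alt)/i;

# Both forms select the same lines.
my @c_loose    = grep { /$loose/ } @a;
my @c_anchored = grep { /$anchored/ } @a;
print scalar(@c_loose), ' vs ', scalar(@c_anchored), " matching lines\n";

# Time a few passes of each.
timethese(20, {
    unanchored => sub { my @c = grep { /$loose/ } @a },
    anchored   => sub { my @c = grep { /$anchored/ } @a },
});
```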
regards,
andrew
--
Andrew L. Johnson http://members.home.net/perl-epwp/
A closed mouth gathers no foot.
------------------------------
Date: Mon, 14 Aug 2000 06:43:22 GMT
From: fvw+usenet@var.cx (fvw)
Subject: Re: grep() optimisation
Message-Id: <966235756FIA.fvw@var.cx>
<4yMl5.23969$k5.239494@news1.rdc1.mb.home.com> (andrew-johnson@home.com):
>anchoring your regex to the start of the string should certainly help:
>
> @c=grep(/^([^:]*:){6}.*($b)/io, @a)
dang, I should have thought of that. 6 sec left to go!
--
Frank v Waveren
fvw@var.cx
ICQ# 10074100
------------------------------
Date: Mon, 14 Aug 2000 06:47:36 GMT
From: fvw+usenet@var.cx (fvw)
Subject: Re: grep() optimisation
Message-Id: <966235911JED.fvw@var.cx>
<8n849i$1eb$1@provolone.cs.utexas.edu> (logan@cs.utexas.edu):
>In article <966233370GGX.fvw@var.cx>, fvw <fvw+usenet@var.cx> wrote:
>>Yay, 1 second off, 15 to go. (Don't we all want computation in 0 secs?
>>:-) ).
>
>That's why I've written a special version of sleep()
>that accepts negative arguments. Just pass it a -10
>and it returns 10 seconds before you called it.
>
>Of course, nobody will ever use it, since temporal anomalies
>will lead to buggy programs -- if the function returns 10
>seconds before you called it, then the memory you allocated 5
>seconds ago to store the result of the function call hasn't
>been allocated yet, and that's sure to cause problems.
Not to mention the paradoxes you get when a process tries to kill()
the parent of its parent :-).
>>>Oh, and just as an aside, you need to be sure that the strings in @b
>>>don't contain RE metacharacters.
>>None that weren't intended.
>
>If those patterns contain metacharacters, this can make the
>regular expression even slower than it might already be. You
>might want to have a look at some descriptions of regular
>expressions and the non-deterministic finite state automata that
>are used to implement them. A really good intro to this topic is
>available at http://www.plover.com/~mjd/perl/Regex/article.html .
>Once you understand how regular expressions do what they do, it's
>much easier to understand why. In particular, I think (although
>I'm not 100% sure) that your situation is similar to the one
>described near the end of the "Lies" section, i.e. the part where
>it describes how one kind of regex matcher can take hours where
>another one might only take seconds in certain situations.
Oh, interesting reading, just like the other link you posted..
This'll keep me busy till christmas, but still, thanks!
--
Frank v Waveren
fvw@var.cx
ICQ# 10074100
------------------------------
Date: Mon, 14 Aug 2000 07:01:27 GMT
From: Uri Guttman <uri@sysarch.com>
Subject: Re: grep() optimisation
Message-Id: <x73dk8z5q0.fsf@home.sysarch.com>
>>>>> "f" == fvw <fvw> writes:
f> I have an array @a, containing ~70 lines. I also have a string @b,
f> containing ~50 strings. I want to get an array (@c) containing the
f> lines of @a that match 6 colons (':'), followed by a string from @b
f> (case insensitive). I'm currently doing:
f> $b=join('|', @b);
f> @c=grep(/([^:]*:){6}.*($b)/i, @a);
f> but since I have to do this several times, it takes quite some time.
f> Anybody got any suggestions? TIA.
what is the .* matching? are there any patterns in @b you could reduce?
is your match string at the end of the data?
post some of the input and search strings.
if you can isolate the match string part somehow, you could just match
the first part and then check if the second part is in a hash formed
from @b (untested):
@is_in_b{ @b } = () ;
@c = grep { /(?:[^:]*:){6}.*(MATCH_B)/i && exists $is_in_b{$1} } @a;
put something in MATCH_B that will grab that string. you could deal with
case insensitivity with lc or uc as needed.
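as a concrete (and hypothetical) instance: suppose the string to look
up is the whole seventh colon-separated field. that is only an
assumption about the data, and the sample lines below are invented.
then MATCH_B becomes [^:]* and the lookup is a plain hash test:

```perl
use strict;
use warnings;

# Invented sample data; the assumption is that the interesting string
# is exactly the seventh colon-separated field.
my @a = (
    'a:b:c:d:e:f:Foo:rest',
    'a:b:c:d:e:f:baz:rest',
    'a:b:c:d:e:f:BAR',
);
my @b = qw(foo bar);

# Keys are lower-cased; the values are undef, so test with exists.
my %is_in_b;
@is_in_b{ map { lc($_) } @b } = ();

my @c = grep {
    /^(?:[^:]*:){6}([^:]*)/ && exists $is_in_b{ lc $1 }
} @a;

print "$_\n" for @c;
```

the hash lookup costs about the same whether @b holds 50 strings or
5000, which is where the win over a long alternation comes from.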
that is what is known as optimizing your algorithm to take advantage of
any patterns in your data.
uri
--
Uri Guttman --------- uri@sysarch.com ---------- http://www.sysarch.com
SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting
The Perl Books Page ----------- http://www.sysarch.com/cgi-bin/perl_books
The Best Search Engine on the Net ---------- http://www.northernlight.com
------------------------------
Date: Sun, 13 Aug 2000 21:59:48 -0700
From: "Godzilla!" <godzilla@stomp.stomp.tokyo>
Subject: Re: matching starting html code
Message-Id: <39977CC4.69683183@stomp.stomp.tokyo>
JonB9@aol.com wrote:
> how can i match all the code from <html> to <body ...>???
> <html>[.\n]{0,}<head>[.\n]{0,}<\/head>[.\n]{0,}<body[.\n]{0,}>
You have not indicated clearly if you want <html>
and <body ...> included in your results. Your
regex suggests you want this. Your 'from' indicates
immediately "after" <html> and your 'to' indicates
"to but not including" <body ...> for your results.
Be careful to both write and speak clearly and concisely.
"...from <html> to <body ...>, inclusive...."
or "all inclusive" if this strikes your fancy.
**
Chances are fair to middlin' you have not read this
type of syntax before. You will discover it is a good
practice to use 'index' and 'substring' when possible.
These are faster and more efficient than substitution
style regex or matching style regex.
One disadvantage is index is case sensitive. Perhaps
a day will come Mr. Wall will have enough sense to
include a switch to make index case insensitive.
Best I know, there is not a switch for this.
This is the main code of my test script below,
$new = $string;
$new =~ s/<BODY/<body/;
$new = substr ($new, 0, (${\index ($new, ">", ${\index ($new, "<body")})} + 1));
This variable $new is created simply to protect the
original string and is not always needed, depending
on whether you need your original input string intact.
I have lower-cased <BODY so index will find it.
This is a disadvantage, having to include a regex
or transliteration so index can make a match. For
this, I chose to simply lower case one word with
a simple regex substitution. This does slow down
this script a bit. Again, a case insensitive
index function would be a wise inclusion for
new versions of Perl. Perhaps it does exist and
I simply cannot find documentation.
Personal choice on how to handle case for index.
Transliteration is quickest, =~ tr/A-Z/a-z/;
or whatever is needed for your script.
$new = substr ($new, 0, (${\index ($new, ">", ${\index ($new, "<body")})} + 1));
This is the meat. Careful, it should be one long
line. Word wrap may eat it.
Standard issue substring beginning at position zero,
grabbing a set number of characters following.
(${\index ($new, ">", ${\index ($new, "<body")})} + 1))
This section is a bit confusing, because of precedence
in operation appearing to be backwards. However, general
rule of mathematics can be applied, innermost parenthetical
comes first.
${\index ($new, "<body")}
This determines at what position index should BEGIN
to search for what is truly needed, a closing > for
your body tag. This skips over all other > save for
the one you need.
${\index ($new, ">",
This partial snippet tells substring where to STOP
grabbing characters. Another way of saying this,
'how many characters to grab' except...
} + 1)
you need to add one to include the last > for your
body tag. Without plus one, it will be excluded.
Do consider using substring and index, when you
can, in place of regex methods. Both are faster
and more efficient. Not so difficult once you
learn the syntax.
Incidentally, this zero in,
($new, 0,
can also be an index function. You don't have
to start at zero with this type of coding.
Final thought is you also save on not having to
create place holder variables using this method.
You use functions but do not create variables,
not in the traditional sense, which use a little
extra memory, thus sacrificing some speed. Not
so important really if your script is small.
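One possible way around the case problem, offered only as a sketch:
search a lower-cased copy with index and cut the original string at
the same offsets. The function name below is made up, and the approach
assumes plain ASCII text, where lc does not change the length of the
string.

```perl
use strict;
use warnings;

# html_through_body_tag is a hypothetical helper name.
# Returns everything from the start of $html through the closing '>'
# of the first <body ...> tag, in any letter case, or nothing if
# there is no such tag.
sub html_through_body_tag {
    my ($html) = @_;
    my $lower = lc $html;    # search copy; offsets match for ASCII text

    my $start = index($lower, '<body');
    return if $start < 0;

    my $end = index($lower, '>', $start);
    return if $end < 0;

    # Cut the original, so the case of the input is preserved.
    return substr($html, 0, $end + 1);
}

my $string = '<HTML><HEAD><TITLE>t</TITLE></HEAD><BODY whatever>more stuff</body></HTML>';
my $new    = html_through_body_tag($string);
print "$new\n";
```

This handles <BODY, <body and <Body alike, leaves the original
string untouched, and gives a place to deal with input that has no
body tag at all.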
Godzilla!
TEST SCRIPT:
____________
#!/usr/local/bin/perl
print "Content-Type: text/plain\n\n";
$string = "<HTML><HEAD><TITLE><ROBOTS - META TAGS - GIBBERISH>
Kira's Nude Gallery</TITLE></HEAD><BODY whatever>more stuff<junk>
Mule Manure<plaintext crash>Kira Is Wise</body></HTML>";
print "INPUT:\n $string\n\n";
$new = $string;
$new =~ s/<BODY/<body/;
$new = substr ($new, 0, (${\index ($new, ">", ${\index ($new, "<body")})} + 1));
print "OUTPUT:\n $new";
exit;
PRINTED RESULTS:
________________
INPUT:
<HTML><HEAD><TITLE><ROBOTS - META TAGS - GIBBERISH>
Kira's Nude Gallery</TITLE></HEAD><BODY whatever>more stuff<junk>
Mule Manure<plaintext crash>Kira Is Wise</body></HTML>
OUTPUT:
<HTML><HEAD><TITLE><ROBOTS - META TAGS - GIBBERISH>
Kira's Nude Gallery</TITLE></HEAD><body whatever>
------------------------------
Date: Sun, 13 Aug 2000 21:39:31 -0700
From: stevel@bluetuna.com (Steve Leibel)
Subject: Re: Negativity in Newsgroup -- Solution
Message-Id: <stevel-1308002139310001@192.168.100.2>
In article <3997421D.94F8C927@attglobal.net>, care227@attglobal.net wrote:
> "Randal L. Schwartz" wrote:
> >
> >
> > You just are expected to follow Usenet tradition:
> >
> > 1) read the group for a week or two before your first posting
>
> one
>
> >
> > 2) be sure to do your homework before you post
>
> two
>
> >
> > 3) recognize that this is not a help desk: it's volunteers that are doing
> > this for free because they *want* to help
>
> three
>
> >
> > If the "newbie" (a term which I would find offensive when I am one :)
> > just did these two things
> ^^^
>
> Methinks the Schwartz is growing weary.
Not to mention an incorrect spelling of "its" in item #3. "It's" is a
contraction of "it is," which Randall would have known if he had bothered
to read the Oxford English Dictionary in its entirety before posting.
------------------------------
Date: Mon, 14 Aug 2000 04:38:58 GMT
From: Bob Walton <bwalton@rochester.rr.com>
Subject: Re: Pattern Matching Question
Message-Id: <39977812.39A85C6F@rochester.rr.com>
David Webb wrote:
...
> I am trying to extract all the links and their titles from a web page. As
> you
> know, it is not guaranteed that all the data would be on same line.
>
> I was trying this
>
> while ($main =~ /<a href="(.*)">(.*)</a>/igm) {
> print "$1, $2";
> }
>
> The problem is, if there are 10 links, it picks the data from first link to
> the end of 10th link, all in $1. What mistake I am doing? Please guide.
>
> Since I am a beginner so if you can suggest better code to extract all the
> links and their title, I'll appreciate it.
The real problem is that you aren't using the HTML::Parser module or
equivalent. Parsing HTML properly is a lot trickier than it looks at
first glance.
The specific problem you are having is that you are using the greedy
version of the * regular expression quantifier. .* , for example, matches
the longest possible substring that still lets whatever follows it match.
That means that:
'abcabcabc'=~/^(.*)c/;
will return abcabcab in $1, for example -- you were obviously expecting
it to match ab . So your first .* matches everything until the last "
in the string. You could try the non-greedy version, which is .*? :
'abcabcabc'=~/^(.*?)c/;
which returns ab in $1.
Or you could write in the pattern you really want: Instead of .* , use
[^"]* for the first one, and [^<]* for the second one. But trust me:
HTML::Parser is what you *really* want.
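To make the difference concrete, here is a small sketch on a made-up
two-link string. It shows what the greedy version captures and what
the negated character classes capture. It is still not an HTML
parser: it assumes double-quoted href attributes with nothing else
inside the <a> tag.

```perl
use strict;
use warnings;

# Made-up input with two links on one line.
my $main = '<a href="one.html">One</a> <a href="two.html">Two</a>';

# Greedy: the first .* runs on to the LAST "> it can reach.
my ($greedy) = $main =~ m{<a href="(.*)">(.*)</a>}i;
print "greedy \$1: $greedy\n";

# Negated classes: stop at the first " and the first < respectively.
my @links;
while ($main =~ m{<a href="([^"]*)">([^<]*)</a>}ig) {
    push @links, [$1, $2];
}
print "$_->[0], $_->[1]\n" for @links;
```

Using m{...} as the delimiter also avoids having to escape the
slash in </a>.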
--
Bob Walton
------------------------------
Date: 14 Aug 2000 06:37:02 GMT
From: abigail@foad.org (Abigail)
Subject: Re: Pattern Matching Question
Message-Id: <slrn8pf4rd.tj3.abigail@alexandra.foad.org>
David Webb (davidwebb@MailAndNews.com) wrote on MMDXL September MCMXCIII
in <URL:news:39B2C6E6@MailAndNews.com>:
== Hi All,
==
== I am trying to extract all the links and their titles from a web page. As
== you
== know, it is not guaranteed that all the data would be on same line.
==
== I was trying this
==
== while ($main =~ /<a href="(.*)">(.*)</a>/igm) {
== print "$1, $2";
== }
==
== The problem is, if there are 10 links, it picks the data from first link to
== the end of 10th link, all in $1. What mistake I am doing? Please guide.
Your mistake is thinking that you can parse HTML with a trivial regex. You can't.
Use a parser.
Abigail
--
$_ = "\nrekcaH lreP rehtona tsuJ"; my $chop; $chop = sub {print chop; $chop};
$chop -> () -> () -> () -> () -> () -> () -> () -> () -> () -> () -> () -> ()
-> () -> () -> () -> () -> () -> () -> () -> () -> () -> () -> () -> () -> ()
------------------------------
Date: Mon, 14 Aug 2000 06:52:12 GMT
From: ghorghor@my-deja.com
Subject: Pb with regular expression plz help
Message-Id: <8n84us$bug$1@nnrp1.deja.com>
Hello
I'm using this expression to find and delete an address stored in the
$adresse variable,
like: $adresse="http://www.nowhere.com/folders/files.zip"
This is the expression:
$fichier[10]=~s/((<br>)?)<a(\s?)href=("?)$adresse([^<]+)<\/a>//i;
But sometimes
I have addresses like this: http://www.nowhere.com/folders/files[1].zip
and with these addresses it doesn't work,
because the [ characters are interpreted by the expression.
Please, could someone help me?
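One common way to handle this, shown here only as a sketch on a
made-up input line, is to wrap the variable in \Q...\E so that
everything inside it is matched literally:

```perl
use strict;
use warnings;

my $adresse = 'http://www.nowhere.com/folders/files[1].zip';

# Made-up line standing in for $fichier[10].
my $line = qq{text<br><a href="$adresse">files[1].zip</a> more};

# \Q...\E escapes [, ], ., / and every other metacharacter in $adresse.
$line =~ s{(?:<br>)?<a\s?href="?\Q$adresse\E[^<]+</a>}{}i;

print "$line\n";
```

Without \Q, the [1] in the address is read as a character class that
matches a single "1", so the pattern looks for files1.zip and never
finds files[1].zip.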
Sent via Deja.com http://www.deja.com/
Before you buy.
------------------------------
Date: Mon, 14 Aug 2000 04:19:59 GMT
From: "Thomas" <mnysurf@home.comREMOVE>
Subject: Perl for Palm?
Message-Id: <PrKl5.78890$6y5.52226580@news2.rdc2.tx.home.com>
Can Perl be run on the Palm OS?
------------------------------
Date: 14 Aug 2000 06:38:46 GMT
From: abigail@foad.org (Abigail)
Subject: Re: Perl script?
Message-Id: <slrn8pf4ul.tj3.abigail@alexandra.foad.org>
Lovena Harwood (lharwoodNOlhSPAM@lucent.com.invalid) wrote on MMDXXXV
September MCMXCIII in <URL:news:0aa5d029.0f452258@usw-ex0104-026.remarq.com>:
??
?? There is a wonderful site I just found today called "Aloha,
?? write you message in the Hawaiian sand". There are fields for
?? entering text and the resulting image with your message is
?? emailed to a recipient of your choice.
??
?? I am trying to locate a script that does this (I'd be using a
?? different background and have it hosted) and would like to know
?? if anyone may know of a source.
And this has exactly what to do with Perl?
Abigail
--
perl -we 'eval {die ["Just another Perl Hacker\n"]}; print ${${@}}[$#{@{${@}}}]'
------------------------------
Date: Mon, 14 Aug 2000 08:50:23 +0200
From: Javier Hijas <jhijas@yahoo.es>
Subject: Re: pipes
Message-Id: <399796AF.AF576165@yahoo.es>
Martien Verbruggen answered me:
In comp.lang.perl.misc, you wrote:
> I am translating a shell script where you can find things like this:
>
> rsh $HOMESERVER "cp /home/EMBL/stdprofile /home/${acc_name}/.profile"
> rsh $HOMESERVER "cp /home/EMBL/stdkshrc /home/${acc_name}/.kshrc"
> rsh $HOMESERVER "cp /home/EMBL/stdlogin /home/${acc_name}/.login"
> rsh $HOMESERVER "cp /home/EMBL/stdcshrc /home/${acc_name}/.cshrc"
> rsh $HOMESERVER "cp /home/EMBL/stdexrc /home/${acc_name}/.exrc"
> rsh $HOMESERVER "cp /home/EMBL/stdelmrc /home/${acc_name}/.elm/elmrc"
Well... Even for a shell script, that's pretty dumb..
rsh $HOMESERVER <<EOF
mkdir -p /home/${acc_name}/.elm
cp /home/EMBL/stdprofile /home/$acc_name/.profile
cp /home/EMBL/stdkshrc /home/$acc_name/.kshrc
cp /home/EMBL/stdlogin /home/$acc_name/.login
cp /home/EMBL/stdcshrc /home/$acc_name/.cshrc
cp /home/EMBL/stdexrc /home/$acc_name/.exrc
cp /home/EMBL/stdelmrc /home/$acc_name/.elm/elmrc
EOF
or even
rsh $HOMESERVER <<EOF
cd /home/EMBL
mkdir -p /home/${acc_name}/.elm
cp stdprofile stdkshrc stdlogin stdcshrc stdexrc /home/${acc_name}
cp stdelmrc /home/$acc_name/.elm
chown -R ${acc_name}:$acc_group /home/$acc_name
EOF
Or.. You could put all of this stuff in a script on the remote server,
and just do
rsh $HOMESERVER /path/to/script
or even better still: Make sure EMBL looks exactly as you want the other
one to look, and do a cp -r, or even rcp from this host to that one, or
use rdist.
> which makes the script really slow. Now in perl I want to avoid this
> using only one rsh with a pipe to introduce all the commands. I tried it
> in this way:
Not going to be any faster. For this sort of thing shell is just a
better tool. You just need to know how to use it.
> open(HSERVER,"|rsh pc-cg13") || die "can't fork: $!";
> print HSERVER "mkdir test\n" || print "print did'n work";
You don't want to do this. Honestly. If you want to work on a remote
host, look at Net::Telnet
> Perl doesn't complain at all, but it does not create a dir called test.
> I checked all the permissions.
Why would perl complain? Perl succeeds in printing, most likely.
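Two details in the quoted Perl are worth spelling out. In the print line the || binds to the string, which is always true, so the second print can never run. And a print to a pipe only fills a buffer; a failing command normally shows up when the handle is closed. A minimal sketch, assuming a local sh as a stand-in for the remote command (the helper name run_piped is made up). With rsh the command list would presumably have to name a remote shell, e.g. ('rsh', $host, '/bin/sh'), because the rsh manual page describes rsh without a command as logging in via rlogin:

```perl
use strict;
use warnings;

# Hypothetical helper: feed @lines to the standard input of the command in
# @$cmd and return the command's exit status.
sub run_piped {
    my ($cmd, @lines) = @_;
    # List-form piped open (Perl 5.8 or later): no local shell is involved.
    open(my $fh, '|-', @$cmd) or die "can't fork @$cmd: $!";
    for my $line (@lines) {
        # Low-precedence "or" applies to the result of print, not to the string.
        print {$fh} $line or die "print to @$cmd failed: $!";
    }
    # close() waits for the child. It returns false when the command exits
    # non-zero; $! is only set when something else went wrong.
    unless (close($fh)) {
        die "close on @$cmd failed: $!" if $!;
    }
    return $? >> 8;
}

my $status = run_piped(['sh'], "true\n", "exit 3\n");
print "exit status: $status\n";
```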
> Another problem I have doing this is that I don't know how I could get
> the error messages from the commands I input to the rshell.
If you use Net::Telnet, it'll be easier. But I wouldn't bother. Rewrite
the scripts to be more speedy, and not to repeat everything 600 times.
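On getting at the error messages without Net::Telnet: one common approach is to run the command through qx with stderr folded into stdout and then look at $?. A minimal sketch, assuming a POSIX shell on the local side; run_capture is a made-up name, and classic rsh implementations generally do not pass the remote exit status back, so with rsh only the captured text should be relied on:

```perl
use strict;
use warnings;

# Hypothetical helper: run a shell command line, returning its combined
# stdout/stderr text and its exit status.
sub run_capture {
    my ($command) = @_;
    # "2>&1" is interpreted by the local shell and merges stderr into stdout.
    # The parentheses group the command so the redirection covers all of it.
    my $output = qx{( $command ) 2>&1};
    die "can't run '$command': $!" unless defined $output;
    return ($output, $? >> 8);
}

my ($text, $status) = run_capture('echo oops >&2; exit 2');
print "status=$status output=$text";
```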
I hope you're not doing this stuff as root. Allowing rsh access via
rhosts and hosts.equiv files is dangerous. Consider using ssh.
Martien
--
Martien Verbruggen |
Interactive Media Division | "In a world without fences,
Commercial Dynamics Pty. Ltd. | who needs Gates?"
NSW, Australia
------------------------------
Date: Mon, 14 Aug 2000 08:57:07 +0200
From: Javier Hijas <jhijas@yahoo.es>
To: mgjv@tradingpost.com.au
Subject: Re: pipes
Message-Id: <39979843.58675E0B@yahoo.es>
Thanks Martien,
you gave me several ways to solve my problem. Though I can try any of
them, I was looking for something like:
rsh $HOMESERVER <<EOF
....
So I tried it, but it looks like rsh doesn't allow pipes; it ignores
everything I type until EOF.
------------------------------
Date: Sun, 13 Aug 2000 22:47:33 -0700
From: "Jürgen Exner" <juex@my-deja.com>
Subject: Re: Procmail vs Perl.
Message-Id: <399787f7@news.microsoft.com>
"Tony L. Svanstrom" <tony@svanstrom.com> wrote in message
news:1efbmse.1s7p8t2171w2glN%tony@svanstrom.com...
> Ah, now we're getting to the interesting stuff... Exactly what is it
> that Procmail is doing that I can't do with Perl, when it comes to
> writing to the mailbox?
Nothing. Of course you could re-implement procmail in Perl.
The only question is: why would you want to?
jue
------------------------------
Date: Mon, 14 Aug 2000 04:15:18 GMT
From: x <x@x.com>
Subject: Re: rmdir works with Xitami/w95 but not with MS-IIS/NT
Message-Id: <HaDLNul+IMUJq1u25aky1ALO+tW9@4ax.com>
If you want to delete directory "dir3":
For a UNIX-based server try
system("rm -r /dir1/dir2/dir3");
Be careful, "rm -r /" will be judgement day for your UNIX server >-)
For a W9x/NT-based server try
system('deltree /Y \dir1\dir2\dir3'); X-D
(single quotes, so that Perl does not eat the backslashes)
Do not try deltree /Y \ ...
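A portable alternative to shelling out: the File::Path module that ships with Perl has rmtree, which removes a directory and everything below it on both Unix and Windows. A minimal sketch; the directory names are made up, and a temporary directory is used so the example is safe to run:

```perl
use strict;
use warnings;
use File::Path qw(mkpath rmtree);
use File::Temp qw(tempdir);

# Build a throwaway tree dir1/dir2/dir3 with one file in it.
my $base = tempdir(CLEANUP => 1);
my $dir3 = "$base/dir1/dir2/dir3";
mkpath($dir3);
open(my $fh, '>', "$dir3/file.txt") or die "can't create file: $!";
print {$fh} "data\n";
close($fh) or die "can't close file: $!";

# rmtree deletes dir3 and its contents; dir1/dir2 is left alone.
rmtree($dir3);
print( (-d $dir3) ? "still there\n" : "removed\n" );
```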
On Sun, 13 Aug 2000 20:01:09 GMT, "bjg" <bGhassemlou@home.com> wrote:
>I developed several Perl programs, which were tested and working under the
>Xitami webserver on Win95. Now I am installing the same code under
>MS-IIS/Win-NT and rmdir no longer works: it does not remove the
>corresponding directory.
>Here is the code!
>Can anyone help?
>
>Thanks for suggestions.
>
>opendir(DIR, $some_dir) || die "can't opendir $some_dir: $!";
>$thisfile = readdir(DIR);
>print "Deleting:$some_dir/$thisfile,***\n";
>while( $thisfile ne '' ) {
> print "Deleting:$some_dir/$thisfile,***\n";
> if ($thisfile ne '.' && $thisfile ne '..' ){
> if ( unlink("$some_dir/$thisfile") > 0) {
> print "deleted:$thisfile ,".' unlink('." $some_dir/$thisfile )" }
> else {
> print "failed:$thisfile ,".' unlink('." $some_dir/$thisfile )" }
> }
> $thisfile = readdir(DIR);
> }
>closedir(DIR) ;
>rmdir($some_dir);
>
>P.S.
>also have problem with:
>
> -d $subdir It does not yield true when checking on a subdirectory?
>
>
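On the P.S.: a frequent cause of -d appearing to fail (an assumption here, since the code around $subdir is not shown) is that readdir returns bare names, so the test is made relative to the current directory rather than the directory being read. A minimal sketch with the directory prefixed; list_subdirs is a made-up name:

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the subdirectories of $dir.
sub list_subdirs {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "can't opendir $dir: $!";
    # readdir yields names without the directory part, so prefix $dir
    # before applying -d; "." and ".." are skipped.
    my @subdirs = grep { $_ ne '.' && $_ ne '..' && -d "$dir/$_" } readdir($dh);
    closedir($dh);
    return sort @subdirs;
}

print "$_\n" for list_subdirs('.');
```

The same prefixing applies to unlink and rmdir when the script's working directory differs from the directory being cleaned up.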
------------------------------
Date: 14 Aug 2000 07:02:57 GMT
From: abigail@foad.org (Abigail)
Subject: Re: Searching for errant modules
Message-Id: <slrn8pf6c0.tj3.abigail@alexandra.foad.org>
Soren Andersen (soren@spmfoiler.removethat.wonderstorm.com) wrote on
MMDXXXV September MCMXCIII in <URL:news:8mqnli$ttf$2@slb0.atl.mindspring.net>:
;;
;; You need to pull the plug on your box (I know the very idea shakes you to
;; the core of your being) and get out and take a look at how human beings
;; actually live. You have forgotten.
*Plonk*
Abigail
--
perl -wle '$, = " "; sub AUTOLOAD {($AUTOLOAD =~ /::(.*)/) [0];}
print+Just (), another (), Perl (), Hacker ();'
------------------------------
Date: 14 Aug 2000 07:04:36 GMT
From: abigail@foad.org (Abigail)
Subject: Re: Setting file user and group ids
Message-Id: <slrn8pf6f4.tj3.abigail@alexandra.foad.org>
Jerry Preston (g-preston1@ti.com) wrote on MMDXXXVI September MCMXCIII in
<URL:news:3992F885.C3664AA1@ti.com>:
-:
-: How do you save/change a files user and group id?
perldoc -f chown
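For reference, a minimal sketch of both halves of the question: stat to read the ids and chown to set them. The helper name owner_of is made up, and the example sets a temporary file to the ids it already has, because changing the owner to someone else normally requires root:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return (uid, gid) of a file, taken from stat.
sub owner_of {
    my ($file) = @_;
    my @st = stat($file) or die "can't stat $file: $!";
    return @st[4, 5];    # elements 4 and 5 are the numeric uid and gid
}

my ($fh, $file) = tempfile(UNLINK => 1);
my ($uid, $gid) = owner_of($file);

# chown takes numeric ids and returns the number of files changed;
# getpwnam/getgrnam can translate names to ids when needed.
my $changed = chown($uid, $gid, $file);
print "uid=$uid gid=$gid changed=$changed\n";
```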
Abigail
--
perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/'
------------------------------
Date: Mon, 14 Aug 2000 04:08:47 GMT
From: johnvert@my-deja.com
Subject: Re: tag parsing.
Message-Id: <8n7rcc$5sc$1@nnrp1.deja.com>
In article <SBIl5.17452$rd1.3870066@typhoon-news1.southeast.rr.com>,
"Philip Garrett" <philipg@atl.mediaone.net> wrote:
> <johnvert@my-deja.com> wrote in message
> news:8n7hgn$v2g$1@nnrp1.deja.com...
> >
> > However, if each of these lines is split across each element of an
> > array, as in:
> >
> > @tlines=split(/\n/, $t);
> >
> > The regexp above will not work since parts of what I'm trying to
> > retrieve are in different elements of 'tlines'. So my question is: how
> > would I parse such an array?
>
> ($data) = join( "", @tlines) =~ /\<T\>(.*)\<\/T\>/;
>
> That works for this small example, but if you want to do a lot of parsing
> like this, you might want to check out the perldoc pages for HTML::Parser
> and HTML::TokeParser.
>
> HTH,
> Philip
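A minimal sketch of the join approach for an array of lines, using a non-greedy match so that several <T>...</T> pairs in the same input are returned separately (the sample data and the name t_contents are made up):

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between each <T> and </T> pair
# found in the lines of @tlines.
sub t_contents {
    my (@tlines) = @_;
    # Join first so that a pair split across array elements can still match.
    my $joined = join('', @tlines);
    # .*? is non-greedy: it stops at the first </T> rather than the last.
    return $joined =~ /<T>(.*?)<\/T>/g;
}

my @tlines = ('<T><f>f</f>', '<d c="left"></d></T>', '<T>second</T>');
print "$_\n" for t_contents(@tlines);
```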
I want to use HTML::TokeParser, as it seems a lot more logical than
quick-n-dirty regexps. I tried the following:
$p=HTML::TokeParser->new("file");
while($line=$p->get_token("<T>"))
{
my $text=$p->get_trimmed_text;
print "$text\n";
}
Taken from the HTML::TokeParser examples. The problem is, the stuff
between <T> and </T> is a bunch of tags, and I simply want to retrieve
these tags -as is-. Example:
<T><f>f</f><d c="left"></d></T>
I want it to return: <f>f</f><d c="left"></d>
How can I achieve that with HTML::TokeParser? The above does not work.
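A minimal sketch of one way to get the markup back as-is, assuming HTML::TokeParser is installed: find the opening tag with get_tag, then collect the original text of every token up to the matching end tag. The helper name inner_markup is made up, and nested <T> elements are not handled:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Position of the original source text inside each token type
# returned by get_token.
my %raw_index = (S => 4, E => 2, T => 1, C => 1, D => 1, PI => 2);

# Hypothetical helper: return the raw markup between the next <$tag>
# and its closing tag, or undef if there is no <$tag>.
sub inner_markup {
    my ($p, $tag) = @_;
    $p->get_tag($tag) or return;
    my $raw = '';
    while (my $token = $p->get_token) {
        last if $token->[0] eq 'E' && $token->[1] eq $tag;
        $raw .= $token->[ $raw_index{ $token->[0] } ];
    }
    return $raw;
}

# A reference to a string is parsed as a document; a plain string
# would be taken as a file name.
my $html = '<T><f>f</f><d c="left"></d></T>';
my $p = HTML::TokeParser->new(\$html) or die "can't parse: $!";
print inner_markup($p, 't'), "\n";    # tag names are lower-cased by the parser
```

Note that get_token takes no tag argument; get_tag is the call that skips ahead to a named tag.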
Thanks a lot,
-- john
------------------------------
Date: Sun, 13 Aug 2000 22:49:14 -0700
From: Larry Rosler <lr@hpl.hp.com>
Subject: Re: utime Function Not Working
Message-Id: <MPG.14012831f428548798ac70@nntp.hpl.hp.com>
In article <l2Bl5.118$DT4.3528119@nnrp2.clara.net>,
newsgroups@ckeith.clara.net says...
> In article <8n56qa$etc$1@nnrp1.deja.com>, centelec@my-deja.com wrote:
> >On a Windows NT server, I uploaded a file and tried to timestamp it.
> >After the script, I checked the modified date of the file with -M and
> >it's zeros. The code follows:
>
> Where's the -M check? Anyway, -M is not the modification time of the file.
> It's the age of the file *in days, measured from when your program started*
>
> I have never understood why days is useful, but if you played with this file
> recently then it will probably be very small so don't forget to *3600 to
> see an 'in seconds' value. If that's a pain, go for stat() and get the time
> since epoch from that.
Last time I checked, there were more than 3600 seconds in a day. :-)
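For the arithmetic: -M gives the file's age in days relative to the script's start time ($^T), so the factor for seconds is 24 * 60 * 60 = 86400, and stat returns the modification time directly as seconds since the epoch. A minimal sketch using a temporary file whose timestamp is set with utime (the one-hour offset is just an example value):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($fh, $file) = tempfile(UNLINK => 1);
close($fh);

# Set access and modification time to one hour before the script started.
my $stamp = $^T - 3600;
utime($stamp, $stamp, $file) or die "utime failed on $file: $!";

# -M is ($^T - mtime) expressed in days; multiply by 86400 for seconds.
my $age_seconds = (-M $file) * 86400;

# Element 9 of stat is the modification time in epoch seconds.
my $mtime = (stat($file))[9];

printf "age: %.0f seconds, mtime matches: %s\n",
    $age_seconds, ($mtime == $stamp ? 'yes' : 'no');
```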
...
> print $path; or print "$path";
^^^^^^^
Don't show this quoting, even as an alternative. Stringifying what is
already a string is silly and distracting.
--
(Just Another Larry) Rosler
Hewlett-Packard Laboratories
http://www.hpl.hp.com/personal/Larry_Rosler/
lr@hpl.hp.com
------------------------------
Date: 16 Sep 99 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 16 Sep 99)
Message-Id: <null>
Administrivia:
The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc. For subscription or unsubscription requests, send
the single line:
subscribe perl-users
or:
unsubscribe perl-users
to almanac@ruby.oce.orst.edu.
| NOTE: The mail to news gateway, and thus the ability to submit articles
| through this service to the newsgroup, has been removed. I do not have
| time to individually vet each article to make sure that someone isn't
| abusing the service, and I no longer have any desire to waste my time
| dealing with the campus admins when some fool complains to them about an
| article that has come through the gateway instead of complaining
| to the source.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.
For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V9 Issue 4006
**************************************