[16098] in Perl-Users-Digest
Perl-Users Digest, Issue: 3510 Volume: 9
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Jun 29 14:05:58 2000
Date: Thu, 29 Jun 2000 11:05:24 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Message-Id: <962301924-v9-i3510@ruby.oce.orst.edu>
Content-Type: text
Perl-Users Digest Thu, 29 Jun 2000 Volume: 9 Number: 3510
Today's topics:
Re: ***Do not use this code!*** Re: Perl Help Please! (Randal L. Schwartz)
Re: ***Do not use this code!*** Re: Perl Help Please! <pap@NOTHEREsotonians.org.uk>
Re: ***Do not use this code!*** Re: Perl Help Please! (Randal L. Schwartz)
Re: ***Do not use this code!*** Re: Perl Help Please! <pap@NOTHEREsotonians.org.uk>
Re: ***Do not use this code!*** Re: Perl Help Please! <godzilla@stomp.stomp.tokyo>
2nd level of reference <sun_tong_001@yahoo.com>
Re: 2nd level of reference (Tad McClellan)
Re: 2nd level of reference <stephen.kloder@gtri.gatech.edu>
Re: 2nd level of reference (Colin Watson)
Re: 2nd level of reference perl_monkey@my-deja.com
array or hash when generating reports tgfree@my-deja.com
Re: array or hash when generating reports <stephen.kloder@gtri.gatech.edu>
Re: catch error SQL perl_monkey@my-deja.com
cgi.pm & parsing question <jtalbain@nospam.kimochi3d.com>
Re: cgi.pm & parsing question perl_monkey@my-deja.com
Re: cgi.pm & parsing question (Tad McClellan)
Re: explicit package name <abe@ztreet.demon.nl>
Finding out free disk space ??? <Roger.Tillmann@extern.rwso.de>
Help with hashes <cghansen@micron.com>
Re: Help with hashes perl_monkey@my-deja.com
Re: Help: regex for changing NON-absolute urls in a htm <jeffp@crusoe.net>
Re: Help: regex for changing NON-absolute urls in a htm <jbessels@planet.nl>
Re: Help: regex for changing NON-absolute urls in a htm <jeffp@crusoe.net>
Re: Help: regex for changing NON-absolute urls in a htm <care227@attglobal.net>
Digest Administrivia (Last modified: 16 Sep 99) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: 29 Jun 2000 08:27:38 -0700
From: merlyn@stonehenge.com (Randal L. Schwartz)
Subject: Re: ***Do not use this code!*** Re: Perl Help Please!
Message-Id: <m1vgysmrrp.fsf@halfdome.holdit.com>
>>>>> "Paul" == Paul Taylor <pap@NOTHEREsotonians.org.uk> writes:
Paul> In my shaky defence, this is what the guy asked for - and the script
Paul> wasn't intended to be a complete package. It was submitted on the
Paul> assumption that other functionality would be provided by the intended
Paul> user.
OP: "I need something to light this cigarette."
You: "Here, take this flamethrower. It does the job."
What you did, sir, was unethical. Please don't do it again.
--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
------------------------------
Date: Thu, 29 Jun 2000 16:43:02 +0100
From: Paul Taylor <pap@NOTHEREsotonians.org.uk>
Subject: Re: ***Do not use this code!*** Re: Perl Help Please!
Message-Id: <395B6E86.4D1A6995@NOTHEREsotonians.org.uk>
> OP: "I need something to light this cigarette."
> You: "Here, take this flamethrower. It does the job."
>
> What you did, sir, was unethical. Please don't do it again.
>>Nevertheless, I completely retract the code
I thought I made myself clear.
I have accepted the damaging nature of the original code in the
context in which it was originally posted.
I will endeavour to ensure that future posts ( given that they'll still be
read following my ongoing character assassination ) will contain enough
information to prevent misuse by others.
At no time was there any intent to compromise the security of someone else's
system - and the example code posted was given in the hope that it would
*help* the intended user.
I hope this clarifies my position and will satisfy the Perl community.
Pap.
------------------------------
Date: 29 Jun 2000 08:55:27 -0700
From: merlyn@stonehenge.com (Randal L. Schwartz)
Subject: Re: ***Do not use this code!*** Re: Perl Help Please!
Message-Id: <m1og4kmqhc.fsf@halfdome.holdit.com>
>>>>> "Paul" == Paul Taylor <pap@NOTHEREsotonians.org.uk> writes:
Paul> At no time was there any intent to compromise the security of
Paul> someone else's system - and the example code posted was given in
Paul> the hope that it would *help* the intended user.
Then you don't yet know enough to help without causing damage. Please
refrain from posting more CGI answers until you are clear that what
you are doing meets the basic standards of CGI security.
I understand you want to help. I appreciate that. But you don't know
enough to help yet, and you just clearly demonstrated that. Please
lurk some more. Read bugtraq. Read the Web and CGI security FAQs.
Or at least post "THERE MAY BE SECURITY HOLES IN THIS POST THAT COULD
LET A TRUCK DRIVE RIGHT THROUGH" at the top of your posting.
What you did is tie up experts' time both rebutting you *and* answering
the original poster. That's a pretty good way to drive the experts
away.
Let's not turn these newsgroups into "the blind leading the blind".
--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
------------------------------
Date: Thu, 29 Jun 2000 16:53:05 +0100
From: Paul Taylor <pap@NOTHEREsotonians.org.uk>
Subject: Re: ***Do not use this code!*** Re: Perl Help Please!
Message-Id: <395B70E1.9CAC5AB5@NOTHEREsotonians.org.uk>
In case there is any confusion with this thread, I am not looking
to perpetuate this argument at all.
Bad code example + hands up + I'm sorry = problem over.
Pap.
------------------------------
Date: Thu, 29 Jun 2000 09:10:04 -0700
From: "Godzilla!" <godzilla@stomp.stomp.tokyo>
Subject: Re: ***Do not use this code!*** Re: Perl Help Please!
Message-Id: <395B74DC.22F2D813@stomp.stomp.tokyo>
Paul Taylor wrote:
Schwartz wrote:
Briles wrote:
O.J. Simpson wrote:
Obes Sive Geek wrote:
Kahn T. Est wrote:
Hannibal Lector wrote:
> > <Snipped perhaps the most dangerous code
> > I have ever seen posted to a Perl newsgroup,
> > in the hopes that it will only be archived once.>
I've read more dangerous code posted by experts
on a relatively regular basis.
> > It could *very, very* easily be used for evil purposes!
A type of activity practiced by a fair amount of
both regulars and experts posting to this group,
activities which are well documented by my site
log records for the past six months and well
documented by archived articles for this group.
> > To the poster of said code: *Submit a cancellation immediately*,
> > and we'll hope for the best.
Jeesshh... take a chill pill dude before you suffer
a cardiac arrest or suffer intense flatulence.
> > Geez...frickin' amazing.
I'll say...
(snip)
> Nevertheless, I completely retract the code and hope
> that I haven't upset too many people.
pffffttt... have fun with what is no more than
a type of human error we all make, many times.
## Harsh:
@Fry_Em = ("strnps", "getdoc", "fmttext", "../",
"rm -rf", "etc/passwd", "#ex", "#in",
"cmd=", "cgi=", "file=", "rm+", "rf+",
"%00" ...etc ...etc );
$env_agent = $ENV{HTTP_USER_AGENT};
$env_agent =~ tr/A-Z/a-z/;
foreach $fry_em (@Fry_Em)
{
    # index() returns a number (-1 for "not found"), so compare numerically
    $kill_em = index ($stupid_input, $fry_em);
    if ($kill_em > -1)
    {
        $linux = index ($env_agent, "inux");
        if ($linux > -1)
        { print "code very fatal to linux"; }
        $nt = index ($env_agent, "nt");
        if ($nt > -1)
        { print "code very fatal to nt"; }
        $win = index ($env_agent, "win");
        if (($nt == -1) && ($win > -1))
        { print "code very fatal to win"; }
    }
    else
    { print "code very fatal to all browsers"; }
    exit;
}
Is this any less dangerous than an unintentional
error, a type of error all make sooner or later?
## Gentle:
$stupid_input =~ s/[^a-zA-Z0-9 _ ..etc ..etc]//g;
Godzilla!
------------------------------
Date: 29 Jun 2000 12:26:12 -0300
From: * Tong * <sun_tong_001@yahoo.com>
Subject: 2nd level of reference
Message-Id: <sa8hfac1pbf.fsf@sun_tong_001.personal.yahoo.com>
Hi,
Let me ask my question by giving an example:
I used to use
print "[]$url_title\n $url_addr\n\n $url_intro\n\n" ;
to print my url_... variables. Now that they are spread all over my
code, I am trying to define a single variable:
my $myformat='[]$url_title\n $url_addr\n\n $url_intro\n\n' ;
My question is how to use it?
I tried the following, which are all not what I want:
print "$myformat";
print "@{[$myformat]}";
...
Thanks for your help
--
Tong (remove underscore(s) to reply)
http://members.xoom.com/suntong001/
- All free contribution & collection & music from the heavens
------------------------------
Date: Thu, 29 Jun 2000 10:48:02 -0400
From: tadmc@metronet.com (Tad McClellan)
Subject: Re: 2nd level of reference
Message-Id: <slrn8lmod2.42g.tadmc@magna.metronet.com>
On 29 Jun 2000 12:26:12 -0300, * Tong * <sun_tong_001@yahoo.com> wrote:
>my $myformat='[]$url_title\n $url_addr\n\n $url_intro\n\n' ;
>
>My question is how to use it?
Perl FAQ, part 4:
"How can I expand variables in text strings?"
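One way to do what that FAQ entry describes, as a minimal sketch: keep the values in a hash and substitute each $name in the template at print time. The hash %url, its keys, the sample values, and the expand() helper are assumptions made for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical values; in the real program these would be set elsewhere.
my %url = (
    title => 'Example Site',
    addr  => 'http://www.example.com/',
    intro => 'A short description.',
);

# The dollars are escaped, so the template holds literal $title, $addr, $intro.
my $myformat = "[]\$title\n \$addr\n\n \$intro\n\n";

# expand(): replace each $name with the matching hash value, leaving
# names that are not in the hash untouched.
sub expand {
    my ($template, $vars) = @_;
    (my $text = $template) =~ s/\$(\w+)/exists $vars->{$1} ? $vars->{$1} : "\$$1"/ge;
    return $text;
}

print expand($myformat, \%url);
```

Because the lookup goes through a hash, the template can only reach values that were put there on purpose, which is safer than running eval on the string.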
--
Tad McClellan SGML Consulting
tadmc@metronet.com Perl programming
Fort Worth, Texas
------------------------------
Date: Thu, 29 Jun 2000 11:53:02 -0400
From: Stephen Kloder <stephen.kloder@gtri.gatech.edu>
Subject: Re: 2nd level of reference
Message-Id: <395B70DD.9FBCC28B@gtri.gatech.edu>
* Tong * wrote:
> Hi,
>
> Let me ask my question by giving an example:
>
> I used to use
>
> print "[]$url_title\n $url_addr\n\n $url_intro\n\n" ;
>
> to print my url_... variables. Now that they are spread all over my
> code, I am trying to define a single variable:
>
> my $myformat='[]$url_title\n $url_addr\n\n $url_intro\n\n' ;
>
> My question is how to use it?
>
> I tried the following, which are all not what I want:
>
> print "$myformat";
> print "@{[$myformat]}";
> ...
>
> Thanks for your help
>
The problem is, as soon as you declare $myformat, it contains the values
of the interpolated variables at that time, and cannot update. Why not
use a subroutine?
sub print_format {
print "[]$url_title\n $url_addr\n\n $url_intro\n\n" ;
}
and call &print_format whenever appropriate.
------------------------------
Date: 29 Jun 2000 16:12:47 GMT
From: cjw44@flatline.org.uk (Colin Watson)
Subject: Re: 2nd level of reference
Message-Id: <8jfshv$etn$1@riva.ucam.org>
Stephen Kloder <stephen.kloder@gtri.gatech.edu> wrote:
>* Tong * wrote:
>> my $myformat='[]$url_title\n $url_addr\n\n $url_intro\n\n' ;
>
>The problem is, as soon as you declare $myformat, it contains the values
>of the interpolated variables at that time, and cannot update.
If you read the code a little more closely, those are single quotes ...
--
Colin Watson [cjw44@flatline.org.uk]
"Why would you make a better DPL than Wichert?
(Wichert, I'm particularly interested in your answer to this)"
- Anthony Towns, debian-vote
------------------------------
Date: Thu, 29 Jun 2000 16:13:02 GMT
From: perl_monkey@my-deja.com
Subject: Re: 2nd level of reference
Message-Id: <8jfshr$fn1$1@nnrp1.deja.com>
In article <sa8hfac1pbf.fsf@sun_tong_001.personal.yahoo.com>,
Tong <suntong001@yahoo.com> wrote:
> Hi,
>
> Let me ask my question by giving an example:
>
> I used to use
>
> print "[]$url_title\n $url_addr\n\n $url_intro\n\n" ;
>
> to print my url_... variables. Now that they are spread all over my
> code, I am trying to define a single variable:
>
> my $myformat='[]$url_title\n $url_addr\n\n $url_intro\n\n' ;
Bzzt! You want:
my $myformat = "[]$url_title\n $url_addr\n\n $url_intro\n\n";
I.e., use doublequotes, not single quotes. The difference is that in
perl, when you use doublequotes, it interpolates the value of the
variables.
For example, if you say, $foo = "foo";
and then say $foobar = "$foo" . "bar";
then $foobar will be "foobar". If you say
$foobar = '$foo' . "bar";
then $foobar will be "$foobar".
When using doublequotes, it puts the value of the variable in the
variable's place. When using single quotes, it thinks you're using a
dollar sign and then just some random characters, and doesn't even see
it as a variable. You want double quotes.
> print "$myformat";
This is what you want, but just make it
print $myformat;
because it does the same thing and doesn't force the perl interpreter to
interpolate $myformat into another string.
> print "@{[$myformat]}";
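This is valid, but no help: [$myformat] is an anonymous array holding
the one string, so this prints exactly what print $myformat prints.
If you meant @{$myformat}, that wouldn't have worked either, since
$myformat isn't an array reference.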
Sent via Deja.com http://www.deja.com/
Before you buy.
------------------------------
Date: Thu, 29 Jun 2000 16:14:35 GMT
From: tgfree@my-deja.com
Subject: array or hash when generating reports
Message-Id: <8jfskn$fo7$1@nnrp1.deja.com>
Any help would be greatly appreciated.
I have a program that generates reports based on files in a certain
directory. These files are flat files with a \t delimiter.
Each file has an id, an ip address, product description and some info
that is used to generate the total price of the product based on the
number of products ordered.
I divided the contents of each file into array @piece.
$piece[0] has a user id.
$piece[11] has the number of products ordered.
The total amount for each order is stored in an array called @totalbyuser
with push(@totalbyuser, ($piece[11] * $price)), and I am using the
@totalbyuser array in a foreach loop with a counter to print the $id
parallel to $totalbyusers[counter++].
Please see below.
My report is printing duplicates of ids and it is not generating the
total for each user. Here is what it prints:
id (number of orders) - price
($id ($numberoforders) - $totalbyuser[counter++]) #var names
lakeside (2) - $20
lakeside (2) - $20
nstar (3) - $20
amsr (1) - $20
nstar (3) - $20
nstar (3) - $20
I would like to print:
lakeside (2) - $40
nstar (3) - $60
amsr (1) - $20
This is what I have so far:
$counter = 0;
foreach $id (@other) {
$numberoforders = grep {$_ eq $id} @other; #counts number of orders
print "$id ($numberoforders) - $$totalbyusers[$counter++]\n";
}
Should I use a hash, arrays, or something else?
Thanks for any response. This is a major challenge and nobody here at
work can figure it out. Good luck and again thanks for any suggestions.
Thiago
------------------------------
Date: Thu, 29 Jun 2000 12:38:33 -0400
From: Stephen Kloder <stephen.kloder@gtri.gatech.edu>
Subject: Re: array or hash when generating reports
Message-Id: <395B7B88.4CBA1E55@gtri.gatech.edu>
tgfree@my-deja.com wrote:
>
> My report is printing duplicates of ids and it is not generating the
> total for each user. Here is what it prints:
>
>
> Should I use a hash, arrays, or something else?
>
Whenever you are worried about duplicates, that's a good sign you need a
hash. Since you are worried most about duplicate id's you should use the
id as the hash key.
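A minimal sketch of that idea, keyed on the id: one hash counts the orders and another accumulates the total. The summarize() helper, the [id, quantity] input layout, and the fixed unit price of 20 are assumptions chosen to match the sample report in the question, not the poster's actual file format.

```perl
use strict;
use warnings;

# summarize(): hypothetical helper. Takes [id, quantity] pairs and a unit
# price, and returns one report line per id, in first-seen order.
sub summarize {
    my ($orders, $price) = @_;
    my (%count, %total, @ids);
    for my $order (@$orders) {
        my ($id, $qty) = @$order;
        push @ids, $id unless exists $count{$id};   # remember first appearance
        $count{$id}++;                              # number of orders for this id
        $total{$id} += $qty * $price;               # running total for this id
    }
    return map { sprintf '%s (%d) - $%d', $_, $count{$_}, $total{$_} } @ids;
}

# Sample data mirroring the ids in the question (one item per order).
my @orders = (
    ['lakeside', 1], ['lakeside', 1], ['nstar', 1],
    ['amsr', 1],     ['nstar', 1],    ['nstar', 1],
);
print "$_\n" for summarize(\@orders, 20);
# prints:
#   lakeside (2) - $40
#   nstar (3) - $60
#   amsr (1) - $20
```

The extra @ids array is only there to keep the report in the order the ids first appear; sort keys %total would do if alphabetical order is acceptable.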
------------------------------
Date: Thu, 29 Jun 2000 16:19:48 GMT
From: perl_monkey@my-deja.com
Subject: Re: catch error SQL
Message-Id: <8jfsuf$g4p$1@nnrp1.deja.com>
In article <8jf8u0$ja$1@nnrp1.deja.com>,
eastking@my-deja.com wrote:
> my $sql = "select foo, bar from table where baz=?";
You could always just skip the prepare and just interpolate whatever
baz is into the SQL. You'll have to be careful though, since baz might
contain something that would screw up the SQL syntax. I.e.
# Make sure $baz doesn't contain "'"
$sql = "SELECT foo, bar from table where baz='$baz'";
$sth = $dbh->prepare($sql);
$sth->execute();
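For comparison, a sketch that keeps the placeholder from the quoted query and passes the value to execute(), so nothing in $baz can alter the SQL syntax. The fetch_rows() helper is hypothetical, and $dbh is assumed to be an already connected DBI database handle; prepare, execute, and fetchall_arrayref are standard DBI methods.

```perl
use strict;
use warnings;

# fetch_rows(): hypothetical helper. $dbh is assumed to be a connected DBI
# handle; the table and column names are the ones from the quoted query.
sub fetch_rows {
    my ($dbh, $baz) = @_;
    my $sth = $dbh->prepare('select foo, bar from table where baz = ?');
    $sth->execute($baz);              # value is bound, not spliced into the SQL text
    return $sth->fetchall_arrayref;   # array ref of [foo, bar] rows
}
```

If the handle was created with RaiseError set, a failing prepare or execute dies, and the call can be wrapped in eval { ... } to catch the error.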
------------------------------
Date: Thu, 29 Jun 2000 23:35:12 +0800
From: "Kimochi3D" <jtalbain@nospam.kimochi3d.com>
Subject: cgi.pm & parsing question
Message-Id: <8jfpu6$qhl$1@coco.singnet.com.sg>
Greetings,
I'm designing my own version of formmail, but I intend to use CGI.pm -
I'm wondering if after getting the necessary variables from the form into
the perl script, do I need to go through (parse or check) the variables again
to ensure they do not contain any bad unix commands?
I'm asking this because I am not familiar with "bad unix commands" :-)
Thanks for assistance! Perl newbie.
--
*remove nospam from email address before reply!*
Regards,
Alvin Yap
http://www.kimochi3d.com
------------------------------
Date: Thu, 29 Jun 2000 16:09:00 GMT
From: perl_monkey@my-deja.com
Subject: Re: cgi.pm & parsing question
Message-Id: <8jfsab$fi9$1@nnrp1.deja.com>
In article <8jfpu6$qhl$1@coco.singnet.com.sg>,
"Kimochi3D" <jtalbain@nospam.kimochi3d.com> wrote:
> Greetings,
> I'm designing my own version of formmail, but I intend to use the cgi.pm -
> I'm wondering if after getting the necessary variables from the form into
> the perl script, do I need to go though (parse or check) the variables again
> to ensure they do not contain any bad unix commands?
>
> I'm asking this because I am not familiar with "bad unix commands" :-)
>
> Thanks for assistance! Perl newbie.
YES! You need to check them for bad UNIX commands, particularly if you
ever use exec(), system(), or the backticks (like `ls $HOME`) in your
program with some user data.
This is a potentially big problem, but there are ways to get around it.
First, always use perl's taint flag, -T, when running your scripts.
Perl then marks all outside data as "tainted" and dies with an
"Insecure dependency" error if you try to use it in a risky spot
(a shell command, a file you open for writing, etc.) without
checking it first.
Also, you might want to do some rudimentary parsing. For stuff that is
going to end up getting interpreted by the shell (for whatever reason)
be extremely suspicious of backticks, dollar signs, and pipes in the
user's input. Basically, forms are usually supposed to be easy to fill
out, so if the user inputs any symbol other than maybe letters and
numbers, it's fishy. (And if it's not, then redesign your forms so that
it is)
If you want to be safe, then be EXTREMELY strict about what you'll take as
valid input.
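A minimal sketch of that strictness: accept a field only if every character is on a short allow-list, and reject it otherwise. The clean_field() name, the permitted character set, and the 64-character limit are all assumptions made for illustration; pick whatever your form really needs.

```perl
use strict;
use warnings;

# clean_field(): hypothetical helper. Returns the value if it consists only
# of letters, digits, spaces, and underscores (1 to 64 characters), else undef.
# Under -T, the captured $1 is the copy that Perl treats as untainted.
sub clean_field {
    my ($value) = @_;
    return undef unless defined $value;
    return $value =~ /\A([A-Za-z0-9_ ]{1,64})\z/ ? $1 : undef;
}

for my $input ('John Smith', 'x; rm -rf /') {
    my $ok = clean_field($input);
    print defined $ok ? "accepted: $ok\n" : "rejected\n";
}
```

Listing what is allowed tends to age better than listing what is forbidden, because a forbidden list has to anticipate every dangerous character in advance.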
------------------------------
Date: Thu, 29 Jun 2000 12:28:24 -0400
From: tadmc@metronet.com (Tad McClellan)
Subject: Re: cgi.pm & parsing question
Message-Id: <slrn8lmu98.43k.tadmc@magna.metronet.com>
On Thu, 29 Jun 2000 23:35:12 +0800, Kimochi3D <jtalbain@nospam.kimochi3d.com> wrote:
>I'm wondering if after getting the necessary variables from the form into
>the perl script, do I need to go though (parse or check) the variables again
>to ensure they do not contain any bad unix commands?
Yes.
Err, maybe.
It depends on what you are going to _do_
with the cracker's input.
Just print() it back out? No problem.
Just ignore it? No problem.
Do just about anything else with it? Problem. :-)
>I'm asking this because I am not familiar with "bad unix commands" :-)
If you don't use "Perl commands" that are dangerous, then
you don't need to worry about "unix commands".
If you do use dangerous Perl functions, then what is dangerous
under Unix should be discussed in a newsgroup about Unix.
What is "dangerous" in Perl is on-topic for this newsgroup though.
If you get input from outside of your program source code:
from files ( cookies, static HTML, ... )
from the environment ( GET )
from STDIN ( POST )
...
or, if you are calling external programs (system(), backticks, pipe open...)
( That "reduces" to "nearly every CGI program needs -T" )
Then use taint checking:
perldoc perlsec
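Alongside taint checking, perlsec also recommends the list form of system() when an external program really is needed. A small sketch of the difference follows; run_quietly() is a hypothetical wrapper, and the child process is just the running perl ($^X) so the example does not depend on any particular Unix command being installed.

```perl
use strict;
use warnings;

# run_quietly(): hypothetical helper. Given a LIST of more than one element,
# system() runs the program directly and passes each element as one argument,
# so shell metacharacters in the arguments are never seen by a shell.
sub run_quietly {
    my (@cmd) = @_;
    return system(@cmd) == 0;
}

# The "hostile" last argument is printed back verbatim instead of being
# executed as a second command.
run_quietly($^X, '-e', 'print "$ARGV[0]\n"', 'x; echo INJECTED')
    or warn "command failed\n";
```

The same input passed through a single interpolated string, as in system("prog $arg"), would be handed to the shell, and the part after the semicolon would run as a separate command.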
--
Tad McClellan SGML Consulting
tadmc@metronet.com Perl programming
Fort Worth, Texas
------------------------------
Date: Thu, 29 Jun 2000 17:53:39 +0200
From: Abe Timmerman <abe@ztreet.demon.nl>
Subject: Re: explicit package name
Message-Id: <79rmlskp4rgap1eqmn0ravon96kerflmip@4ax.com>
On Thu, 29 Jun 2000 08:31:01 -0400, tadmc@metronet.com (Tad McClellan)
wrote:
> On Thu, 29 Jun 2000 09:18:33 GMT, maroun234@my-deja.com <maroun234@my-deja.com> wrote:
>
> >
> >I'm trying to compile a perl file.
> >I'm getting the following messge:
> >Global symbol '$HOST' requires explicit package name at line 50
>
>
> Have you already looked up the message in perldiag.pod?
>
>
> >line 50 has: my $HOST = 'private@201.43.2.2';
> ^^
> ^^ $HOST is lexically scoped
>
>
> >anyone has any ideas?
>
>
> If you are really getting that message with that code, then
> it looks like a bug in perl (but I doubt it).
No bug, just unclear messages, try this without diagnostics first:
(Yes, it's on purpose!)
#!/usr/bin/perl -w
use strict;
use diagnostics;
my $foo = 'Foo'
my $bar = 'Bar';
__END__
> (It is much more likely that there is something that you are
> not telling us...)
Yeah, but that is the error that stands out, the
'syntax error near "my "'
bit, gets snowed under. Maybe such a fatal error should start with a
capitalized letter.
--
Good luck,
Abe
------------------------------
Date: Thu, 29 Jun 2000 17:50:05 +0200
From: Roger Tillmann <Roger.Tillmann@extern.rwso.de>
Subject: Finding out free disk space ???
Message-Id: <395B702C.1A079D7C@extern.rwso.de>
How can I find out how large a disk is (harddisk or network drive) and
how can I find out how much space is free on that drive?
I could live with a solution for Windows, but some more general solution
would be better.
can someone help me? PLEASE (begging)..
TIA
Roger
------------------------------
Date: Thu, 29 Jun 2000 09:08:08 -0600
From: "Colby Hansen" <cghansen@micron.com>
Subject: Help with hashes
Message-Id: <8jfoop$oth$1@admin-srv3.micron.com>
I'm having trouble with some of my code... again. Here's what I'm trying to
do:
my values = ();
my data = ();
my $row = 0;
while ($line = <INFILE>) {
@values = split(/,\s*/, $line);
my $column = 0;
foreach (@values){
$data{$row}{$column++} = $_;
}
$row++;
}
I want to be able to set an array, @list, equal to the values in a
particular column, $i. Also, how can I go through the data column by
column? Thanks!
------------------------------
Date: Thu, 29 Jun 2000 16:33:05 GMT
From: perl_monkey@my-deja.com
Subject: Re: Help with hashes
Message-Id: <8jfto0$gov$1@nnrp1.deja.com>
In article <8jfoop$oth$1@admin-srv3.micron.com>,
"Colby Hansen" <cghansen@micron.com> wrote:
> I'm having trouble with some of my code... again. Here's what I'm
trying to
> do:
>
> my values = ();
> my data = ();
??? I think you mean
my @values = ();
my @data = ();
> my $row = 0;
> while ($line = <INFILE>) {
> @values = split(/,\s*/, $line);
> my $column = 0;
> foreach (@values){
> $data{$row}{$column++} = $_;
Eh?
I think $data[$row][$column] would be better. {} is for hash lookups,
and if your keys are just going to be numbers, you may as well use an
array.
> }
> $row++;
> }
How about:
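my @matrix = ();
while (my $line = <INFILE>) {
    chomp $line;
    # "my" gives each row its own array; without it every row would be
    # a reference to the same @array and hold the last line's values
    my @array = split(/,\s*/, $line);
    push(@matrix, \@array);
}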
> I want to be able to set an array, @list, equal to the values in a
> particular column, $i. Also, how can I go through the data column by
> column? Thanks!
If you do the above, then row x, column y should be accessed with
$matrix[$x]->[$y];
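To get the values of one column into @list, and to walk the data column by column, something like the following sketch should do. The sample lines stand in for INFILE, and column() is a hypothetical helper; the column loop assumes every row has the same number of fields.

```perl
use strict;
use warnings;

# Hypothetical sample rows standing in for the lines read from INFILE.
my @lines = ("a, b, c\n", "d, e, f\n", "g, h, i\n");

my @matrix;
for my $line (@lines) {
    chomp $line;
    push @matrix, [ split /,\s*/, $line ];   # fresh anonymous array per row
}

# column(): hypothetical helper returning column $i as a list.
sub column {
    my ($rows, $i) = @_;
    return map { $_->[$i] } @$rows;
}

my @list = column(\@matrix, 1);              # values of column 1
print "@list\n";                             # prints "b e h"

# Walk the data column by column (assumes every row has the same width).
for my $col (0 .. $#{ $matrix[0] }) {
    print "column $col: ", join(' ', column(\@matrix, $col)), "\n";
}
```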
------------------------------
Date: Thu, 29 Jun 2000 11:25:53 -0400
From: Jeff Pinyan <jeffp@crusoe.net>
Subject: Re: Help: regex for changing NON-absolute urls in a html file.
Message-Id: <Pine.GSO.4.21.0006291122240.25169-100000@crusoe.crusoe.net>
[posted & mailed]
First -- I strongly suggest using CPAN modules to PROPERLY parse
HTML. What I'm about to show is probably shun-worthy, but it works in
simple cases.
On Jun 28, Jan Bessels said:
>I've retrieved using LWP: a html file from a web-server. Before being
>useful the contents has to be changed/parsed. I've experience with
>regexes but now I'm baffled. Have tried quite a few things.
>
>Basically, the following has to be done. If an url is relative
>(pics/menu.gif or relative to the root eg /content/pics/menu.gif) it has
>to be changed into http://www.mysite.com/pics/menu.gif and
>http://www.mysite.com/content/pics/menu.gif. In the latter case simply
>concatenating http://www.mysite.com and /content/pics/menu.gif isn't
>good because .com//content/ isn't valid html. Problem is that absolute
>urls like "http://www.somehting.com/.....gif" have to be left alone.
>The urls are used when using src= and href=. Of course src=url,
>src="url" and src='url' and src = url are valid html which can be
>found in the retrieved files.
Have you tried just putting a
<base href="http://www.myserver.com/">
in between the <head> and </head> tags when you get the HTML from the
site?
#!/usr/bin/perl
use LWP::Simple;
$content = get "http://www.crusoe.net/";
$content =~ s%<head>%<head>\n<base href="http://www.friday.net/">%;
print $content; # or whatever
>I'm confident the solution is elegant and simple but I currently don't
>see the solution.
Well, if you need a regular expression solution, I offer the following:
#!/usr/bin/perl -w
use LWP::Simple;
use strict;
my $URL = "http://www.yahoo.com/";
my $base = "http://www.altavista.com"; # NO trailing slash
my $content = get $URL;
my $attr_REx = << 'END';
(?: # optional attribs
\s+ # whitespace
\w+ # ATTR
(?:
\s* = \s* # or ATTR = or ATTR=
(?:
"[^"]*" | # "VALUE"
'[^']*' | # 'VALUE'
[^\s>]* # VALUE
)
)?
)*
END
my $url_REx = << 'END';
(?! ["']? (?: https?:// | mailto: | javascript: | ftp:// ) )
(
"[^"]*" | # "URL"
'[^']*' | # or 'URL'
[^\s>]* # or URL
)
END
$content =~ s{
( <a $attr_REx \s+ href \s* = \s* )
$url_REx
( $attr_REx \s* > )
}{
my $url = $2;
my @parts = ($1,$3);
$url =~ s%^(["']?)(.)%
$2 eq "/" ? "$1$base$2" : "$1$base/$2"
%se;
"$parts[0]$url$parts[1]";
}xegio;
$content =~ s{
( <img $attr_REx \s+ src \s* = \s* )
$url_REx
( $attr_REx \s* > )
}{
my $url = $2;
my @parts = ($1,$3);
$url =~ s%^(["']?)(.)%
$2 eq "/" ? "$1$base$2" : "$1$base/$2"
%se;
"$parts[0]$url$parts[1]";
}xegio;
print $content;
__END__
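The rule that both substitutions apply to each URL can also be pulled out into a small function, which makes it easier to test on its own. This is only a sketch under the same simplifying assumptions as the script: absolutize() is a hypothetical name, $base has no trailing slash, and paths containing ../ or relative to the current document's directory are not resolved.

```perl
use strict;
use warnings;

# absolutize(): hypothetical helper, not part of the script above.
# $base must have no trailing slash, as in the script.
sub absolutize {
    my ($url, $base) = @_;
    # Leave absolute URLs and non-http schemes alone.
    return $url if $url =~ m{\A(?:https?://|ftp://|mailto:|javascript:)}i;
    # Root-relative URLs already start with a slash; others need one added.
    return $url =~ m{\A/} ? "$base$url" : "$base/$url";
}

my $base = 'http://www.mysite.com';
print absolutize('pics/menu.gif', $base), "\n";
print absolutize('/content/pics/menu.gif', $base), "\n";
print absolutize('http://www.example.com/x.gif', $base), "\n";
# prints:
#   http://www.mysite.com/pics/menu.gif
#   http://www.mysite.com/content/pics/menu.gif
#   http://www.example.com/x.gif
```

For full relative-URL resolution, the URI module from CPAN provides URI->new_abs($url, $base), which handles the cases this sketch ignores.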
I don't know if that's "elegant", but if you want a slightly nicer-looking
version, for Perl 5.005 or better, go to
http://www.pobox.com/~japhy/regexes/new_base_url
Randal (et al.), if you see something terribly wrong with my HTML tag
parsing, let us all know.
--
Jeff "japhy" Pinyan japhy@pobox.com http://www.pobox.com/~japhy/
PerlMonth - An Online Perl Magazine http://www.perlmonth.com/
The Perl Archive - Articles, Forums, etc. http://www.perlarchive.com/
CPAN - #1 Perl Resource (my id: PINYAN) http://search.cpan.org/
------------------------------
Date: Thu, 29 Jun 2000 18:20:31 +0200
From: Jan Bessels <jbessels@planet.nl>
Subject: Re: Help: regex for changing NON-absolute urls in a html file.
Message-Id: <395B774F.DB005B2E@planet.nl>
Jeff,
I've also posted this. For starters, the help you AND others have offered is
very much appreciated.
Last evening I downloaded the HTTP::Munger module from CPAN, which was
partially helpful. It should transform relative paths to absolute ones (and
act as a filter as well), but I haven't got it to work yet. I dug into the
module and found some useful regexes. I also found that Munger uses
HTML::Parser. I looked into it and found that with it I can easily extract
and hence transform the urls. I was struck by lightning when I found
HTML::LinkExtor. Extracting/manipulating urls using the event-driven scheme
is now very easy.
Now all I need is some good regexes. The supplied ones (yours) look very
professional and I've no doubt they will do the trick.
The mentioned url, http://www.pobox.com/~japhy/regexes/new_base_url, however
does not work. It doesn't exist or doesn't allow me to enter. The homepage
(http://www.crusoe.net/~jeffp/#regexes) has a link to regular expressions
but it shows "regexs yada yada yada...". Hmmmm. Maybe you or someone else can
give me a new and working url. Any pointers to other good regex pages are
also appreciated (I have the Mastering Regular Expressions book by J. Friedl
as well as the Perl Cookbook).
As for adding a <base href>: I'm not getting away with it this time ;-(
The reason is co-branding. The middle column of www.origsite.com has to
replace the middle column of the www.cobranded-site.com site. Both sites use
relative paths. Adding a <base href> for www.origsite.com will break all
gifs/links for www.cobranded-site.com. Life is a challenge ;->
I've been working with Perl for almost 1.5 years now. I still continue to
find real Pearls (gems) of modules. Also, the Perl community is VERY friendly
and helpful...........
/Wolverine.
------------------------------
Date: Thu, 29 Jun 2000 13:32:00 -0400
From: Jeff Pinyan <jeffp@crusoe.net>
Subject: Re: Help: regex for changing NON-absolute urls in a html file.
Message-Id: <Pine.GSO.4.21.0006291329150.25169-100000@crusoe.crusoe.net>
[posted & mailed]
On Jun 29, Jan Bessels said:
>Now all I need is some good regexes. The supplied ones (yours) look very
>professional and I've no doubt they will do the trick.
Mine merely extracts URLs where expected. Please be sure you know this.
>The mentioned url: http://www.pobox.com/~japhy/regexes/new_base_url however
Sorry, I meant "change_base_url". I'll create the proper links on my home
page (yada yada yada is out of date, since I have several regex examples
now).
>a new and working url. Any pointer to other good regex pages are also
>appreciated (I have the mastering regexes book of J. Friedl as well as Perl
>Cookbook).
MRE and the Perl Cookbook, along with Perl's perlre documentation, are all
I need.
>> http://www.pobox.com/~japhy/regexes/new_base_url
change_base_url
--
Jeff "japhy" Pinyan japhy@pobox.com http://www.pobox.com/~japhy/
PerlMonth - An Online Perl Magazine http://www.perlmonth.com/
The Perl Archive - Articles, Forums, etc. http://www.perlarchive.com/
CPAN - #1 Perl Resource (my id: PINYAN) http://search.cpan.org/
------------------------------
Date: Thu, 29 Jun 2000 13:38:36 -0400
From: Drew Simonis <care227@attglobal.net>
Subject: Re: Help: regex for changing NON-absolute urls in a html file.
Message-Id: <395B899C.83C00E02@attglobal.net>
Jeff Pinyan wrote:
> >> http://www.pobox.com/~japhy/regexes/new_base_url
>
> change_base_url
>
Forbidden
You don't have permission to access /~jeffp/regexes/change_base_url on
this server.
There was also some additional information available about the error:
[Thu Jun 29 13:39:55 2000] access to
/home/jeffp/pub_html/regexes/change_base_url
failed for ss08.ny.us.ibm.com, reason: file permissions deny server
access
Additionally, a 404 File Not Found error was encountered while trying
to use an ErrorDocument to handle the request.
------------------------------
Date: 16 Sep 99 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 16 Sep 99)
Message-Id: <null>
Administrivia:
The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc. For subscription or unsubscription requests, send
the single line:
subscribe perl-users
or:
unsubscribe perl-users
to almanac@ruby.oce.orst.edu.
| NOTE: The mail to news gateway, and thus the ability to submit articles
| through this service to the newsgroup, has been removed. I do not have
| time to individually vet each article to make sure that someone isn't
| abusing the service, and I no longer have any desire to waste my time
| dealing with the campus admins when some fool complains to them about an
| article that has come through the gateway instead of complaining
| to the source.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.
For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V9 Issue 3510
**************************************