[7202] in Perl-Users-Digest
Perl-Users Digest, Issue: 828 Volume: 8
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Fri Aug 8 11:16:31 1997
Date: Fri, 8 Aug 97 08:01:30 -0700
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Fri, 8 Aug 1997 Volume: 8 Number: 828
Today's topics:
Re: parsing techniques (advice needed) <usenet-tag@qz.little-neck.ny.us>
Re: pattern match/replace (Tad McClellan)
Re: Perl to EXE <seay@absyss.fr>
printf denis@mathi.uni-heidelberg.de
Process forking <cpatil@home.com>
Re: Searching for "similar" words (Tony Bowden)
Re: searching man2html <patrick@arch.ethz.ch>
Re: translating accented characters to non-accented cha (Honza Pazdziora)
Re: Why won't this work!?! (Eric Bohlman)
Re: Why won't this work!?! (Honza Pazdziora)
Digest Administrivia (Last modified: 8 Mar 97) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: 7 Aug 1997 17:18:20 GMT
From: Eli the Bearded <usenet-tag@qz.little-neck.ny.us>
Subject: Re: parsing techniques (advice needed)
Message-Id: <eli$9708071307@qz.little-neck.ny.us>
Tom Phoenix <rootbeer@teleport.com> wrote:
> On 5 Aug 1997, Ryan wrote:
> > I know there are modules for doing this, but just as an example, what is
> > the best way to go about parsing links from HTML pages and other similar
> > tasks?
> A URL has a well-defined syntax which I believe could be matched by a
> regular expression. Is that what you want, or are you needing to write a
> true parser in Perl? Hope this helps!
Abigail has posted the RE in the past. Hers is not very well optimized
AFAIK, but it works.
------ begin old article ------
From: abigail@ny.fnx.com (Abigail)
Subject: Re: regx for url
Message-ID: <E13n3E.Jv9@news2.new-york.net>
References: <wr3bucw59ig.fsf@kimbark.uchicago.edu>
Date: Tue, 19 Nov 1996 03:59:38 GMT
On Mon, 18 Nov 1996 04:09:58 GMT, Lyn A Headley wrote in comp.lang.perl.misc:
++ hi,
++
++ I've searched CPAN pretty hard, but can't seem to find
++ a regx which matches urls. can anyone point me to it?
++ (or better yet, post the regx :) ).
Post the regexp? Are you sure? It's pretty long, over 8k.
Ok, here it goes: [concatenate the following lines]
(?:http://(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:(?:[a-z])|(?:[A-Z]))|\d))\.)*(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:[a-z])|(?:[A-Z])))))|(?:(?:(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)))))(?::(?:(?:\d+)))?)(?:/(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[;:@&=])*)(?:/(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-
Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[;:@&=])*))*)(?:\?(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[;:@&=])*))?)?)|(?:ftp://(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[;?&=])*)(?::(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[;?&=])*))?@)?(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))
|\d)(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:(?:[a-z])|(?:[A-Z]))|\d))\.)*(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:[a-z])|(?:[A-Z])))))|(?:(?:(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)))))(?::(?:(?:\d+)))?))(?:/(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[?:@&=])*)(?:/(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:
[\dA-Fa-f])))|[?:@&=])*))*)(?:;type=(?:[AIDaid]))))|(?:news:(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|[.+_-])*)|(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[;/?:&=])+@(?:(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:(?:[a-z])|(?:[A-Z]))|\d))\.)*(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:[a-z])|
(?:[A-Z])))))|(?:(?:(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)))))|\*))|(?:nntp://(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:(?:[a-z])|(?:[A-Z]))|\d))\.)*(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:[a-z])|(?:[A-Z])))))|(?:(?:(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)))))(?::(?:(?:\d+)))?)/(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|[.+_-])*)(?:/(?:\d+))?)|(?:teln
et://(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[;?&=])*)(?::(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[;?&=])*))?@)?(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:(?:[a-z])|(?:[A-Z]))|\d))\.)*(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:[a-
z])|(?:[A-Z])))))|(?:(?:(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)))))(?::(?:(?:\d+)))?))(?:/)?)|(?:gopher://(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:(?:[a-z])|(?:[A-Z]))|\d))\.)*(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:[a-z])|(?:[A-Z])))))|(?:(?:(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)))))(?::(?:(?:\d+)))?)(?:/(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'()
,]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f]))|(?:[;/?:@&=])))(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f]))|(?:[;/?:@&=]))*)(?:%09(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[;:@&=])*)(?:%09(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f]))|(?:[;/?:@&=]))*))?)?)?)?)|(?:(?:wais://(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)(?:(?:(?:(?:[a-z])|
(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:(?:[a-z])|(?:[A-Z]))|\d))\.)*(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:[a-z])|(?:[A-Z])))))|(?:(?:(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)))))(?::(?:(?:\d+)))?)/(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))*))|(?:wais://(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(
?:(?:(?:[a-z])|(?:[A-Z]))|\d))\.)*(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:[a-z])|(?:[A-Z])))))|(?:(?:(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)))))(?::(?:(?:\d+)))?)/(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))*)\?(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[;:@&=])*))|(?:wais://(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)(?:(?
:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:(?:[a-z])|(?:[A-Z]))|\d))\.)*(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:[a-z])|(?:[A-Z])))))|(?:(?:(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)))))(?::(?:(?:\d+)))?)/(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))*)/(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))*)/(?:(?:(?:(?:(?:[a-
z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))*)))|(?:mailto:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f]))|(?:[;/?:@&=]))+))|(?:file://(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:(?:[a-z])|(?:[A-Z]))|\d))\.)*(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:[a-z])|(?:[A-Z])))))|(?:
(?:(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)))))|localhost)?/(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[?:@&=])*)(?:/(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[?:@&=])*))*))|(?:prospero://(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:(?:[a-z])|(?:[A-Z]))|\d))\.)*(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:(?:[a-z
])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:[a-z])|(?:[A-Z])))))|(?:(?:(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)))))(?::(?:(?:\d+)))?)/(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[?:@&=]))(?:/(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[?:@&=])))*)(?:(?:;(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[?:@&]))=(?:(?:(?:(?:
(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[?:@&]))))*)|(?:(?:(?:(?:[a-z])|(?:\d)|[+.-])+):(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f]))|(?:[;/?:@&=]))*|(?://(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[;?&=])*)(?::(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f])))|[;?&=])*))?@)?(?:(?:(?:
(?:(?:(?:(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:(?:[a-z])|(?:[A-Z]))|\d))\.)*(?:(?:(?:(?:[a-z])|(?:[A-Z]))(?:(?:(?:(?:[a-z])|(?:[A-Z]))|\d)|-)*(?:(?:(?:[a-z])|(?:[A-Z]))|\d))|(?:(?:[a-z])|(?:[A-Z])))))|(?:(?:(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)))))(?::(?:(?:\d+)))?))(?:/(?:(?:(?:(?:(?:[a-z])|(?:[A-Z]))|(?:\d)|(?:[$_.+-])|(?:[!*'(),]))|(?:%(?:[\dA-Fa-f])(?:[\dA-Fa-f]))|(?:[;/?:@&=]))*))?)))
Perhaps you prefer a slow build-up? Here's a little program
constructed by turning the pseudo-BNF of RFC 1738 into Perl
regexps:
#!/usr/local/bin/perl -w
use Carp;
use strict;
# Be paranoid about using grouping!
my $lowalpha = '(?:[a-z])';
my $hialpha = '(?:[A-Z])';
my $alpha = "(?:$lowalpha|$hialpha)";
my $digit = '(?:\d)';
my $safe = '(?:[$_.+-])';
my $extra = '(?:[!*\'(),])';
my $national = '(?:[{}|\\\\^~\[\]`])';
my $punctuation = '(?:[<>#%"])';
my $reserved = '(?:[;/?:@&=])';
my $hex = '(?:[\dA-Fa-f])';
my $escape = "(?:%$hex$hex)";
my $unreserved = "(?:$alpha|$digit|$safe|$extra)";
my $uchar = "(?:$unreserved|$escape)";
my $xchar = "(?:$unreserved|$escape|$reserved)";
my $digits = '(?:\d+)';
my $alphadigit = "(?:$alpha|\\d)";
# URL schemeparts for ip based protocols:
my $urlpath = "(?:$xchar*)";
my $user = "(?:(?:$uchar|[;?&=])*)";
my $password = "(?:(?:$uchar|[;?&=])*)";
my $port = "(?:$digits)";
my $hostnumber = "(?:$digits\\.$digits\\.$digits\\.$digits)";
my $toplabel = "(?:(?:$alpha(?:$alphadigit|-)*$alphadigit)|$alpha)";
my $domainlabel = "(?:(?:$alphadigit(?:$alphadigit|-)*$alphadigit)|" .
"$alphadigit)";
my $hostname = "(?:(?:$domainlabel\\.)*$toplabel)";
my $host = "(?:(?:$hostname)|(?:$hostnumber))";
my $hostport = "(?:(?:$host)(?::$port)?)";
my $login = "(?:(?:$user(?::$password)?\@)?$hostport)";
my $ip_schemepart = "(?://$login(?:/$urlpath)?)";
my $schemepart = "(?:$xchar*|$ip_schemepart)";
my $scheme = "(?:(?:$lowalpha|$digit|[+.-])+)";
# The generic form of a URL is:
my $genericurl = "(?:$scheme:$schemepart)";
# The predefined schemes:
# FTP (see also RFC959)
my $fsegment = "(?:(?:$uchar|[?:\@&=])*)";
my $ftptype = "(?:[AIDaid])";
my $fpath = "(?:$fsegment(?:/$fsegment)*)";
# RFC 1738 makes both the path and the ";type=" suffix optional.
my $ftpurl = "(?:ftp://$login(?:/$fpath(?:;type=$ftptype)?)?)";
# FILE
my $fileurl = "(?:file://(?:(?:$host)|localhost)?/$fpath)";
# HTTP
my $hsegment = "(?:(?:$uchar|[;:\@&=])*)";
my $search = "(?:(?:$uchar|[;:\@&=])*)";
my $hpath = "(?:$hsegment(?:/$hsegment)*)";
my $httpurl = "(?:http://$hostport(?:/$hpath(?:\\?$search)?)?)";
# GOPHER (see also RFC1436)
my $gopher_plus = "(?:$xchar*)";
my $selector = "(?:$xchar*)";
my $gtype = "(?:$xchar)";
my $gopherurl = "(?:gopher://$hostport(?:/$gtype(?:$selector" .
"(?:%09$search(?:%09$gopher_plus)?)?)?)?)";
# MAILTO (see also RFC822)
my $encoded822addr = "(?:$xchar+)";
my $mailtourl = "(?:mailto:$encoded822addr)";
# NEWS (see also RFC1036)
my $article = "(?:(?:$uchar|[;/?:&=])+\@$host)";
my $group = "(?:$alpha(?:$alpha|$digit|[.+_-])*)";
my $grouppart = "(?:$article|$group|\\*)";
my $newsurl = "(?:news:$grouppart)";
# NNTP (see also RFC977)
my $nntpurl = "(?:nntp://$hostport/$group(?:/$digits)?)";
# TELNET
my $telneturl = "(?:telnet://$login(?:/)?)";
# WAIS (see also RFC1625)
my $wpath = "(?:$uchar*)";
my $wtype = "(?:$uchar*)";
my $database = "(?:$uchar*)";
my $waisdoc = "(?:wais://$hostport/$database/$wtype/$wpath)";
my $waisindex = "(?:wais://$hostport/$database\\?$search)";
my $waisdatabase = "(?:wais://$hostport/$database)";
my $waisurl = "(?:$waisdatabase|$waisindex|$waisdoc)";
# PROSPERO
# In RFC 1738 these three are repetitions (zero or more characters).
my $fieldvalue = "(?:(?:$uchar|[?:\@&])*)";
my $fieldname = "(?:(?:$uchar|[?:\@&])*)";
my $fieldspec = "(?:;$fieldname=$fieldvalue)";
my $psegment = "(?:(?:$uchar|[?:\@&=])*)";
my $ppath = "(?:$psegment(?:/$psegment)*)";
my $prosperourl = "(?:prospero://$hostport/$ppath(?:$fieldspec)*)";
# new schemes follow the general syntax
my $otherurl = $genericurl;
# Specific predefined schemes are defined here; new schemes
# may be registered with IANA
my $url = "$httpurl|$ftpurl|$newsurl|$nntpurl|$telneturl|" .
"$gopherurl|$waisurl|$mailtourl|$fileurl|" .
"$prosperourl|$otherurl";
# Add a few tests. (Not much, I might have missed some things...)
my @lines = (
'http://www.ny.fnx.com/abigail/',
'http://www.ny.fnx.com/abigail?foobar No more URL!',
'ftp://foo.bar/foo/bar',
'ftp://abigail:password@foo.bar/baz;type=ABCDEF',
'foobar no URL!',
'Bla; bla; <mailto:abigail@ny.fnx.com>',
'<URL: news:wr3bucw59ig.fsf@kimbark.uchicago.edu>',
);
my $line;
foreach $line (@lines) {print "$&\n" if $line =~ m {$url};}
__END__
http://www.ny.fnx.com/abigail/
http://www.ny.fnx.com/abigail?foobar
ftp://foo.bar/foo/bar
ftp://abigail:password@foo.bar/baz;type=A
mailto:abigail@ny.fnx.com
news:wr3bucw59ig.fsf@kimbark.uchicago.edu
Abigail
------ end old article ------
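For the original question (pulling links out of a page), the same build-up style can be tried on a smaller scale. What follows is only a trimmed, http-only sketch written with qr// objects; the variable names echo the RFC 1738 grammar, but the helper extract_http_urls is a made-up name and the pattern is not the full one from the article above.

```perl
use strict;
use warnings;

# Trimmed http-only pattern, built piece by piece (assumption: plain
# host names or dotted quads, no user:password part).
my $alphadigit  = qr/[A-Za-z\d]/;
my $domainlabel = qr/$alphadigit(?:(?:$alphadigit|-)*$alphadigit)?/;
my $toplabel    = qr/[A-Za-z](?:(?:$alphadigit|-)*$alphadigit)?/;
my $hostname    = qr/(?:$domainlabel\.)*$toplabel/;
my $hostnumber  = qr/\d+\.\d+\.\d+\.\d+/;
my $hostport    = qr/(?:$hostname|$hostnumber)(?::\d+)?/;
my $uchar       = qr/[-A-Za-z\d\$_.+!*'(),]|%[\dA-Fa-f]{2}/;
my $hsegment    = qr/(?:$uchar|[;:\@&=])*/;
my $httpurl     = qr{http://$hostport(?:/$hsegment(?:/$hsegment)*(?:\?$hsegment)?)?};

# Hypothetical helper: return every http URL found in a chunk of text.
# Call it in list context; /g with one capture group yields all matches.
sub extract_http_urls {
    my ($text) = @_;
    return $text =~ /($httpurl)/g;
}

my $sample = 'See http://www.ny.fnx.com/abigail/ and '
           . 'http://foo.bar:8080/a/b?x=1&y=2 for details';
my @found = extract_http_urls($sample);
print "$_\n" for @found;
```

Note that RFC 1738 allows characters such as "," and ")" inside a URL, so punctuation that directly follows a link in running text gets swallowed by the match. For real HTML, a proper parser is probably still the safer route.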
Elijah
------
pack rat
------------------------------
Date: Wed, 6 Aug 1997 17:23:32 -0500
From: tadmc@flash.net (Tad McClellan)
Subject: Re: pattern match/replace
Message-Id: <4htas5.smn.ln@localhost>
Dan Brian (dan@clockwork.net) wrote:
: Perl users,
: Can someone solve this one? I need this substitution to occur, and it must
: occur in one line of code (I know this can be inefficient). Several lines
: could also be doable.
: Assuming this is my line of text:
: < blah blah KEYWORD blah blah KEYWORD blah blah blah > KEYWORD < blah
: KEYWORD blah blah >
: I then need a search/replace string that will remove KEYWORD from strings
: where it occurs within the < > brackets. Note that the word can occur
: numerous times within the brackets, and that there can be several bracket
: sets on the text line. I need the above string to translate to:
: < blah blah blah blah blah blah blah > KEYWORD < blah blah blah >
: Can someone offer me a clean solution?
1 while s/(<[^>]*)\s*KEYWORD\s*([^>]*>)/$1$2/g;
Note that this will not work if you <can have <nested> tags>...
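To see the one-liner at work on the sample line, here is a small self-contained sketch. The wrapper strip_keyword_in_brackets is a name invented for this illustration, and \Q...\E is added only so that an arbitrary keyword is matched literally. Each pass of s///g removes at most one KEYWORD per bracket pair (the last one, because [^>]* is greedy), which is why the substitution is repeated until nothing changes.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above.
sub strip_keyword_in_brackets {
    my ($text, $keyword) = @_;
    # Repeat until no KEYWORD is left between a "<" and the next ">".
    1 while $text =~ s/(<[^>]*)\s*\Q$keyword\E\s*([^>]*>)/$1$2/g;
    return $text;
}

my $line = '< blah blah KEYWORD blah blah KEYWORD blah blah blah > '
         . 'KEYWORD < blah KEYWORD blah blah >';
my $cleaned = strip_keyword_in_brackets($line, 'KEYWORD');
print "$cleaned\n";
```

The KEYWORD sitting between the two bracket pairs survives because a ">" separates it from the preceding "<".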
--
Tad McClellan SGML Consulting
tadmc@flash.net Perl programming
Fort Worth, Texas
------------------------------
Date: Thu, 07 Aug 1997 11:47:36 +0200
From: Doug Seay <seay@absyss.fr>
Subject: Re: Perl to EXE
Message-Id: <33E999B8.27997288@absyss.fr>
Sergio Stateri Jr wrote:
>
> Hi! Is there any way to transform a Perl program into an EXE runnable under
> Win32?
>
Have you read about the compiler? It is in the FAQ. I don't know if it
has any restrictions due to DOS.
- doug
------------------------------
Date: Fri, 08 Aug 1997 09:51:25 +0200
From: denis@mathi.uni-heidelberg.de
Subject: printf
Message-Id: <33EACFFD.41C6@mathi.uni-heidelberg.de>
hi all,
I've tried to format the output of my Perl script with printf,
but printf writes the output at the right end of the field.
How can I change this?
thanks
denis
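
A short illustration of the behaviour asked about, offered as a sketch: printf pads on the left by default, so the text ends up at the right edge of the field, and a "-" flag in front of the width switches to left justification. The field width of 10 and the sample values are arbitrary choices for this example.

```perl
use strict;
use warnings;

# Default: right-justified within a field of 10 characters.
my $right = sprintf '[%10s]', 'perl';

# With the "-" flag: left-justified, padded on the right.
my $left = sprintf '[%-10s]', 'perl';

print "$right\n$left\n";

# The flag works for numbers as well.
printf "%-12s|%-6d|\n", 'apples', 42;
```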
------------------------------
Date: Thu, 07 Aug 1997 21:06:26 -0700
From: ChetanPatil <cpatil@home.com>
Subject: Process forking
Message-Id: <33EA9B42.5CE0F8C7@home.com>
I am trying to do the following:
Open log files..
Open a machinelist file
while (machinelist){
$machines{$_} = 1
fork
if parent, next
if child, rsh to $_, process and exit
}
foreach (keys %machines) {
fork
if parent, next
if child, rsh and do some more stuff and exit
}
Will this work? I think my understanding is conceptually flawed: does a
child process execute from the start of the program?
The program goes into a loop in the first while and keeps rsh-ing to
the machines.
Thanks,
Chetan
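
On the conceptual point: after fork() the child does not restart the script; it carries on from the statement following the fork, with a copy of the parent's variables. A child that never exits will therefore keep running the enclosing loop itself, which is one plausible cause of the runaway rsh calls. The sketch below shows the usual shape under stated assumptions: run_in_children is a hypothetical helper, the job is a stand-in for the real rsh work, and POSIX::_exit is used so that the child skips END blocks and does not flush buffers inherited from the parent.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $job->($machine) in one child per machine
# and return a hash of machine name => child exit status.
sub run_in_children {
    my ($job, @machines) = @_;
    my %machine_for;
    for my $machine (@machines) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: execution continues here, not at the top of the script.
            my $ok = $job->($machine);
            # Leave now; otherwise the child would continue the loop
            # and start forking children of its own.
            POSIX::_exit($ok ? 0 : 1);
        }
        $machine_for{$pid} = $machine;    # parent remembers the child
    }
    my %status;
    while ((my $pid = wait()) != -1) {    # reap every child
        next unless exists $machine_for{$pid};
        $status{ $machine_for{$pid} } = $? >> 8;
    }
    return %status;
}

# Stand-in job; a real one might be
#   sub { system('rsh', $_[0], 'uptime') == 0 }
my %status = run_in_children(
    sub { my ($machine) = @_; return $machine ne 'bad-host' },
    qw(alpha beta bad-host),
);
print "$_ exited with $status{$_}\n" for sort keys %status;
```

Because the parent waits for all children of the first batch before returning, a second loop over the same machines can simply call the helper again.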
------------------------------
Date: 8 Aug 1997 09:41:22 GMT
From: tony@niweb.com (Tony Bowden)
Subject: Re: Searching for "similar" words
Message-Id: <5sepk2$erq$2@sparc.tibus.net>
Tony Cox (a.cox@rbgkew.org.uk) wrote:
: However, I'd also like to return "similar" word in case they have misspelled
: the word - that way I can present a list of all possible/likely database
: records. Is there a way of doing this with Perl other than searching using
: regexs that represent many varieties of substrings of the keyword?
Have a look at Text::Soundex
Tony
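
For a feel of what Soundex-style matching does, here is a hand-rolled, simplified sketch that runs without any module installed. It is not the Text::Soundex implementation (that module provides a soundex() function and may treat some edge cases differently); soundex_sketch is a name made up for this illustration. The idea is that similar-sounding words collapse to the same four-character code, so misspellings can be looked up by code instead of by exact spelling.

```perl
use strict;
use warnings;

# Simplified Soundex: first letter plus three digits.
sub soundex_sketch {
    my ($word) = @_;
    (my $letters = uc $word) =~ tr/A-Z//cd;    # keep A-Z only
    return '' if $letters eq '';
    my $first = substr $letters, 0, 1;

    # Map letters to digit classes; vowels, H, W and Y become 0.
    # The /s flag squeezes runs that map to the same digit.
    (my $codes = $letters)
        =~ tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/s;

    substr($codes, 0, 1, '');    # the first letter is kept as a letter
    $codes =~ tr/0//d;           # drop the vowel placeholders
    return substr($first . $codes . '000', 0, 4);
}

for my $pair (['Robert', 'Rupert'], ['Smith', 'Smyth']) {
    my ($x, $y) = @$pair;
    printf "%-8s %s   %-8s %s\n",
        $x, soundex_sketch($x), $y, soundex_sketch($y);
}
```

Storing the code next to each database keyword would let a lookup of a misspelled word return every record with the same code.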
--
-----------------------------------------------------------------------------
Tony Bowden | tony@tmtm.com / t.bowden@qub.ac.uk / http://www.tmtm.com/
Belfast, NI | she tangles misery with flowers, her garden grows
-----------------------------------------------------------------------------
------------------------------
Date: Fri, 08 Aug 1997 10:54:31 +0200
From: Patrick Sibenaler <patrick@arch.ethz.ch>
Subject: Re: searching man2html
Message-Id: <33EADEC7.41C6@arch.ethz.ch>
jee... that thing at least should be around.. not?
--
---------------------------------------------------------------------------
The trick is to communicate bi-directional in real time and high
resolution
---------------------------------------------------------------------------
------------------------------
Date: Thu, 7 Aug 1997 20:17:39 GMT
From: adelton@fi.muni.cz (Honza Pazdziora)
Subject: Re: translating accented characters to non-accented chars
Message-Id: <adelton.870985059@aisa.fi.muni.cz>
[...]
> My problem is I have a file that has ascii characters above 128 (accented
> characters, etc). I need to translate these to non-accented chars, for
> example, e with an accent should become just plain e. I was thinking of
> creating a function with a long list of one-to-one translations with s///
> , but it seems like there must be an easier way.
I can offer you my package Cz::Cstocs. It was mainly aimed at
Czech-language charset conversions, but because it includes iso-8859-1
and ascii encoding files, you might find it useful too.
The use is like this:
use Cz::Cstocs;
my $il1_to_ascii = new Cz::Cstocs "il1", "ascii";
while (<>)
{ print &$il1_to_ascii($_); }
__END__
The conversion is done via symbolic names, and diacritics of various
kinds are correctly stripped.
You can find the module on CPAN in
authors/id/JANPAZ/Cstools-0.08.tar.gz
In the package there is also a command line utility cstocs with
similar use. Man pages are included.
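
For comparison, the "long list of one-to-one translations" from the question does not need a pile of s/// calls; tr/// can do it in a few lines. This is only a rough sketch that assumes the data is ISO-8859-1 bytes (not UTF-8), covers just the common accented Latin letters, and uses a made-up helper name, strip_latin1_accents. When the replacement list is shorter than the search list, tr repeats its last character, so a whole range maps to one plain letter.

```perl
use strict;
use warnings;

# Hypothetical helper: map accented ISO-8859-1 letters to plain ASCII.
sub strip_latin1_accents {
    my ($text) = @_;
    for ($text) {    # alias $_ to $text so tr/// works on it
        tr/\xC0-\xC5/A/;  tr/\xE0-\xE5/a/;
        tr/\xC7/C/;       tr/\xE7/c/;
        tr/\xC8-\xCB/E/;  tr/\xE8-\xEB/e/;
        tr/\xCC-\xCF/I/;  tr/\xEC-\xEF/i/;
        tr/\xD1/N/;       tr/\xF1/n/;
        tr/\xD2-\xD6/O/;  tr/\xF2-\xF6/o/;
        tr/\xD9-\xDC/U/;  tr/\xF9-\xFC/u/;
        tr/\xDD/Y/;       tr/\xFD\xFF/y/;
    }
    return $text;
}

# "caf\xE9" is "cafe" with an acute accent on the e in ISO-8859-1.
print strip_latin1_accents("caf\xE9 na\xEFve \xC5ngstr\xF6m"), "\n";
```

Letters that expand to two characters (such as the German sharp s) do not fit a one-to-one tr/// table, which is where a conversion package earns its keep.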
Hope this helps.
--
------------------------------------------------------------------------
Honza Pazdziora | adelton@fi.muni.cz | http://www.fi.muni.cz/~adelton/
I can take or leave it if I please
European RC5 56 bit cracking effort -> http://www.cyberian.org/
------------------------------
Date: Fri, 8 Aug 1997 12:06:21 GMT
From: ebohlman@netcom.com (Eric Bohlman)
Subject: Re: Why won't this work!?!
Message-Id: <ebohlmanEELGAL.CB1@netcom.com>
Stephen Hill (buck@huron.net) wrote:
: The snippet of script below sort of works: it does a search but it
: ignores the last part of the URL. For some reason I can't get a URL
: with an "&" to work right; it ignores everything to the right of the &
: in the URL.
: $url = "http://www.mydomian.com/cgi-bin/search?text=ontario&offset=600";
: @lines = `lynx -source $url`;
: print "@lines";
The problem is that backticks work by passing the enclosed command to the
shell, and an ampersand is treated by the shell as a special character.
You'd run into the same problem if you typed the URL into the shell
(you'd need to escape the ampersand or enclose the URL in single quotes).
Have you considered using LWP::Simple to do the HTTP request directly
rather than using an external program?
------------------------------
Date: Fri, 8 Aug 1997 12:13:15 GMT
From: adelton@fi.muni.cz (Honza Pazdziora)
Subject: Re: Why won't this work!?!
Message-Id: <adelton.871042395@aisa.fi.muni.cz>
Stephen Hill <buck@huron.net> writes:
> The snippet of script below sort of works: it does a search but it
> ignores the last part of the URL. For some reason I can't get a URL
> with an "&" to work right; it ignores everything to the right of the &
> in the URL.
>
> $url = "http://www.mydomian.com/cgi-bin/search?text=ontario&offset=600";
>
> @lines = `lynx -source $url`;
> print "@lines";
Looks like Subject "Passing & through ``" would be better.
Since `lynx -source $url` starts a shell to parse the string and then
run lynx, the shell most likely thinks that you wanted to run lynx in
the background -- that's what the & character means to the shell.
You will either want to quote the $url to prevent the shell from
thinking & is special
@lines = `lynx -source '$url'`;
or (preferred) use the libwww-perl library, which allows you to do the
work safely inside Perl, without forking a new process or depending on
lynx. You can find the library on CPAN (http://www.per.com/CPAN/).
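
If lynx has to stay, a third possibility is to skip the shell altogether by giving open() a list of arguments, so that "&" (or a stray quote inside the URL) is never interpreted. The following is a sketch under assumptions: capture_without_shell is a hypothetical helper, and the running perl binary stands in for lynx so the example works on a machine without lynx. A child process is still started; only the shell is avoided.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command given as a list (no shell) and
# return its output lines.
sub capture_without_shell {
    my (@command) = @_;
    open(my $pipe, '-|', @command)
        or die "cannot run $command[0]: $!";
    my @output = <$pipe>;
    close($pipe) or warn "$command[0] exited with status $?\n";
    return @output;
}

my $url = 'http://www.mydomian.com/cgi-bin/search?text=ontario&offset=600';

# With lynx installed the call would look like this:
#   my @lines = capture_without_shell('lynx', '-source', $url);
# Here perl itself plays the part of lynx and echoes its argument.
my @lines = capture_without_shell($^X, '-e', 'print "$ARGV[0]\n"', $url);
print @lines;
```

The echoed line contains the whole URL, including everything after the "&".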
> This almost works, but the "offset=600" is not passed to the lynx
> browser; is there any way I can pass a URL (with an "&") to the lynx
> browser??????
Yes, you can, and there is no need to shout.
Hope this helps.
--
------------------------------------------------------------------------
Honza Pazdziora | adelton@fi.muni.cz | http://www.fi.muni.cz/~adelton/
I can take or leave it if I please
European RC5 56 bit cracking effort -> http://www.cyberian.org/
------------------------------
Date: 8 Mar 97 21:33:47 GMT (Last modified)
From: Perl-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 8 Mar 97)
Message-Id: <null>
Administrivia:
The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc. For subscription or unsubscription requests, send
the single line:
subscribe perl-users
or:
unsubscribe perl-users
to almanac@ruby.oce.orst.edu.
To submit articles to comp.lang.perl.misc (and this Digest), send your
article to perl-users@ruby.oce.orst.edu.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.
The Meta-FAQ, an article containing information about the FAQ, is
available by requesting "send perl-users meta-faq". The real FAQ, as it
appeared last in the newsgroup, can be retrieved with the request "send
perl-users FAQ". Due to their sizes, neither the Meta-FAQ nor the FAQ
are included in the digest.
The "mini-FAQ", which is an updated version of the Meta-FAQ, is
available by requesting "send perl-users mini-faq". It appears twice
weekly in the group, but is not distributed in the digest.
For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending Perl questions to the -request address; I don't have time to
answer them, even if I did know the answer.
------------------------------
End of Perl-Users Digest V8 Issue 828
*************************************