[23828] in Perl-Users-Digest

home help back first fref pref prev next nref lref last post

Perl-Users Digest, Issue: 6031 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Jan 29 21:21:35 2004

Date: Thu, 29 Jan 2004 18:16:02 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 29 Jan 2004     Volume: 10 Number: 6031

Today's topics:
        Pattern matching for "best match" (Yash)
    Re: Pattern matching for "best match" <usenet@morrow.me.uk>
    Re: Pattern matching for "best match" <tore@aursand.no>
    Re: Pattern matching for "best match" <jwillmore@remove.adelphia.net>
    Re: Pattern matching for "best match" (Jim Keenan)
    Re: Pattern matching for "best match" (Steve)
    Re: Pattern matching for "best match" <Gunnar.News@gustra.org>
    Re: Pattern matching for "best match" (Sam Holden)
    Re: Pattern matching for "best match" ctcgag@hotmail.com
    Re: Pattern matching for "best match" <Gunnar.News@gustra.org>
        perl and inetd? <mikee@mikee.ath.cx>
    Re: perl and inetd? <mikee@mikee.ath.cx>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: 28 Jan 2004 04:00:05 -0800
From: yashgt@yahoo.com (Yash)
Subject: Pattern matching for "best match"
Message-Id: <5a373b1d.0401280400.556987c@posting.google.com>

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

Thanks


------------------------------

Date: Wed, 28 Jan 2004 12:16:54 +0000 (UTC)
From: Ben Morrow <usenet@morrow.me.uk>
Subject: Re: Pattern matching for "best match"
Message-Id: <bv897m$mv4$1@wisteria.csv.warwick.ac.uk>


yashgt@yahoo.com (Yash) wrote:
> We have a list of regular expressions such as:
> vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
> 
> The total number of such reg expressions is few hundreds.
> Our program reads a large file with millions of URLs. For each URL, it
> has to find the best match in the list of regular expressions.
> For example www.vodafome.com will match with vodafone.
> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for
> a given URL.
> 
> Do you have any suggestions/pointers for doing this in a very
> efficient way? Suggestions regarding data structures to use,
> algorithm, etc. would be helpful.

Define what you mean by the 'best' match. The one that matches the
most text? The intuitive (at least to me) meaning would involve
working out that (e.g.) /vodafone.*horoscope/ will match a subset of
the strings that /vodafone/ will match, and therefore the first is a
more specific ('better') match than the second. I would imagine this
would be pretty hard to work out just starting from the regexes.

Ben

-- 
Musica Dei donum optimi, trahit homines, trahit deos.    |
Musica truces mollit animos, tristesque mentes erigit.   |   ben@morrow.me.uk
Musica vel ipsas arbores et horridas movet feras.        |


------------------------------

Date: Wed, 28 Jan 2004 13:54:31 +0100
From: Tore Aursand <tore@aursand.no>
Subject: Re: Pattern matching for "best match"
Message-Id: <pan.2004.01.28.12.37.30.908640@aursand.no>

On Wed, 28 Jan 2004 04:00:05 -0800, Yash wrote:
> We have a list of regular expressions such as: vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
> 
> The total number of such reg expressions is few hundreds. Our program
> reads a large file with millions of URLs. For each URL, it has to find
> the best match in the list of regular expressions.

You never define "best" in your posting.  What do _you_ require to be a
perfect match?  Almost perfect match?  No match at all?  Etc.

> For example www.vodafome.com will match with vodafone.

It will?  Why?

> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for a
> given URL.

You could always try to treat the strings you match against a regular
expressions and in addition try to compare the strings letter by letter,
and in turn the distance between each letter in each string.

There are modules out there for comparing strings, as well.  I suggest you
check out CPAN;

  http://www.cpan.org/


-- 
Tore Aursand <tore@aursand.no>
"Life is pleasant. Death is peaceful. It's the transition that's
 troublesome." -- Isaac Asimov


------------------------------

Date: Wed, 28 Jan 2004 09:28:00 -0500
From: James Willmore <jwillmore@remove.adelphia.net>
Subject: Re: Pattern matching for "best match"
Message-Id: <pan.2004.01.28.14.27.58.233368@remove.adelphia.net>

On Wed, 28 Jan 2004 04:00:05 -0800, Yash wrote:

> We have a list of regular expressions such as:
> vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
> 
> The total number of such reg expressions is few hundreds.
> Our program reads a large file with millions of URLs. For each URL, it
> has to find the best match in the list of regular expressions.
> For example www.vodafome.com will match with vodafone.
> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for
> a given URL.
> 
> Do you have any suggestions/pointers for doing this in a very
> efficient way? Suggestions regarding data structures to use,
> algorithm, etc. would be helpful.

Are you trying to do web filtering or redirection?  Fuzzy logic searches? 
What?

And .... where's the code? :-)  Or is this where you're stuck?

NEI (Not Enoguh Information)

-- 
Jim

Copyright notice: all code written by the author in this post is
 released under the GPL. http://www.gnu.org/licenses/gpl.txt 
for more information.

a fortune quote ...
If all the world's a stage, I want to operate the trap door.   --
Paul Beatty 



------------------------------

Date: 28 Jan 2004 07:17:33 -0800
From: jkeen_via_google@yahoo.com (Jim Keenan)
Subject: Re: Pattern matching for "best match"
Message-Id: <196cb7af.0401280717.279160c0@posting.google.com>

yashgt@yahoo.com (Yash) wrote in message news:<5a373b1d.0401280400.556987c@posting.google.com>...
> We have a list of regular expressions such as:
> vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
> 
> The total number of such reg expressions is few hundreds.
> Our program reads a large file with millions of URLs. For each URL, it
> has to find the best match in the list of regular expressions.
> For example www.vodafome.com will match with vodafone.
> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for
> a given URL.
> 
Define "best".


------------------------------

Date: 28 Jan 2004 07:36:49 -0800
From: ineverlookatthis@yahoo.com (Steve)
Subject: Re: Pattern matching for "best match"
Message-Id: <f0d57f86.0401280736.a402f80@posting.google.com>

yashgt@yahoo.com (Yash) wrote in message news:<5a373b1d.0401280400.556987c@posting.google.com>...
> We have a list of regular expressions such as:
> vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
> 
> The total number of such reg expressions is few hundreds.
> Our program reads a large file with millions of URLs. For each URL, it
> has to find the best match in the list of regular expressions.
> For example www.vodafome.com will match with vodafone.
> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for
> a given URL.
> 
> Do you have any suggestions/pointers for doing this in a very
> efficient way? Suggestions regarding data structures to use,
> algorithm, etc. would be helpful.
> 
> Thanks

If you define 'the best match' as the string with the longest
continual stretch of characters identical to a target, then molecular
biology has developed some very efficient ways to do this to align DNA
(and protein) sequences. BLAST is the most commonly used and you can
get a description by Googling - essentially it takes a 'seed' of a few
characters, attempts to match and when it finds a match it looks
either side to extend the alignment and then ranks by length of
overlap. Molecular biologists need to worry about allowing
discrepancies and gaps within the match, but I suspect that for your
application a simple exact match is good enough.

I need not point out that the scale of your problem is tiny compared
with aligning genomes (but I did).

HTH

Steve


------------------------------

Date: Thu, 29 Jan 2004 06:21:08 +0100
From: Gunnar Strand <Gunnar.News@gustra.org>
Subject: Re: Pattern matching for "best match"
Message-Id: <bva558$3me$1@hudsucker.umdac.umu.se>

Hi Yash,

Yash wrote:
> We have a list of regular expressions such as:
> vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
> 
> The total number of such reg expressions is few hundreds.
> Our program reads a large file with millions of URLs. For each URL, it
> has to find the best match in the list of regular expressions.
> For example www.vodafome.com will match with vodafone.
> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for
> a given URL.
> 
> Do you have any suggestions/pointers for doing this in a very
> efficient way? Suggestions regarding data structures to use,
> algorithm, etc. would be helpful.
> 
> Thanks

I assume you have put all patterns in "priority" order
so that the first hit is the "best". My very nice Perl
teacher suggested to generate an anonymous sub if I was
going to do what you are doing:

my $sub = 'sub test_for_match { $_ = $_[0]; '.
           join( '', map { "/$_/o && return $_; " } @patterns ) .
           'return "No match"; }';
eval { $sub };
test_for_match( 'www.vodafone.com' );

Haven't tested the code, but I hope I conveyed the message.
The subroutine gets compiled once so it has little overhead
and there are probably other optimization which can be done.

Regards,

/Gunnar



------------------------------

Date: 29 Jan 2004 05:44:52 GMT
From: sholden@flexal.cs.usyd.edu.au (Sam Holden)
Subject: Re: Pattern matching for "best match"
Message-Id: <slrnc1h7ek.bfi.sholden@flexal.cs.usyd.edu.au>

On Thu, 29 Jan 2004 06:21:08 +0100,
	Gunnar Strand <Gunnar.News@gustra.org> wrote:
> 
> I assume you have put all patterns in "priority" order
> so that the first hit is the "best". My very nice Perl
> teacher suggested to generate an anonymous sub if I was
> going to do what you are doing:
> 
> my $sub = 'sub test_for_match { $_ = $_[0]; '.
>            join( '', map { "/$_/o && return $_; " } @patterns ) .
>            'return "No match"; }';
> eval { $sub };
> test_for_match( 'www.vodafone.com' );

There is no anonymous sub in that code.

And the test_for_match subroutine never exists since you are using block
eval which doesn't do what you think it does. Anyway using an eval
correctly would just lead to more problems because the code is
completely wrong.

If your perl teacher suggested such code, find another perl teacher.
More likely you misinterpreted them.


> Haven't tested the code, but I hope I conveyed the message.
> The subroutine gets compiled once so it has little overhead
> and there are probably other optimization which can be done.

5 seconds of testing would show that it doesn't work, thanks for letting
lots of people do so instead of you doing so before posting it.

That style of code generating a subroutine isn't necessary since perl
5.005 (ie. since mid 1998) due to the introduction of qr//.

See the obvious FAQ:

perlfaq6: How do I efficiently match many regular expressions at once?

-- 
Sam Holden


------------------------------

Date: 29 Jan 2004 17:44:44 GMT
From: ctcgag@hotmail.com
Subject: Re: Pattern matching for "best match"
Message-Id: <20040129124444.809$TC@newsreader.com>

yashgt@yahoo.com (Yash) wrote:
> We have a list of regular expressions such as:
> vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
>
> The total number of such reg expressions is few hundreds.
> Our program reads a large file with millions of URLs. For each URL, it
> has to find the best match in the list of regular expressions.
> For example www.vodafome.com will match with vodafone.
> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for
> a given URL.

I assume the regular expressions are in order of "goodness".

warning "Untested code";
my @qr_rigo= map { qr/$_/ } @regex_in_goodness_order;
sub get_best {
  study $_[0]; ## Will this help?  Try and see.
  foreach (@qr_rigo) {
     return $_ if $_[0] =~ /$_/;
  };
  return;
};

while (my $url=<>) {
  chomp $url;
  my $b=get_best($url);
  ##
};


>
> Do you have any suggestions/pointers for doing this in a very
> efficient way? Suggestions regarding data structures to use,
> algorithm, etc. would be helpful.

If the above is not efficient enough (or if you have no way to order the
regexes by goodness by hand), then further optimization probably relies
minutes details and trade-offs which we don't know about.  (What you mean
by "best match", how many of the 1000s of regex are substrings of each
other, the skew in the actual data set, etc.)

Xho

-- 
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service              New Rate! $9.95/Month 50GB


------------------------------

Date: Thu, 29 Jan 2004 19:20:35 +0100
From: Gunnar Strand <Gunnar.News@gustra.org>
Subject: Re: Pattern matching for "best match"
Message-Id: <bvbiqj$jnf$1@hudsucker.umdac.umu.se>

Sam,

You are quite right, that was really, really bad
suggestion in more than one way. Sorry about that.

/Gunnar

Sam Holden wrote:
> On Thu, 29 Jan 2004 06:21:08 +0100,
> 	Gunnar Strand <Gunnar.News@gustra.org> wrote:
> 
>>I assume you have put all patterns in "priority" order
>>so that the first hit is the "best". My very nice Perl
>>teacher suggested to generate an anonymous sub if I was
>>going to do what you are doing:
>>
>>my $sub = 'sub test_for_match { $_ = $_[0]; '.
>>           join( '', map { "/$_/o && return $_; " } @patterns ) .
>>           'return "No match"; }';
>>eval { $sub };
>>test_for_match( 'www.vodafone.com' );
> 
> 
> There is no anonymous sub in that code.
> 
> And the test_for_match subroutine never exists since you are using block
> eval which doesn't do what you think it does. Anyway using an eval
> correctly would just lead to more problems because the code is
> completely wrong.
> 
> If your perl teacher suggested such code, find another perl teacher.
> More likely you misinterpreted them.
> 
> 
> 
>>Haven't tested the code, but I hope I conveyed the message.
>>The subroutine gets compiled once so it has little overhead
>>and there are probably other optimization which can be done.
> 
> 
> 5 seconds of testing would show that it doesn't work, thanks for letting
> lots of people do so instead of you doing so before posting it.
> 
> That style of code generating a subroutine isn't necessary since perl
> 5.005 (ie. since mid 1998) due to the introduction of qr//.
> 
> See the obvious FAQ:
> 
> perlfaq6: How do I efficiently match many regular expressions at once?
> 



------------------------------

Date: Mon, 26 Jan 2004 15:32:52 -0000
From: Mike <mikee@mikee.ath.cx>
Subject: perl and inetd?
Message-Id: <101acp4elrcdr76@corp.supernews.com>

I'm trying something I've done many times before, but it's not
working this time. I have a simple script, even a bourne shell
script, to run under inetd and print an uptime to stdout and
to the connecting telnet (I'm on AIX). For the perl I have used
select(STDOUT); and $| = 1;, but stuff isn't working. Has anyone
run across this? I've done this before without problems, but
no on this specific server. What's going on?

Mike


------------------------------

Date: Mon, 26 Jan 2004 17:01:29 -0000
From: Mike <mikee@mikee.ath.cx>
Subject: Re: perl and inetd?
Message-Id: <101ahv9bkmt9t49@corp.supernews.com>

In article <101acp4elrcdr76@corp.supernews.com>, Mike wrote:
> I'm trying something I've done many times before, but it's not
> working this time. I have a simple script, even a bourne shell
> script, to run under inetd and print an uptime to stdout and
> to the connecting telnet (I'm on AIX). For the perl I have used
> select(STDOUT); and $| = 1;, but stuff isn't working. Has anyone
> run across this? I've done this before without problems, but
> no on this specific server. What's going on?
> 
> Mike

I have it working now. The only things I really changed is
changing in /etc/inetd.conf the 'wait' to 'nowait' and instead
of 'kill -1 inetd.pid' I fully killed and restarted inetd.

Mike


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc.  For subscription or unsubscription requests, send
#the single line:
#
#	subscribe perl-users
#or:
#	unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.  

NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice. 

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V10 Issue 6031
***************************************


home help back first fref pref prev next nref lref last post