[30416] in Perl-Users-Digest


Perl-Users Digest, Issue: 1659 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Fri Jun 20 00:09:42 2008

Date: Thu, 19 Jun 2008 21:09:04 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 19 Jun 2008     Volume: 11 Number: 1659

Today's topics:
        cockroach race: grep for characters in any order <benkasminbullock@gmail.com>
    Re: cockroach race: grep for characters in any order <benkasminbullock@gmail.com>
    Re: extract all hotmail email addresses in a file and s vippstar@gmail.com
    Re: extract all hotmail email addresses in a file and s vippstar@gmail.com
    Re: extract all hotmail email addresses in a file and s <toe@lavabit.com>
    Re: extract all hotmail email addresses in a file and s <bc@freeuk.com>
    Re: extract all hotmail email addresses in a file and s <santosh.k83@gmail.com>
    Re: extract all hotmail email addresses in a file and s <pfiland@mindspring.com>
        Printing Problems <akhilshri@gmail.com>
    Re: Printing Problems (Jens Thoms Toerring)
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Thu, 19 Jun 2008 12:10:34 +0000 (UTC)
From: Ben Bullock <benkasminbullock@gmail.com>
Subject: cockroach race: grep for characters in any order
Message-Id: <g3dibq$fhf$1@ml.accsnet.ne.jp>

Following on from the discussion about finding all of a set of
characters in a string, here is a "cockroach race" I've made to see
which solution is faster. I ran this on Perl 5.10, so you might get
different results with some other version.

#!/usr/local/bin/perl
use warnings;
use strict;
use List::MoreUtils qw/all/;

sub bullock
{
    my ($s, @chars) = @_;
    my $anychar = join '', @chars;
    my $matchany = join '.*',("[$anychar]") x @chars;
    if ($s =~ /$matchany/) {
	my $copy = $s;
	for my $c (@chars) {
	    return 0 unless $copy =~ s/$c//g;
	}
	return 1;
    }
    return 0;
}

sub lalli
{
    my ($s, @chars) = @_;
    return all { $s =~ /$_/ } @chars;
}

sub jackman
{
    my ($s, @chars) = @_;
    my @re;
    for (@chars) {
	my $re = $_;
	for my $c (@chars) {
	    next if $c eq $_;
	    $re .= "(?=.*$c)";
	}
	push @re, $re;
    }
    my $re = join '|', @re;
    return $s =~ /$re/;
}

sub j_index
{
    my ($s, @chars) = @_;
    my $matched = 1;
    for my $char (@chars) {
	return 0 if index($s, $char) == -1;
    }
    return 1;
}

sub j_b
{

    my ($s, @chars) = @_;
    my $re;
    for my $c (@chars) {
	$re .= "(?=.*$c)";
    }
    return $s =~ /$re/;
}

sub j_b2
{

    my ($s, @chars) = @_;
    my $re = '^';
    for my $c (@chars) {
	$re .= "(?=.*$c)";
    }
    return $s =~ /$re/;
}

use Benchmark qw( cmpthese );

sub comparethem
{
    my ($text, $chars, $ret) = @_;
    my %chars = map {$_ => 1} split ('', $chars);
    my @chars = sort keys %chars;
     my $count = 100000;
     cmpthese $count, 
     {
      'bullock' => sub {bullock($text,@chars) == $ret or die '1'},
      'lalli'   => sub {lalli($text,@chars) == $ret or die '2'},
      'jackman' => sub {jackman($text,@chars) == $ret or die '3'},
      'j_index' => sub {j_index($text,@chars) == $ret or die '4'},
      'j_b' => sub {j_b($text,@chars) == $ret or die '5'},
      'j_b2' => sub {j_b($text,@chars) == $ret or die '6'},
  };
}

comparethem ("naninuneno", 'aiueo', 1);
comparethem ("naninuneni", 'aiueo', 0);
comparethem ("naninuneni", 'abcdefghijklmnopqrstuvwxyz', 0);
comparethem ('abcdefghijklmnopqrstuvwxyz'x100, 'aeiou', 1);

__END__

The clear winner on all four tests is Glenn Jackman's version
using "index":

            Rate bullock jackman   lalli     j_b    j_b2 j_index
bullock  20000/s      --    -17%    -41%    -74%    -74%    -84%
jackman  24213/s     21%      --    -29%    -68%    -68%    -80%
lalli    34014/s     70%     40%      --    -55%    -55%    -72%
j_b      75758/s    279%    213%    123%      --     -0%    -38%
j_b2     75758/s    279%    213%    123%      0%      --    -38%
j_index 121951/s    510%    404%    259%     61%     61%      --

            Rate jackman bullock   lalli     j_b    j_b2 j_index
jackman  21053/s      --    -10%    -37%    -72%    -72%    -85%
bullock  23419/s     11%      --    -30%    -69%    -69%    -83%
lalli    33670/s     60%     44%      --    -56%    -56%    -76%
j_b      75758/s    260%    223%    125%      --     -0%    -45%
j_b2     75758/s    260%    223%    125%      0%      --    -45%
j_index 138889/s    560%    493%    313%     83%     83%      --

           Rate jackman    j_b2     j_b   lalli bullock j_index
jackman  1386/s      --    -93%    -94%    -95%    -95%    -97%
j_b2    20921/s   1409%      --     -4%    -18%    -25%    -52%
j_b     21786/s   1472%      4%      --    -15%    -22%    -50%
lalli   25641/s   1750%     23%     18%      --     -8%    -42%
bullock 27855/s   1910%     33%     28%      9%      --    -36%
j_index 43860/s   3064%    110%    101%     71%     57%      --

           Rate bullock jackman   lalli    j_b2     j_b j_index
bullock  3557/s      --    -78%    -86%    -89%    -89%    -96%
jackman 16234/s    356%      --    -35%    -50%    -50%    -83%
lalli   25063/s    605%     54%      --    -23%    -24%    -74%
j_b2    32680/s    819%    101%     30%      --     -0%    -66%
j_b     32787/s    822%    102%     31%      0%      --    -66%
j_index 95238/s   2577%    487%    280%    191%    190%      --

However, second place is not so clear. Despite what Bart Lateur
thought, there is no difference between the performance of the
anchored and unanchored regexes (j_b and j_b2 above). The solution I
posted initially fails badly on a long input string (the fourth test).
The regex solution posted by Glenn Jackman fails particularly badly on
a long list of characters (the third test), because of the O(n^2) size
of the regex. However, the thing I initially posted does miles better
with a long list of characters, coming second behind j_index.
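
To make the O(n^2) remark concrete, here is a minimal sketch that only builds the two kinds of pattern and reports their lengths. The helper names jackman_pattern and j_b_pattern are made up for this illustration, and unlike the benchmarked subs they pass each character through quotemeta.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern in the style of jackman(): one alternative per character,
# each followed by a lookahead for every other character.
sub jackman_pattern {
    my @chars = @_;
    my @alts;
    for my $first (@chars) {
        my $alt = quotemeta $first;
        for my $c (@chars) {
            next if $c eq $first;
            $alt .= '(?=.*' . quotemeta($c) . ')';
        }
        push @alts, $alt;
    }
    return join '|', @alts;
}

# Pattern in the style of j_b(): a single lookahead per character.
sub j_b_pattern {
    my @chars = @_;
    return join '', map { '(?=.*' . quotemeta($_) . ')' } @chars;
}

# Report how the pattern length grows with the number of characters.
for my $n (5, 10, 26) {
    my @chars = ('a' .. 'z')[0 .. $n - 1];
    printf "%2d chars: jackman %5d bytes, j_b %4d bytes\n",
        $n, length(jackman_pattern(@chars)), length(j_b_pattern(@chars));
}
```

The jackman-style pattern holds n alternatives of n-1 lookaheads each, so its length grows with the square of the character count, while the j_b-style pattern grows linearly.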




------------------------------

Date: Thu, 19 Jun 2008 12:54:09 +0000 (UTC)
From: Ben Bullock <benkasminbullock@gmail.com>
Subject: Re: cockroach race: grep for characters in any order
Message-Id: <g3dkth$fu2$1@ml.accsnet.ne.jp>

On Thu, 19 Jun 2008 12:10:34 +0000, Ben Bullock wrote:

>       'j_index' => sub {j_index($text,@chars) == $ret or die '4'},
>       'j_b' => sub {j_b($text,@chars) == $ret or die '5'},
>       'j_b2' => sub {j_b($text,@chars) == $ret or die '6'},
                          ^
                          2

Ooops!


> However, second place is not so clear. Despite what Bart Lateur
> thought, there is no difference between the performance of the
> anchored and unanchored regexes (j_b and j_b2 above).

As Bart Lateur thought, adding the anchor ^ increased the speed quite a 
bit, especially with the negative matches (the second and third ones):

            Rate     j_b    j_b2 j_index
j_b      72993/s      --     -7%    -42%
j_b2     78740/s      8%      --    -37%
j_index 125000/s     71%     59%      --

            Rate     j_b    j_b2 j_index
j_b      70423/s      --    -11%    -49%
j_b2     79365/s     13%      --    -43%
j_index 138889/s     97%     75%      --

           Rate     j_b    j_b2 j_index
j_b     20704/s      --    -18%    -54%
j_b2    25126/s     21%      --    -45%
j_index 45455/s    120%     81%      --

           Rate     j_b    j_b2 j_index
j_b     31348/s      --     -3%    -62%
j_b2    32468/s      4%      --    -61%
j_index 83333/s    166%    157%      --
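
For anyone who wants to rerun just this comparison, a minimal sketch with the 'j_b2' entry actually calling j_b2 might look like the following. The iteration count of 20000 and the use of quotemeta are choices made for this sketch, not taken from the original script. One plausible reason for the gap on failing matches is that the unanchored pattern gets retried at later starting positions, while the anchored one is only tried at the start of the string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Unanchored: one lookahead per character, tried wherever the engine
# chooses to start a match attempt.
sub j_b {
    my ($s, @chars) = @_;
    my $re = join '', map { '(?=.*' . quotemeta($_) . ')' } @chars;
    return $s =~ /$re/ ? 1 : 0;
}

# Anchored: the same lookaheads, but only tried at the start of the string.
sub j_b2 {
    my ($s, @chars) = @_;
    my $re = '^' . join '', map { '(?=.*' . quotemeta($_) . ')' } @chars;
    return $s =~ /$re/ ? 1 : 0;
}

my @chars = qw(a e i o u);
for my $text ("naninuneno", "naninuneni") {
    # Both versions must agree before timing them means anything.
    j_b($text, @chars) == j_b2($text, @chars) or die "mismatch on $text";
    cmpthese 20000, {
        'j_b'  => sub { j_b($text, @chars) },
        'j_b2' => sub { j_b2($text, @chars) },   # calls j_b2 this time
    };
}
```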



------------------------------

Date: Thu, 19 Jun 2008 03:55:14 -0700 (PDT)
From: vippstar@gmail.com
Subject: Re: extract all hotmail email addresses in a file and store in  separate file
Message-Id: <c19ed300-b766-472c-8c55-5235bf644cd1@8g2000hse.googlegroups.com>

On Jun 19, 12:55 pm, "Bartc" <b...@freeuk.com> wrote:
> <vipps...@gmail.com> wrote in message
>
> news:569670a8-4f4d-4101-ab7c-bcc50625ad94@l64g2000hse.googlegroups.com...
>
>
>
> > On Jun 19, 12:13 pm, "Bartc" <b...@freeuk.com> wrote:
> >> "Marc Bissonnette" <dragnet\_@_/internalysis.com> wrote in message
>
> >>news:Xns9AC29F465890dragnetinternalysisc@216.196.97.131...
>
> >> > pete <pfil...@mindspring.com> fell face-first on the keyboard. This was
> >> > the result:news:oeadnWrGl4RyQMTVnZ2dnUVZ_s7inZ2d@earthlink.com:
>
> >> >> Dennis wrote:
> >> >>> Hi, I have a text file that contains a list of email addresses like
> >> >>> this:
> >> >> /* BEGIN new.c output */
> >> >><snip 250+ lines of C >
> >> > Wow - All that just to separate @hotmail.com from anything else ? I'm
> >> > glad I stuck with perl :)
>
> >> I think pete just enjoys writing huge amounts of C code. Or showing off..
> > Or using concrete functions he has written in the past to write
> > concrete programs.
>
> I thought it was some sort of unwritten rule here that when posting code
> solutions you tend not to import large elements of your own library.
> Otherwise everyone would post their own different version of getline() and
> so on.
There's no such rule.
> And also there's the possibility, as seems to have happened here, of using
> something inappropriate just because it's there. There's no reason at all to
> use a linked list to read all the input into memory (and risking
> out-of-memory or thrashing for large input).
What do you mean thrashing? The code risks nothing as all the calls to
malloc, etc are checked.
> (Although I suspect pete may have created this over-the-top solution on
> purpose..)
Yes, presumably the purpose was to provide the newbie with a concrete
example
> > concrete programs.
>
> Which is more concrete, this code which has a memory requirement of N or
> code using fixed memory?
It doesn't matter as long as error checking is there.


------------------------------

Date: Thu, 19 Jun 2008 04:39:41 -0700 (PDT)
From: vippstar@gmail.com
Subject: Re: extract all hotmail email addresses in a file and store in  separate file
Message-Id: <7115ea76-b76a-4498-9c24-8c7345a5272d@m3g2000hsc.googlegroups.com>

On Jun 19, 2:29 pm, "Bartc" <b...@freeuk.com> wrote:
> vipps...@gmail.com wrote:
> > On Jun 19, 12:55 pm, "Bartc" <b...@freeuk.com> wrote:
> >> <vipps...@gmail.com> wrote in message
> >>...There's no
> >> reason at all to use a linked list to read all the input into memory
> >> (and risking out-of-memory or thrashing for large input).
> > What do you mean thrashing? The code risks nothing as all the calls to
> > malloc, etc are checked.
>
> I mean the slow-down that occurs when memory gets nearly full.
While true, this has nothing to do with C.

> >> Which is more concrete, this code which has a memory requirement of
> >> N or code using fixed memory?
> > It doesn't matter as long as error checking is there.
>
> No, "Sorry out of memory" is just as acceptable as "Task completed"!
A concrete example of code is one that cannot "break", ie behave
unexpectedly.


------------------------------

Date: Thu, 19 Jun 2008 04:59:18 -0700 (PDT)
From: Tomás Ó hÉilidhe <toe@lavabit.com>
Subject: Re: extract all hotmail email addresses in a file and store in  separate file
Message-Id: <e1ae5b4e-85f4-4571-8d65-3153474965fb@m45g2000hsb.googlegroups.com>

On Jun 19, 6:41 am, vipps...@gmail.com wrote:

> That does not strip both of the " characters


Wups, meant to write strcpy(buf,original+1);


> char const is confusing, and the second const is unnecessary.
> fix:
> const char *original = "\"...\"";


"char const" is confusing?  :-O

You're right that the second const is unnecessary, just like my
breakfast this morning was unnecessary. I


> @ does not belong to C's basic character set, so, that's not possible.


I had a feeling it mightn't be.

One might argue that if you're dealing with strings that have an @
symbol in them on a particular platform, that the compiler for that
platform will have the @ character.



------------------------------

Date: Thu, 19 Jun 2008 11:29:33 GMT
From: "Bartc" <bc@freeuk.com>
Subject: Re: extract all hotmail email addresses in a file and store in separate file
Message-Id: <xqr6k.12114$E41.2265@text.news.virginmedia.com>

vippstar@gmail.com wrote:
> On Jun 19, 12:55 pm, "Bartc" <b...@freeuk.com> wrote:
>> <vipps...@gmail.com> wrote in message

>>...There's no
>> reason at all to use a linked list to read all the input into memory
>> (and risking out-of-memory or thrashing for large input).

> What do you mean thrashing? The code risks nothing as all the calls to
> malloc, etc are checked.

I mean the slow-down that occurs when memory gets nearly full.

>> Which is more concrete, this code which has a memory requirement of
>> N or code using fixed memory?
> It doesn't matter as long as error checking is there.

No, "Sorry out of memory" is just as acceptable as "Task completed"!

-- 
bartc 




------------------------------

Date: Thu, 19 Jun 2008 18:34:55 +0530
From: santosh <santosh.k83@gmail.com>
Subject: Re: extract all hotmail email addresses in a file and store in separate file
Message-Id: <g3dli3$rk9$1@registered.motzarella.org>

Bartc wrote:

> 
> <vippstar@gmail.com> wrote in message
>
news:569670a8-4f4d-4101-ab7c-bcc50625ad94@l64g2000hse.googlegroups.com...
>> On Jun 19, 12:13 pm, "Bartc" <b...@freeuk.com> wrote:
>>> "Marc Bissonnette" <dragnet\_@_/internalysis.com> wrote in message
>>>
>>> news:Xns9AC29F465890dragnetinternalysisc@216.196.97.131...
>>>
>>> > pete <pfil...@mindspring.com> fell face-first on the keyboard.
>>> > This was the
>>> > result:news:oeadnWrGl4RyQMTVnZ2dnUVZ_s7inZ2d@earthlink.com:
>>>
>>> >> Dennis wrote:
>>> >>> Hi, I have a text file that contains a list of email addresses
>>> >>> like this:
>>> >> /* BEGIN new.c output */
>>> >><snip 250+ lines of C >
>>> > Wow - All that just to separate @hotmail.com from anything else ?
>>> > I'm glad I stuck with perl :)
>>>
>>> I think pete just enjoys writing huge amounts of C code. Or showing
>>> off..
> 
>> Or using concrete functions he has written in the past to write
>> concrete programs.
> 
> I thought it was some sort of unwritten rule here that when posting
> code solutions you tend not to import large elements of your own
> library. Otherwise everyone would post their own different version of
> getline() and so on.

As it is, everyone does post different versions of code for the same
task (as this thread itself has brilliantly illustrated), so as long as
the post contains all the code to compile into a working program in a
self-sufficient manner, I don't see any harm in including something
from a personal library.

And pete has pre-written functions to read files into linked-lists. He
occasionally posts a link here in clc to his website, which contains
this and other C code.

> And also there's the possibility, as seems to have happened here, of
> using something inappropriate just because it's there. There's no
> reason at all to use a linked list to read all the input into memory
> (and risking out-of-memory or thrashing for large input).

Well reading a file into a linked-list isn't exactly inappropriate, but
it may be overkill for the small fragment that the OP posted. But it
could be that the OP's actual file contains hundreds or thousands of
email addresses. Constructing a linked-list will obviously take more
storage than a plain linear array, but it makes some tasks like sorting
lines, inserting lines, deleting lines, etc., much easier. I
suspect that this is the reason why pete uses them.

> (Although I suspect pete may have created this over-the-top solution
> on purpose..)

Hmm.

>> concrete programs.
> 
> Which is more concrete, this code which has a memory requirement of N
> or code using fixed memory?

Either code could run out of memory on a sufficiently memory-starved
system. Besides the linked-list approach has other advantages (which
may not be very pertinent to the particular task the OP wanted) which
must be considered in a fair comparison.
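
For comparison, and since this digest is the Perl side of the cross-post, here is a minimal sketch of the fixed-memory approach: read one line, decide, write, repeat. It assumes one address per line, optionally wrapped in double quotes as in pete's STRINGS macro. The sub name copy_hotmail and the in-memory filehandles are inventions of this sketch, and the pattern is deliberately loose (it would also accept something like hotmail.com.au).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy every line mentioning a hotmail.com address from $in to $out,
# holding only one line in memory at a time. Returns the number copied.
sub copy_hotmail {
    my ($in, $out) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        next unless $line =~ /\@hotmail\.com\b/i;
        print {$out} $line;
        $count++;
    }
    return $count;
}

# In-memory filehandles stand in for the real input and output files.
my $input = join '', map { qq{"$_"\n} }
    qw(foo@yahoo.com tom@hotmail.com jerry@gmail.com tommy@apple.com);
my $result = '';
open my $in,  '<', \$input  or die "open input: $!";
open my $out, '>', \$result or die "open output: $!";

my $n = copy_hotmail($in, $out);
close $out;
print "$n hotmail address(es):\n$result";
```

Memory use here does not depend on the size of the file, which is the trade-off against the linked-list version: nothing is kept around for sorting or editing afterwards.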



------------------------------

Date: Thu, 19 Jun 2008 08:25:22 -0400
From: pete <pfiland@mindspring.com>
Subject: Re: extract all hotmail email addresses in a file and store in separate file
Message-Id: <kNGdnUWaH-muzcfVnZ2dnUVZ_ozinZ2d@earthlink.com>

Bartc wrote:
> "Marc Bissonnette" <dragnet\_@_/internalysis.com> wrote in message 
> news:Xns9AC29F465890dragnetinternalysisc@216.196.97.131...
>> pete <pfiland@mindspring.com> fell face-first on the keyboard. This was
>> the result: news:oeadnWrGl4RyQMTVnZ2dnUVZ_s7inZ2d@earthlink.com:
>>
>>> Dennis wrote:
>>>> Hi, I have a text file that contains a list of email addresses like
>>>> this:
> 
>>> /* BEGIN new.c output */
> 
>>> <snip 250+ lines of C >
> 
>> Wow - All that just to separate @hotmail.com from anything else ? I'm
>> glad I stuck with perl :)
> 
> I think pete just enjoys writing huge amounts of C code. Or showing off..

I can see why you might think that.

> I thought my 50-line answer (posted to comp.lang.c only) might have been a 
> bit long because it didn't make clever use of scanf(), but at least it could 
> deal with /any number/ of email addresses from a file.
> 
> This code I /think/ only deals with the 4 email addresses in the OP's 
> example..

It deals with how many and whichever string literals
are placed into this macro:

#define STRINGS                                 \
{  "\"foo@yahoo.com\"", "\"tom@hotmail.com\"",  \
  "\"jerry@gmail.com\"", "\"tommy@apple.com\""}

The program uses the STRINGS macro to initialize the input file.

-- 
pete


------------------------------

Date: Thu, 19 Jun 2008 03:14:22 -0700 (PDT)
From: dakin999 <akhilshri@gmail.com>
Subject: Printing Problems
Message-Id: <ea02ae6f-1017-4ee8-bc20-c9d1b85dc698@d19g2000prm.googlegroups.com>

Hi,

I have the following code, which mostly works. It does the following:

1. reads data from an input file
2. puts the data into separate variables in an array
3. reads from this array and prints out to another file

It works except that it prints the same record 4 times. I can see I
have missed something in my array definition: as there are 4 elements
in the array, it prints each element 4 times and then moves to the next
element until it reaches eof().


while (<input>)  #reading a line from file
# Read the line into a set of variables
   ($1,$2,$3,$4)=split(/,/,$_);
 ....
 ....
# Build an array with these variables
   my @array = ([$1, $2, $3, $4]);
   foreach my $r(@array) {
   foreach (@$r){

 ...     print <out> "$1\n";
       print <out> "$2\n";
       print <out> "$3\n";
       print <out> "$4\n";
       print <out> "\n";


The output looks like this:

yellow
blue
orange
red

yellow
blue
orange
red

yellow
blue
orange
red

yellow
blue
orange
red

black
white
red
pink

black
white
red
pink

black
white
red
pink

black
white
red
pink

Clearly it should just print one time and go to the next record....

Please suggest.


------------------------------

Date: 19 Jun 2008 12:25:04 GMT
From: jt@toerring.de (Jens Thoms Toerring)
Subject: Re: Printing Problems
Message-Id: <6bv1h0F3dsuh6U1@mid.uni-berlin.de>

dakin999 <akhilshri@gmail.com> wrote:
> Hi,

> I have the following code, which mostly works. It does the following:

> 1. reads data from an input file
> 2. puts the data into separate variables in an array
> 3. reads from this array and prints out to another file

> It works except that it prints the same record 4 times. I can see I
> have missed something in my array definition: as there are 4 elements
> in the array, it prints each element 4 times and then moves to the next
> element until it reaches eof().

> while (<input>)  #reading a line from file
> # Read the line into a set of variables
>    ($1,$2,$3,$4)=split(/,/,$_);

This can't be your real program since $1, $2 etc. are read-only
variables. That makes it difficult to guess what you're really
doing...
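
A minimal sketch of that point, assuming a comma-separated line like the ones in the output above: the assignment to $1..$4 is wrapped in eval so the run-time error can be shown, and the lexical names $first through $fourth are purely illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "yellow,blue,orange,red";

# Assigning to the capture variables fails at run time, so the
# attempt is wrapped in eval to capture the error message.
my $ok  = eval { ($1, $2, $3, $4) = split /,/, $line; 1 };
my $err = $ok ? '' : $@;
print "assignment to \$1..\$4 failed: $err" unless $ok;

# Ordinary lexical variables (names are illustrative) work as intended.
my ($first, $second, $third, $fourth) = split /,/, $line;
print "$first $second $third $fourth\n";
```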

> ....
> ....
> # Build an array with these variables
>    my @array = ([$1, $2, $3, $4]);
>    foreach my $r(@array) {
>    foreach (@$r){

> ...     print <out> "$1\n";
>        print <out> "$2\n";
>        print <out> "$3\n";
>        print <out> "$4\n";
>        print <out> "\n";

What do you expect $1, $2 etc. to be set to here? And the '<>'
around 'out' also can't be right. Please post your real code,
not something you just assume to have some resemblance to
your code.

Why aren't you simply doing something like

use strict;
use warnings;

my ( $input, $out, @arr )  = ( *STDIN, *STDOUT );

push @arr, [ split /,/, $_  ] while <$input>;

for my $r ( @arr ) {
    print $out "$_\n" for @$r;
    print $out "\n";
}

or, even simpler, if you don't want to save the data you read from
the input in an array:

use strict;
use warnings;

my ( $input, $out )  = ( *STDIN, *STDOUT );

while ( my $line = <$input> ) {
    print $out "$_\n" for split /,/, $line;
    print $out "\n";
}
                             Regards, Jens
-- 
  \   Jens Thoms Toerring  ___      jt@toerring.de
   \__________________________      http://toerring.de
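
As for where the four copies may come from: the posted fragment is not the real program, so the following is only a guess at its shape. It assumes the inner foreach walks the four fields of the record while its body prints the whole record, which would produce the record once per field. The sub names print_nested and print_once are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical reconstruction of the posted loops: the inner foreach
# runs once per field, and its body prints the whole record each time.
sub print_nested {
    my ($out, @fields) = @_;
    my @array = ([@fields]);
    foreach my $r (@array) {
        foreach (@$r) {
            print {$out} join("\n", @fields), "\n\n";
        }
    }
}

# Printing the record once needs no inner loop over the fields.
sub print_once {
    my ($out, @fields) = @_;
    print {$out} join("\n", @fields), "\n\n";
}

my @record = split /,/, "yellow,blue,orange,red";
my ($nested, $once) = ('', '');

open my $fh1, '>', \$nested or die "open: $!";
print_nested($fh1, @record);
close $fh1;

open my $fh2, '>', \$once or die "open: $!";
print_once($fh2, @record);
close $fh2;

# Count how often the first field shows up on a line of its own.
my $nested_count = () = $nested =~ /^yellow$/mg;
my $once_count   = () = $once   =~ /^yellow$/mg;
print "nested loops printed the record $nested_count time(s)\n";
print "single print emitted the record $once_count time(s)\n";
```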


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc.  For subscription or unsubscription requests, send
#the single line:
#
#	subscribe perl-users
#or:
#	unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.  

NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice. 

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 1659
***************************************

