[30416] in Perl-Users-Digest
Perl-Users Digest, Issue: 1659 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Fri Jun 20 00:09:42 2008
Date: Thu, 19 Jun 2008 21:09:04 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Thu, 19 Jun 2008 Volume: 11 Number: 1659
Today's topics:
cockroach race: grep for characters in any order <benkasminbullock@gmail.com>
Re: cockroach race: grep for characters in any order <benkasminbullock@gmail.com>
Re: extract all hotmail email addresses in a file and s vippstar@gmail.com
Re: extract all hotmail email addresses in a file and s vippstar@gmail.com
Re: extract all hotmail email addresses in a file and s <toe@lavabit.com>
Re: extract all hotmail email addresses in a file and s <bc@freeuk.com>
Re: extract all hotmail email addresses in a file and s <santosh.k83@gmail.com>
Re: extract all hotmail email addresses in a file and s <pfiland@mindspring.com>
Printing Problems <akhilshri@gmail.com>
Re: Printing Problems (Jens Thoms Toerring)
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Thu, 19 Jun 2008 12:10:34 +0000 (UTC)
From: Ben Bullock <benkasminbullock@gmail.com>
Subject: cockroach race: grep for characters in any order
Message-Id: <g3dibq$fhf$1@ml.accsnet.ne.jp>
Following on from the discussion about finding all of a set of
characters in a string, here is a "cockroach race" I've made to see
which solution is faster. I ran this on Perl 5.10, so you might get
different results with some other version.
#!/usr/local/bin/perl
use warnings;
use strict;
use List::MoreUtils qw/all/;
sub bullock
{
    my ($s, @chars) = @_;
    my $anychar = join '', @chars;
    my $matchany = join '.*', ("[$anychar]") x @chars;
    if ($s =~ /$matchany/) {
        my $copy = $s;
        for my $c (@chars) {
            return 0 unless $copy =~ s/$c//g;
        }
        return 1;
    }
    return 0;
}
sub lalli
{
    my ($s, @chars) = @_;
    return all { $s =~ /$_/ } @chars;
}
sub jackman
{
    my ($s, @chars) = @_;
    my @re;
    for (@chars) {
        my $re = $_;
        for my $c (@chars) {
            next if $c eq $_;
            $re .= "(?=.*$c)";
        }
        push @re, $re;
    }
    my $re = join '|', @re;
    return $s =~ /$re/;
}
sub j_index
{
    my ($s, @chars) = @_;
    my $matched = 1;
    for my $char (@chars) {
        return 0 if index($s, $char) == -1;
    }
    return 1;
}
sub j_b
{
    my ($s, @chars) = @_;
    my $re;
    for my $c (@chars) {
        $re .= "(?=.*$c)";
    }
    return $s =~ /$re/;
}
sub j_b2
{
    my ($s, @chars) = @_;
    my $re = '^';
    for my $c (@chars) {
        $re .= "(?=.*$c)";
    }
    return $s =~ /$re/;
}
use Benchmark qw( cmpthese );
sub comparethem
{
    my ($text, $chars, $ret) = @_;
    my %chars = map {$_ => 1} split ('', $chars);
    my @chars = sort keys %chars;
    my $count = 100000;
    cmpthese $count,
    {
        'bullock' => sub {bullock($text,@chars) == $ret or die '1'},
        'lalli' => sub {lalli($text,@chars) == $ret or die '2'},
        'jackman' => sub {jackman($text,@chars) == $ret or die '3'},
        'j_index' => sub {j_index($text,@chars) == $ret or die '4'},
        'j_b' => sub {j_b($text,@chars) == $ret or die '5'},
        'j_b2' => sub {j_b($text,@chars) == $ret or die '6'},
    };
}
comparethem ("naninuneno", 'aiueo', 1);
comparethem ("naninuneni", 'aiueo', 0);
comparethem ("naninuneni", 'abcdefghijklmnopqrstuvwxyz', 0);
comparethem ('abcdefghijklmnopqrstuvwxyz'x100, 'aeiou', 1);
__END__
The clear winner on all four tests is Glenn Jackman's version
using "index":
            Rate bullock jackman   lalli     j_b    j_b2 j_index
bullock  20000/s      --    -17%    -41%    -74%    -74%    -84%
jackman  24213/s     21%      --    -29%    -68%    -68%    -80%
lalli    34014/s     70%     40%      --    -55%    -55%    -72%
j_b      75758/s    279%    213%    123%      --     -0%    -38%
j_b2     75758/s    279%    213%    123%      0%      --    -38%
j_index 121951/s    510%    404%    259%     61%     61%      --

            Rate jackman bullock   lalli     j_b    j_b2 j_index
jackman  21053/s      --    -10%    -37%    -72%    -72%    -85%
bullock  23419/s     11%      --    -30%    -69%    -69%    -83%
lalli    33670/s     60%     44%      --    -56%    -56%    -76%
j_b      75758/s    260%    223%    125%      --     -0%    -45%
j_b2     75758/s    260%    223%    125%      0%      --    -45%
j_index 138889/s    560%    493%    313%     83%     83%      --

            Rate jackman    j_b2     j_b   lalli bullock j_index
jackman   1386/s      --    -93%    -94%    -95%    -95%    -97%
j_b2     20921/s   1409%      --     -4%    -18%    -25%    -52%
j_b      21786/s   1472%      4%      --    -15%    -22%    -50%
lalli    25641/s   1750%     23%     18%      --     -8%    -42%
bullock  27855/s   1910%     33%     28%      9%      --    -36%
j_index  43860/s   3064%    110%    101%     71%     57%      --

            Rate bullock jackman   lalli    j_b2     j_b j_index
bullock   3557/s      --    -78%    -86%    -89%    -89%    -96%
jackman  16234/s    356%      --    -35%    -50%    -50%    -83%
lalli    25063/s    605%     54%      --    -23%    -24%    -74%
j_b2     32680/s    819%    101%     30%      --     -0%    -66%
j_b      32787/s    822%    102%     31%      0%      --    -66%
j_index  95238/s   2577%    487%    280%    191%    190%      --
However, second place is not so clear. Despite what Bart Lateur
thought, there is no difference between the performance of the
anchored and unanchored regexes (j_b and j_b2 above). The solution I
posted initially fails badly on a long input string. The regex solution
posted by Glenn Jackman fails particularly badly on a long list of
characters, because of the O(n^2) size of the regex. However, the
thing I initially posted does miles better with a long list of
characters, coming a close second.
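To make the O(n^2) point concrete, here is a minimal sketch that rebuilds
the two patterns the same way the jackman and j_b subs do and prints their
lengths. The helper names jackman_pattern and j_b_pattern are made up for
this illustration; only the pattern-building loops come from the benchmark
code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Alternation of lookahead chains, built as in the "jackman" sub:
# one alternative per character, each listing all the other characters.
sub jackman_pattern {
    my @chars = @_;
    my @alts;
    for my $first (@chars) {
        my $alt = $first;
        for my $c (@chars) {
            next if $c eq $first;
            $alt .= "(?=.*$c)";
        }
        push @alts, $alt;
    }
    return join '|', @alts;
}

# Single chain of lookaheads, built as in the "j_b" sub.
sub j_b_pattern {
    my @chars = @_;
    return join '', map { "(?=.*$_)" } @chars;
}

for my $n (5, 26) {
    my @chars = ('a' .. 'z')[0 .. $n - 1];
    printf "%2d chars: jackman pattern %4d characters, j_b pattern %3d characters\n",
        $n, length(jackman_pattern(@chars)), length(j_b_pattern(@chars));
}
```

Each lookahead is 7 characters long, so the j_b pattern grows linearly
(7n), while the jackman pattern has n alternatives of n-1 lookaheads each
and therefore grows roughly as 7n^2.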
------------------------------
Date: Thu, 19 Jun 2008 12:54:09 +0000 (UTC)
From: Ben Bullock <benkasminbullock@gmail.com>
Subject: Re: cockroach race: grep for characters in any order
Message-Id: <g3dkth$fu2$1@ml.accsnet.ne.jp>
On Thu, 19 Jun 2008 12:10:34 +0000, Ben Bullock wrote:
> 'j_index' => sub {j_index($text,@chars) == $ret or die '4'},
> 'j_b' => sub {j_b($text,@chars) == $ret or die '5'},
> 'j_b2' => sub {j_b($text,@chars) == $ret or die '6'},
                    ^
                    2
Ooops!
> However, second place is not so clear. Despite what Bart Lateur
> thought, there is no difference between the performance of the
> anchored and unanchored regexes (j_b and j_b2 above).
As Bart Lateur thought, adding the anchor ^ increased the speed quite a
bit, especially with the negative matches (the second and third ones):
            Rate     j_b    j_b2 j_index
j_b      72993/s      --     -7%    -42%
j_b2     78740/s      8%      --    -37%
j_index 125000/s     71%     59%      --

            Rate     j_b    j_b2 j_index
j_b      70423/s      --    -11%    -49%
j_b2     79365/s     13%      --    -43%
j_index 138889/s     97%     75%      --

            Rate     j_b    j_b2 j_index
j_b      20704/s      --    -18%    -54%
j_b2     25126/s     21%      --    -45%
j_index  45455/s    120%     81%      --

            Rate     j_b    j_b2 j_index
j_b      31348/s      --     -3%    -62%
j_b2     32468/s      4%      --    -61%
j_index  83333/s    166%    157%      --
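A likely explanation for the gap on negative matches is that an
unanchored pattern which fails at the first position is retried from later
start positions, whereas the anchored one can give up after a single
attempt. The sketch below isolates that effect. It is not the benchmark
above: the helper name lookahead_re is hypothetical, the patterns are
compiled once with qr// instead of being rebuilt on every call, and the
test string is an arbitrary choice. Timings will differ by Perl version
and machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build the chain of lookaheads once; $anchor is either '' or '^'.
sub lookahead_re {
    my ($anchor, @chars) = @_;
    my $chain = join '', map { "(?=.*$_)" } @chars;
    return qr/$anchor$chain/;
}

my @chars      = qw(a e i o u);
my $unanchored = lookahead_re('',  @chars);
my $anchored   = lookahead_re('^', @chars);

# This string contains no "o", so both patterns must fail to match.
my $text = 'naninuneni' x 10;

# Run each candidate for at least one CPU second and print a comparison.
cmpthese(-1, {
    unanchored => sub { $text =~ $unanchored and die 'unexpected match' },
    anchored   => sub { $text =~ $anchored   and die 'unexpected match' },
});
```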
------------------------------
Date: Thu, 19 Jun 2008 03:55:14 -0700 (PDT)
From: vippstar@gmail.com
Subject: Re: extract all hotmail email addresses in a file and store in separate file
Message-Id: <c19ed300-b766-472c-8c55-5235bf644cd1@8g2000hse.googlegroups.com>
On Jun 19, 12:55 pm, "Bartc" <b...@freeuk.com> wrote:
> <vipps...@gmail.com> wrote in message
>
> news:569670a8-4f4d-4101-ab7c-bcc50625ad94@l64g2000hse.googlegroups.com...
>
>
>
> > On Jun 19, 12:13 pm, "Bartc" <b...@freeuk.com> wrote:
> >> "Marc Bissonnette" <dragnet\_@_/internalysis.com> wrote in message
>
> >>news:Xns9AC29F465890dragnetinternalysisc@216.196.97.131...
>
> >> > pete <pfil...@mindspring.com> fell face-first on the keyboard. This was
> >> > the result:news:oeadnWrGl4RyQMTVnZ2dnUVZ_s7inZ2d@earthlink.com:
>
> >> >> Dennis wrote:
> >> >>> Hi, I have a text file that contents a list of email addresses like
> >> >>> this:
> >> >> /* BEGIN new.c output */
> >> >><snip 250+ lines of C >
> >> > Wow - All that just to separate @hotmail.com from anything else ? I'm
> >> > glad I stuck with perl :)
>
> >> I think pete just enjoys writing huge amounts of C code. Or showing off..
> > Or using concrete functions he has written in the past to write
> > concrete programs.
>
> I thought it was some sort of unwritten rule here that when posting code
> solutions you tend not to import large elements of your own library.
> Otherwise everyone would post their own different version of getline() and
> so on.
There's no such rule.
> And also there's the possibility, as seems to have happened here, of using
> something inappropriate just because it's there. There's no reason at all to
> use a linked list to read all the input into memory (and risking
> out-of-memory or thrashing for large input).
What do you mean, thrashing? The code risks nothing as all the calls to
malloc, etc. are checked.
> (Although I suspect pete may have created this over-the-top solution on
> purpose..)
Yes, presumably the purpose was to provide the newbie with a concrete
example.
> > concrete programs.
>
> Which is more concrete, this code which has a memory requirement of N or
> code using fixed memory?
It doesn't matter as long as error checking is there.
------------------------------
Date: Thu, 19 Jun 2008 04:39:41 -0700 (PDT)
From: vippstar@gmail.com
Subject: Re: extract all hotmail email addresses in a file and store in separate file
Message-Id: <7115ea76-b76a-4498-9c24-8c7345a5272d@m3g2000hsc.googlegroups.com>
On Jun 19, 2:29 pm, "Bartc" <b...@freeuk.com> wrote:
> vipps...@gmail.com wrote:
> > On Jun 19, 12:55 pm, "Bartc" <b...@freeuk.com> wrote:
> >> <vipps...@gmail.com> wrote in message
> >>...There's no
> >> reason at all to use a linked list to read all the input into memory
> >> (and risking out-of-memory or thrashing for large input).
> > What do you mean, thrashing? The code risks nothing as all the calls to
> > malloc, etc. are checked.
>
> I mean the slow-down that occurs when memory gets nearly full.
While true, this has nothing to do with C.
> >> Which is more concrete, this code which has a memory requirement of
> >> N or code using fixed memory?
> > It doesn't matter as long as error checking is there.
>
> No, "Sorry out of memory" is just as acceptable as "Task completed"!
A concrete example of code is one that cannot "break", i.e. behave
unexpectedly.
------------------------------
Date: Thu, 19 Jun 2008 04:59:18 -0700 (PDT)
From: Tomás Ó hÉilidhe <toe@lavabit.com>
Subject: Re: extract all hotmail email addresses in a file and store in separate file
Message-Id: <e1ae5b4e-85f4-4571-8d65-3153474965fb@m45g2000hsb.googlegroups.com>
On Jun 19, 6:41 am, vipps...@gmail.com wrote:
> That does not strip both of the " characters
Wups, meant to write strcpy(buf,original+1);
> char const is confusing, and the second const is unnecessary.
> fix:
> const char *original = "\"...\"";
"char const" is confusing? :-O
You're right that the second const is unnecessary, just like my
breakfast this morning was unnecessary. I
> @ does not belong to C's basic character set, so, that's not possible.
I had a feeling it mightn't be.
One might argue that if you're dealing with strings that have an @
symbol in them on a particular platform, the compiler for that
platform will have the @ character.
------------------------------
Date: Thu, 19 Jun 2008 11:29:33 GMT
From: "Bartc" <bc@freeuk.com>
Subject: Re: extract all hotmail email addresses in a file and store in separate file
Message-Id: <xqr6k.12114$E41.2265@text.news.virginmedia.com>
vippstar@gmail.com wrote:
> On Jun 19, 12:55 pm, "Bartc" <b...@freeuk.com> wrote:
>> <vipps...@gmail.com> wrote in message
>>...There's no
>> reason at all to use a linked list to read all the input into memory
>> (and risking out-of-memory or thrashing for large input).
> What do you mean, thrashing? The code risks nothing as all the calls to
> malloc, etc. are checked.
I mean the slow-down that occurs when memory gets nearly full.
>> Which is more concrete, this code which has a memory requirement of
>> N or code using fixed memory?
> It doesn't matter as long as error checking is there.
No, "Sorry out of memory" is just as acceptable as "Task completed"!
--
bartc
------------------------------
Date: Thu, 19 Jun 2008 18:34:55 +0530
From: santosh <santosh.k83@gmail.com>
Subject: Re: extract all hotmail email addresses in a file and store in separate file
Message-Id: <g3dli3$rk9$1@registered.motzarella.org>
Bartc wrote:
>
> <vippstar@gmail.com> wrote in message
>
> news:569670a8-4f4d-4101-ab7c-bcc50625ad94@l64g2000hse.googlegroups.com...
>> On Jun 19, 12:13 pm, "Bartc" <b...@freeuk.com> wrote:
>>> "Marc Bissonnette" <dragnet\_@_/internalysis.com> wrote in message
>>>
>>> news:Xns9AC29F465890dragnetinternalysisc@216.196.97.131...
>>>
>>> > pete <pfil...@mindspring.com> fell face-first on the keyboard.
>>> > This was the
>>> > result:news:oeadnWrGl4RyQMTVnZ2dnUVZ_s7inZ2d@earthlink.com:
>>>
>>> >> Dennis wrote:
>>> >>> Hi, I have a text file that contents a list of email addresses
>>> >>> like this:
>>> >> /* BEGIN new.c output */
>>> >><snip 250+ lines of C >
>>> > Wow - All that just to separate @hotmail.com from anything else ?
>>> > I'm glad I stuck with perl :)
>>>
>>> I think pete just enjoys writing huge amounts of C code. Or showing
>>> off..
>
>> Or using concrete functions he has written in the past to write
>> concrete programs.
>
> I thought it was some sort of unwritten rule here that when posting
> code solutions you tend not to import large elements of your own
> library. Otherwise everyone would post their own different version of
> getline() and so on.
As it is, everyone does post different versions of code for the same
task (as this thread itself has brilliantly illustrated), so as long as
the post contains all the code to compile into a working program in a
self-sufficient manner, I don't see any harm in including something
from a personal library.
And pete has pre-written functions to read files into linked lists. From
time to time he posts a link here in clc to his website, which contains
this and other C code.
> And also there's the possibility, as seems to have happened here, of
> using something inappropriate just because it's there. There's no
> reason at all to use a linked list to read all the input into memory
> (and risking out-of-memory or thrashing for large input).
Well, reading a file into a linked list isn't exactly inappropriate, but
it may be overkill for the small fragment that the OP posted. But it
could be that the OP's actual file contains hundreds or thousands of
email addresses. Constructing a linked list will obviously take more
storage than a plain linear array, but it makes some tasks like sorting
lines, inserting lines, deleting lines, etc., much easier. I
suspect that this is the reason why pete uses them.
> (Although I suspect pete may have created this over-the-top solution
> on purpose..)
Hmm.
>> concrete programs.
>
> Which is more concrete, this code which has a memory requirement of N
> or code using fixed memory?
Either code could run out of memory on a sufficiently memory-starved
system. Besides, the linked-list approach has other advantages (which
may not be very pertinent to the particular task the OP wanted) which
must be considered in a fair comparison.
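For comparison, the fixed-memory alternative raised in this thread can be
sketched in Perl as a line-by-line filter. This is only an illustration,
not anyone's posted solution: the input format (one double-quoted address
per line) and the sample addresses are assumptions, and in-memory
filehandles stand in for the real input and output files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input: one double-quoted address per line.
my $sample = join '', map { qq{"$_"\n} }
    'foo@yahoo.com', 'tom@hotmail.com', 'jerry@gmail.com', 'tommy@apple.com';

my $hotmail_only = '';
open my $in,  '<', \$sample       or die "cannot read sample: $!";
open my $out, '>', \$hotmail_only or die "cannot open output: $!";

# Only the current line is held in memory at any time.
while ( my $line = <$in> ) {
    # Strip optional surrounding quotes; skip lines that do not fit.
    my ($address) = $line =~ /^"?([^"\s]+)"?\s*$/ or next;
    print {$out} "$address\n" if $address =~ /\@hotmail\.com\z/i;
}
close $out;

print $hotmail_only;
```

Because each line is written out or discarded as soon as it is read,
memory use does not grow with the size of the input.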
------------------------------
Date: Thu, 19 Jun 2008 08:25:22 -0400
From: pete <pfiland@mindspring.com>
Subject: Re: extract all hotmail email addresses in a file and store in separate file
Message-Id: <kNGdnUWaH-muzcfVnZ2dnUVZ_ozinZ2d@earthlink.com>
Bartc wrote:
> "Marc Bissonnette" <dragnet\_@_/internalysis.com> wrote in message
> news:Xns9AC29F465890dragnetinternalysisc@216.196.97.131...
>> pete <pfiland@mindspring.com> fell face-first on the keyboard. This was
>> the result: news:oeadnWrGl4RyQMTVnZ2dnUVZ_s7inZ2d@earthlink.com:
>>
>>> Dennis wrote:
>>>> Hi, I have a text file that contents a list of email addresses like
>>>> this:
>
>>> /* BEGIN new.c output */
>
>>> <snip 250+ lines of C >
>
>> Wow - All that just to separate @hotmail.com from anything else ? I'm
>> glad I stuck with perl :)
>
> I think pete just enjoys writing huge amounts of C code. Or showing off..
I can see why you might think that.
> I thought my 50-line answer (posted to comp.lang.c only) might have been a
> bit long because it didn't make clever use of scanf(), but at least it could
> deal with /any number/ of email addresses from a file.
>
> This code I /think/ only deals with the 4 email addresses in the OP's
> example..
It deals with as many string literals as are placed
into this macro, whatever they are:
#define STRINGS \
{ "\"foo@yahoo.com\"", "\"tom@hotmail.com\"", \
"\"jerry@gmail.com\"", "\"tommy@apple.com\""}
The program uses the STRINGS macro to initialize the input file.
--
pete
------------------------------
Date: Thu, 19 Jun 2008 03:14:22 -0700 (PDT)
From: dakin999 <akhilshri@gmail.com>
Subject: Printing Problems
Message-Id: <ea02ae6f-1017-4ee8-bc20-c9d1b85dc698@d19g2000prm.googlegroups.com>
Hi,
I have the following code, which mostly works. It does the following:
1. reads data from an input file
2. puts the data into separate variables in an array
3. reads from this array and prints out to another file
It works except that it prints the same record 4 times. I can see I
have missed something in my array definition: as there are 4 elements
in the array, it prints each element 4 times and then moves on to the
next element until it reaches eof().
while (<input>) #reading a line from file
# Read the line into a set of variables
($1,$2,$3,$4)=split(/,/,$_);
....
....
# Buid an array with these varaibles
my @array = ([$1, $2, $3, $4]);
foreach my $r(@array) {
foreach (@$r){
... print <out> "$1\n";
print <out> "$2\n";
print <out> "$3\n";
print <out> "$4\n";
print <out> "\n";
The output comes out like this:
yellow
blue
orange
red
yellow
blue
orange
red
yellow
blue
orange
red
yellow
blue
orange
red
black
white
red
pink
black
white
red
pink
black
white
red
pink
black
white
red
pink
Clearly it should just print one time and go to the next record....
Please suggest.
------------------------------
Date: 19 Jun 2008 12:25:04 GMT
From: jt@toerring.de (Jens Thoms Toerring)
Subject: Re: Printing Problems
Message-Id: <6bv1h0F3dsuh6U1@mid.uni-berlin.de>
dakin999 <akhilshri@gmail.com> wrote:
> Hi,
> I have the following code, which mostly works. It does the following:
> 1. reads data from an input file
> 2. puts the data into separate variables in an array
> 3. reads from this array and prints out to another file
> It works except that it prints the same record 4 times. I can see I
> have missed something in my array definition: as there are 4 elements
> in the array, it prints each element 4 times and then moves on to the
> next element until it reaches eof().
> while (<input>) #reading a line from file
> # Read the line into a set of variables
> ($1,$2,$3,$4)=split(/,/,$_);
This can't be your real program since $1, $2 etc. are read-only
variables. That makes it difficult to guess what you're really
doing...
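A minimal sketch of the point about $1, $2 etc., for readers who want to
try it: assigning to the capture variables fails at run time, while
ordinary lexical variables take the split fields without trouble. The
sample line and the variable names are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "yellow,blue,orange,red";

# The capture variables are read-only, so this assignment dies;
# eval traps the error so that the script can carry on.
my $ok = eval { ($1, $2, $3, $4) = split /,/, $line; 1 };
print "assigning to \$1..\$4 failed: $@" unless $ok;

# Lexical variables work as intended.
my ($first, $second, $third, $fourth) = split /,/, $line;
print "$first $second $third $fourth\n";
```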
> ....
> ....
> # Buid an array with these varaibles
> my @array = ([$1, $2, $3, $4]);
> foreach my $r(@array) {
> foreach (@$r){
> ... print <out> "$1\n";
> print <out> "$2\n";
> print <out> "$3\n";
> print <out> "$4\n";
> print <out> "\n";
What do you expect $1, $2 etc. to be set to here? And the '<>'
around 'out' also can't be right. Please post your real code,
not something you just assume to have some resemblance to
your code.
Why aren't you simply doing something like
use strict;
use warnings;
my ( $input, $out, @arr ) = ( *STDIN, *STDOUT );
push @arr, [ split /,/, $_ ] while <$input>;
for my $r ( @arr ) {
    print $out "$_\n" for @$r;
    print $out "\n";
}
or, even simpler, if you don't want to save the data you read from
the input in an array:
use strict;
use warnings;
my ( $input, $out ) = ( *STDIN, *STDOUT );
while ( my $line = <$input> ) {
    print $out "$_\n" for split /,/, $line;
    print $out "\n";
}
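One detail worth noting when trying the loop above: without chomp, the
last field of each line keeps its trailing newline, so an extra blank line
is printed after it. The variant below is a sketch that adds chomp. The
sample data and the in-memory filehandles are assumptions, used in place
of the poster's files so that the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for the poster's input file.
my $data = "yellow,blue,orange,red\nblack,white,red,pink\n";
open my $input, '<', \$data or die "cannot open in-memory input: $!";

my $result = '';
open my $out, '>', \$result or die "cannot open in-memory output: $!";

while ( my $line = <$input> ) {
    chomp $line;                  # drop the newline so the last field is clean
    print {$out} "$_\n" for split /,/, $line;
    print {$out} "\n";            # one blank line after each record
}
close $out;

print $result;
```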
Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc. For subscription or unsubscription requests, send
#the single line:
#
# subscribe perl-users
#or:
# unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.
NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 1659
***************************************