[33076] in Perl-Users-Digest


Perl-Users Digest, Issue: 4352 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Jan 20 00:09:17 2015

Date: Mon, 19 Jan 2015 21:09:02 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Mon, 19 Jan 2015     Volume: 11 Number: 4352

Today's topics:
    Re: fields <jurgenex@hotmail.com>
    Re: fields georgios.mpouras@gmail.com
    Re: fields <rweikusat@mobileactivedefense.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Sun, 18 Jan 2015 18:15:26 -0800
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: fields
Message-Id: <mcoobahb9o9etv87il624mqoiseh6c10cq@4ax.com>

George Mpouras <gravitalsun@hotmail.foo> wrote:
>I want the fastest way to grub the fields from a huge ; delimited file.
>The fields can be randomly quoted or not. 

Obviously there are additional restrictions for your data, otherwise
your solution below would yield wrong results.

>currently I use the following
>
>use strict;
>use warnings;
>
>my $regex_split   = qr/^(.*?);(.*?);(.*?);(.*?)$/o;

So, obviously there are always 4 fields. You forgot to mention that in
your specification.
And the data itself never contains a semicolon. You forgot to mention
that, too.

>my $regex_dequote = qr/^"([^"\\]++|\\.)*+"$/o;

This looks overly complicated but maybe I am just missing the forest for
the trees. Wouldn't a simple	
	my $regex_dequote = qr/^"(.*)"$/o;
work just as well?
Remember: REs are expensive. So keep your REs as simple as possible.

>while (<DATA>) {
>chomp;
>my @col = $_ =~ $regex_split or die;

Given those additional restrictions a simple
	my @col = split /;/ , $_;
will probably be faster because the RE is so much simpler.
If you insist on testing for exactly 4 fields then you can just check
the number of elements in @col.
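
A minimal sketch of that check, assuming (as in the posted regex) exactly
four fields and no semicolons inside the data. The helper name split_fields
and the sample lines are made up for illustration. The -1 limit is there
because split otherwise drops trailing empty fields, so a line like a2;b2;;
would appear to have only two.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line on ';' and insist on exactly
# $want fields. Assumes the data itself never contains a semicolon.
sub split_fields {
    my ($line, $want) = @_;
    # A negative limit keeps trailing empty fields.
    my @col = split /;/, $line, -1;
    return unless @col == $want;    # array in numeric context = element count
    return @col;
}

for my $line ('a1;"b1";c1;d1', 'a2;b2;;', 'only;three;fields') {
    my @col = split_fields($line, 4);
    print @col ? join('|', @col) : "bad line: $line", "\n";
}
```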

However, in any case: I doubt that splitting the file into its lines and
individual fields is actually the bottleneck. Reading the file from an
external device like a HD is probably much slower than such simple text
operations.
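
One way to check is to time the parsing alone, on lines already in memory,
so that disk speed plays no part. The following is only a sketch using the
core Benchmark module; the sample line, the names by_regex and by_split, and
the run time are assumptions for illustration, and the outcome will depend
on the perl version and on the real data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data, already in memory: four fields, no embedded ';'.
my @lines = ('a1;"b1";c1;"d1"') x 1000;

# The capture regex from the original post.
sub by_regex {
    my @col = $_[0] =~ /^(.*?);(.*?);(.*?);(.*?)$/ or die "bad line";
    return @col;
}

# Plain split; -1 keeps trailing empty fields.
sub by_split {
    my @col = split /;/, $_[0], -1;
    return @col;
}

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    regex => sub { by_regex($_) for @lines },
    split => sub { by_split($_) for @lines },
});
```

Comparing the rates printed here with the time a plain read loop over the
real file takes gives a rough idea of where the time actually goes.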

And I would always value correctness above speed and therefore use one
of the tried and time-tested Text::CSV modules.

>s/$regex_dequote/$1/ foreach @col;

jue


------------------------------

Date: Mon, 19 Jan 2015 00:09:42 -0800 (PST)
From: georgios.mpouras@gmail.com
Subject: Re: fields
Message-Id: <10edc661-b95e-4e09-a664-596fbb6a87bb@googlegroups.com>

On Monday, 19 January 2015 at 4:15:31 AM UTC+2, Jürgen Exner wrote:
> George Mpouras <gravitalsun@hotmail.foo> wrote:
> >I want the fastest way to grub the fields from a huge ; delimited file.
> >The fields can be randomly quoted or not.
>
> Obviously there are additional restrictions for your data, otherwise
> your solution below would yield wrong results.
>
> >currently I use the following
> >
> >use strict;
> >use warnings;
> >
> >my $regex_split   = qr/^(.*?);(.*?);(.*?);(.*?)$/o;
>
> So, obviously there are always 4 fields. You forgot to mention that in
> your specification.
> And the data itself never contains a semicolon. You forgot to mention
> that, too.
>
> >my $regex_dequote = qr/^"([^"\\]++|\\.)*+"$/o;
>
> This looks overly complicated but maybe I am just missing the forest for
> the trees. Wouldn't a simple
> 	my $regex_dequote = qr/^"(.*)"$/o;
> work just as well?
> Remember: REs are expensive. So try to use as simple REs as possible.
>
> >while (<DATA>) {
> >chomp;
> >my @col = $_ =~ $regex_split or die;
>
> Given those additional restrictions a simple
> 	my @col = split /;/ , $_;
> will probably be faster because the RE is so much simpler.
> If you insist on testing for exactly 4 fields then you can just check
> the length of @col.
>
> However, in any case: I doubt that splitting the file into its lines and
> individual fields is actually the bottle neck. Reading the file from an
> external device like a HD is probably much slower than such simple text
> operations.
>
> And I would always value correctness above speed and therefore use one
> of the tried and time-tested Text::CSV modules.
>
> >s/$regex_dequote/$1/ foreach @col;
>
> jue



1) Actually there are many more fields; this is only the core information.
2) My regex always gives correct results (but it is slow).
3) split is statistically (much) slower than the regex.
4) Your regex gives wrong results, e.g.

   my ($var) = '"hello"er"' =~ /^"(.*)"$/;
   print $var;
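
To make point 4 concrete, here is a small sketch comparing the two dequote
patterns on that input. One assumption in this sketch: the possessive
pattern is written with the capture around the whole repetition, because a
capture group that is itself repeated only remembers its last iteration.
The names $greedy, $strict and dequote are made up for the example.

```perl
use strict;
use warnings;

my $greedy = qr/^"(.*)"$/;
# Possessive pattern from the original post, with the capture moved
# around the whole repetition (an assumption made for this sketch).
my $strict = qr/^"((?:[^"\\]++|\\.)*+)"$/;

# Hypothetical helper: strip the outer quotes if the pattern matches,
# otherwise return the field unchanged.
sub dequote {
    my ($field, $re) = @_;
    return $field =~ $re ? $1 : $field;
}

for my $field ('"hello"', '"hello"er"', 'plain') {
    printf "%-12s greedy: %-10s strict: %s\n",
        $field, dequote($field, $greedy), dequote($field, $strict);
}
```

The greedy pattern accepts "hello"er" and yields hello"er, while the
possessive one does not match and leaves the field untouched.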

g.bouras


------------------------------

Date: Mon, 19 Jan 2015 13:54:42 +0000
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: fields
Message-Id: <87r3uqri3h.fsf@doppelsaurus.mobileactivedefense.com>

George Mpouras <gravitalsun@hotmail.foo> writes:
> I want the fastest way to grub the fields from a huge ; delimited
> file.

Didn't we already have this last time? An algorithm which is
more tuned to the actual data in question will be faster than one
designed to do well for more general cases. E.g., for the information
below, I doubt that 'a solution' (in Perl) can be much faster than

----------
print <<T
a1
c1
d1
a2
b2
d2
b3
a4
b4
c4
d4
T
---------

It won't work for any other input, but "there ain't no such thing as a
free lunch".

[...]

> my $regex_split   = qr/^(.*?);(.*?);(.*?);(.*?)$/o;
> my $regex_dequote = qr/^"([^"\\]++|\\.)*+"$/o;

The /o is pointless here as the qr// is evaluated only once,
anyway. Further, as already determined in the past, even when actually
interpolating something into the regex, qr// isn't particularly fast. Using
it is absolutely ridiculous for static regexes,

---------
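use Benchmark qw(cmpthese);

# $str rather than $a: $a is one of sort's special package variables.
my $str = 'ab' x 15;
my $re = qr/bab$/;

# Run each variant for at least 3 CPU seconds and compare the rates.
cmpthese(-3,
	  {
	   # precompiled qr// object interpolated into a match
	   qr => sub {
	       $str =~ /$re/;
	   },

	   # the same pattern written as a literal regex
	   re => sub {
	       $str =~ /bab$/;
	   }});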


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 4352
***************************************

