[33061] in Perl-Users-Digest
Perl-Users Digest, Issue: 4337 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Dec 27 03:09:14 2014
Date: Sat, 27 Dec 2014 00:09:01 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Sat, 27 Dec 2014 Volume: 11 Number: 4337
Today's topics:
Re: fields separation <rweikusat@mobileactivedefense.com>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Fri, 26 Dec 2014 18:37:52 +0000
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: fields separation
Message-Id: <87egrmqmtb.fsf@doppelsaurus.mobileactivedefense.com>
George Mpouras <gravitalsun@hotmail.foo> writes:
> On 26/12/2014 00:06, Rainer Weikusat wrote:
>> George Mpouras <gravitalsun@hotmail.foo> writes:
>>> Rainer, the following is 52% faster than your last version (and produces
>>> the same results), and I have a feeling that it can become even faster
>>
>> [per-char processing via split(//, ...)]
>>
>> It's still slower than yours for the posted test data, though.
>
> Yes, it depends on the data. My split // is slow.
> A compiled qr/ ... / regex will help your code a little,
It won't. perl caches a dynamic regex in the opcode and won't recompile
anything unless the variable changes. And getting the compiled regex out
of the opcode is faster than getting a compiled qr-regex. That's
something which already featured here a while ago (and I tested it
nevertheless).
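
One way to check this on a given perl is a small comparison like the
following. It is only a sketch: the pattern, the test string and the sub
names with_string and with_qr are made up for illustration, and the
resulting numbers will differ between perl versions and machines.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $pat  = '^([^>\n]+)';       # pattern kept in a plain string
my $qr   = qr/^([^>\n]+)/;     # the same pattern precompiled with qr//
my $text = 'aaaa<*>bb<*>123<*>ccc';

# Match 100 times against a pattern interpolated from a string.
# $pat never changes, so perl does not need to recompile it.
sub with_string {
    my $n = 0;
    for (1 .. 100) { $n++ if $text =~ /$pat/ }
    return $n;
}

# Match 100 times against the qr// object.
sub with_qr {
    my $n = 0;
    for (1 .. 100) { $n++ if $text =~ $qr }
    return $n;
}

# Fixed iteration count to keep the run short; use a negative value
# (CPU seconds) for more reliable timings.
cmpthese(5000, { string => \&with_string, qr => \&with_qr });
```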
> but I am thinking of a completely different approach using regex and
> states.
'An approach with regex and states' was exactly what I posted (since
there are only two states, the value of the $gather variable was used
to switch between 'input gathering' and 'separator scanning' modes). The
main problem with that was the very expensive separator recognition
loop. This can be improved by using the last character of the separator
in the first re and then checking the field backwards, as in your code.
Splitting the input and processing it char-by-char is still faster when
the average field length is very short (2.4 characters in the posted
example), but once it goes beyond six, using the regex engine instead
becomes faster.
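
Stripped of the file handling, the idea can be sketched on an in-memory
string as follows. The helper name split_fields is hypothetical, and
quotemeta (\Q...\E) is used here to escape the last separator character,
which is an assumption of this sketch and not what the benchmarked code
does.

```perl
use strict;
use warnings;

# Hypothetical helper: split $in on the multi-character separator $sep,
# ignoring newlines, working on a string instead of a file handle.
sub split_fields {
    my ($in, $sep) = @_;
    my $last = substr($sep, -1);         # last character of the separator
    my $ns   = length($sep);
    my $run  = qr/^([^\Q$last\E\n]+)/;   # run of chars that cannot end a separator
    my $field = '';
    my @fields;

    while (length($in)) {
        # gathering: take everything up to the next candidate character
        if ($in =~ s/$run//) { $field .= $1; next }

        # newlines are not part of any field
        next if $in =~ s/^\n+//;

        # the next character is $last: move it to the field, then check
        # whether the field now ends with the complete separator
        $field .= substr($in, 0, 1, '');
        if (length($field) >= $ns && substr($field, -$ns) eq $sep) {
            push(@fields, substr($field, 0, -$ns));
            $field = '';
        }
    }

    push(@fields, $field) if length($field);
    return @fields;
}

print(map { $_, "\n" } split_fields("aaaa<\n*>bb<*>1\n23<*\n>cc", '<*>'));
```

The regex engine consumes each run of ordinary characters in one step, so
the per-character work only happens at candidate separator ends and at
newlines.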
benchmark example
-----------------
use Benchmark qw(cmpthese);
use constant BLOCK => 64;    # bytes requested per (sys)read call

my $sep = '<*>';

# STDIN must be seekable (e.g. redirected from a file) because each run
# rewinds it with seek().
cmpthese(-3,
{
    find_fields => sub {
        my ($in, $field, $ls, $ns, $re);
        my @fields;

        seek(STDIN, 0, 0);

        # $re matches a run of characters containing neither the last
        # separator character nor a newline
        $ls = $re = substr($sep, -1, 1);
        $re = '\\\\' if $re eq '\\';
        $re = '^([^'.$re.'\n]+)';
        $ns = length($sep);

        while (length($in) || sysread(STDIN, $in, BLOCK) > 0) {
            $in =~ s/$re// and $field .= $1, next;
            $in =~ s/^\n+// and next;

            # next input character is $ls: append it, then check the
            # field backwards for a complete separator
            $field .= $ls;
            if (substr($field, -$ns) eq $sep) {
                push(@fields, substr($field, 0, -$ns));
                $field = '';
            }

            substr($in, 0, 1, '');
        }

        push(@fields, $field) if length($field);
        # print(map{$_, "\n" } @fields);
        # exit(0);
        return @fields;
    },

    george => sub {
        my $lensep = length $sep;
        my $field;
        my $rawdata;
        my @fields;

        seek(STDIN, 0, 0);

        while (read STDIN, $rawdata, BLOCK) {
            # per-character processing, skipping vertical whitespace
            foreach (split //, $rawdata) {
                next if /\v/;
                $field .= $_;

                if ($sep eq substr $field, -$lensep) {
                    $field = substr $field, 0, -$lensep;
                    push(@fields, $field);
                    $field = ''
                }
            }
        }

        push(@fields, $field) if length($field);
        # print(map{$_, "\n" } @fields);
        # exit(0);
        return @fields;
    }});
---------------
input
original
-------------
aaaa<
*>bb<*>1
23<*
>cc
c<*>dddd<*>
<
*
>
ee
<*
><
*>
ee
<*>
ffff
<*
>
------------
longer fields
------------
rraaaaaa<
*>uuibbb<*>1
2222u23<*
>cc44c
yyuuc<*>dyyddddd<*>
<
*
>
eeeeee
<*
><
*>
eyeee888
<*>
ffeef88f
<*
>
----------
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 4337
***************************************