[33084] in Perl-Users-Digest

home help back first fref pref prev next nref lref last post

Perl-Users Digest, Issue: 4360 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Jan 31 05:17:21 2015

Date: Sat, 31 Jan 2015 02:17:07 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Sat, 31 Jan 2015     Volume: 11 Number: 4360

Today's topics:
    Re: Compare 2 Hash of Hashes using sub-hash values <fmassion@web.de>
    Re: Compare 2 Hash of Hashes using sub-hash values <ben.usenet@bsb.me.uk>
    Re: Compare 2 Hash of Hashes using sub-hash values <fmassion@web.de>
    Re: Compare 2 Hash of Hashes using sub-hash values <ben.usenet@bsb.me.uk>
    Re: Global, Local, Static <rweikusat@mobileactivedefense.com>
    Re: Global, Local, Static <jblack@nospam.com>
    Re: Global, Local, Static <hjp-usenet3@hjp.at>
    Re: Multi-field sorting (general case) <gravitalsun@hotmail.foo>
    Re: Multi-field sorting (general case) <rweikusat@mobileactivedefense.com>
    Re: Multi-field sorting (general case) <bjoern@hoehrmann.de>
        relative speed of regex vs split for splitting on white <rweikusat@mobileactivedefense.com>
    Re: relative speed of regex vs split for splitting on w <gravitalsun@hotmail.foo>
        SOLVED: Compare 2 Hash of Hashes using sub-hash values <fmassion@web.de>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Thu, 29 Jan 2015 06:26:19 -0800 (PST)
From: F Massion <fmassion@web.de>
Subject: Re: Compare 2 Hash of Hashes using sub-hash values
Message-Id: <5d505b8e-68cd-4406-b20c-781b45503994@googlegroups.com>

Am Donnerstag, 29. Januar 2015 14:50:45 UTC+1 schrieb Ben Bacarisse:
> F Massion <fmassion@web.de> writes:
> 
> > Am Donnerstag, 29. Januar 2015 13:15:23 UTC+1 schrieb Ben Bacarisse:
> >> F Massion <fmassion@web.de> writes:
> >> 
> >> > Am Donnerstag, 29. Januar 2015 02:20:33 UTC+1 schrieb Ben Bacarisse:
> >> >> F Massion <fmassion@web.de> writes:
> >> >> 
> >> >> > I am not particularly good at Perl and do not get any further with the
> >> >> > following: I want to compare 2 relatively large tables with
> >> >> > tab-separated columns which I have imported each in a Hash of Hashes.
> >> >> >
> >> >> > File #1:
> >> >> > RecordID	Name	Language	Status
> >> >> > 715	surface technology	English	 
> >> >> > 763	mineral abrasive	English	validated
> >> >> > 796	Zinkstaubgrundierung	German	validated
> >> >> >
> >> >> > File #2:
> >> >> > RecordID	Name	Language	Status
> >> >> > 763	mineral abrasive	English	not validated
> >> >> > 813	2-component prime coat	English	not validated
> >> >> > 815	finishing coat	English	 
> >> >> >
> >> >> > I would like to use the value of certain columns to find out what are
> >> >> > the differences or the common entries between these 2 files.
> >> >> <snip>
> >> >> > This is my code:
> >> >> > #!/usr/bin/perl -w
> >> >> > use warnings; use strict;
> >> >> > open(FIRSTFILE, $ARGV[0]) || die("file $ARGV[0] cannot be opened!\n");
> >> >> > my %Firstfile = ();
> >> >> > while (<FIRSTFILE>) 
> >> >> > {
> >> >> >     chomp;
> >> >> >     my ($RecordID, $Name, $Language, $Status) = split ('\t');
> >> >> >     $Firstfile{$RecordID}{'Name'} = $Name;
> >> >> >     $Firstfile{$RecordID}{'Language'} = $Language;
> >> >> >     $Firstfile{$RecordID}{'Status'} = $Status;
> >> >> >     print "$RecordID -- $Firstfile{$RecordID}{'Name'} ::
> >> >> > $Firstfile{$RecordID}{'Language'}\n" ;
> >> >> > }
> >> >> > open(SECONDFILE, $ARGV[1]) || die("file $ARGV[1] cannot be opened!\n");
> >> >> > my %Secondfile = ();
> >> >> > while (<SECONDFILE>) 
> >> >> > {
> >> >> >     chomp;
> >> >> >     my ($RecordID, $Name, $Language, $Status) = split ('\t');
> >> >> >     $Secondfile{$RecordID}{'Name'} = $Name;
> >> >> >     $Secondfile{$RecordID}{'Language'} = $Language;
> >> >> >     $Secondfile{$RecordID}{'Status'} = $Status;
> >> >> >     print "$RecordID -- $Secondfile{$RecordID}{'Name'} ::
> >> >> > $Secondfile{$RecordID}{'Status'}\n";
> >> >> > }
> >> >> > #QUESTIONS
> >> >> > #(1) Why do I lose the variables from the 2 loops above? I.e. I cannot
> >> >> > re-use $RecordID or $Name...
> >> >> 
> >> >> These are declared as local variables inside the body of the loop.  You
> >> >> do that because you don't want the variables to be available outside.
> >> >> You could more the declaration to outside the loop, but why?  What do
> >> >> you want them for?
> >> >> 
> >> >
> >> > My aim is to fine-tune the comparison of these two tables. In real
> >> > life these are exports from terminology databases with thousands of
> >> > records and Attribute values. Thus I would like to say for example:
> >> > "Which records in File#2 have the same RecordID as in File#2 but a
> >> > different Status. Please Output only the Record ID, the Name and the
> >> > Status as a tab-delimited file". This file can then be processed by
> >> > the terminology database. In order to do that type of oparation I need
> >> > a fine granularity than $inkey / $outkey + $compound_outvalue.
> >> 
> >> Sorry, I'm lost.  The problem I am having is that your code is largely
> >> correct.  You do this (slightly reformatted for line length) which tests
> >> exactly what you want to test:
> >> 
> >>   foreach $innerkey ( keys %Firstfile ) {
> >>     if ($Firstfile{$innerkey}{'Status'} ne
> >>         $Secondfile{$innerkey}{'Status'}) {
> >>            ...
> >>     }
> >>   }
> >> 
> >> but the code then tries to print more than you apparently need.  Is the
> >> problem just that you don't know how to write
> >> 
> >>   print "$innerkey\t$Secondfile{$innerkey}{'Status'}\n";
> >> 
> >> ?  (You don't say which of the two different statuses you want but
> >> that's trivial to change.) 
> >
> > This is the amended Loop:
> >    foreach $innerkey ( keys %Firstfile ) {
> >    	if ($Firstfile{$innerkey}{'Status'} ne $Secondfile{$innerkey}{'Status'}){ 
> >    print "$innerkey:\:" ;
> >    foreach $outerkey ( keys %{$Firstfile{$innerkey}} ) {
> > #   print "$Firstfile{$innerkey}{$outerkey}::" ;  = my "old Version"
> >    print "$innerkey\t$Secondfile{$innerkey}{'Status'}\n"; # = your suggestion
> >    }
> >    print "\n" ;
> >  }}
> >
> > I get error messages about uninitialized values. I guess this value is $Secondfile{$innerkey}
> >
> >    Use of uninitialized value in string ne  line 31, <SECONDFILE> line 4.
> >    Use of uninitialized value in concatenation (.) or string  line 35, <SECONDFILE> line 4.
> >
> > Maybe I want to do something impossible.
> 
> No.  The code is fine, but there are some rough edges.  The warnings are
> just that -- warnings.  You can turn then off (remove -w and "use
> warnings" from the script) or, better, you can fix them.  They refer to
> the fact that some bits of the data are missing.
> 
> What do you want to do, for example, when the second file has no
> corresponding record for an ID in the first?  Is that considered to be a
> different status or something else?  Do you want to ignore it, or report
> it?  

It will depends on the individual situatoin. I assume, if the code you suggest works fine, I can change the Terms of the query as needed.

To ignore it, you might do this:
> 
> foreach $innerkey ( keys %Firstfile ) {
>     next unless exists $Secondfile{$innerkey};
>     if ($Firstfile{$innerkey}{'Status'} ne $Secondfile{$innerkey}{'Status'}) { 
>         print "$innerkey\t$Secondfile{$innerkey}{'Status'}\n" ;
>     }
> }
> 
> The key test being "exists $Secondfile{$innerkey}".  You can alter the
> logic to report a missing record or do whatever else your situation
> demands.
> 

I have implemented your code.

The output is:

763::763        not validated
763     not validated
763     not validated

The optimized output would be:

763     not validated
but in that case I need to output the difference only once. This is important to solve because in practice I am dealing with sth. like 35,000+ records per file.

Thanks alot for your input.


> <snip>
> -- 
> Ben.



------------------------------

Date: Thu, 29 Jan 2015 17:07:40 +0000
From: Ben Bacarisse <ben.usenet@bsb.me.uk>
Subject: Re: Compare 2 Hash of Hashes using sub-hash values
Message-Id: <874mr97bxf.fsf@bsb.me.uk>

F Massion <fmassion@web.de> writes:

> Am Donnerstag, 29. Januar 2015 14:50:45 UTC+1 schrieb Ben Bacarisse:
<snip>
>> No.  The code is fine, but there are some rough edges.  The warnings are
>> just that -- warnings.  You can turn then off (remove -w and "use
>> warnings" from the script) or, better, you can fix them.  They refer to
>> the fact that some bits of the data are missing.
>> 
>> What do you want to do, for example, when the second file has no
>> corresponding record for an ID in the first?  Is that considered to be a
>> different status or something else?  Do you want to ignore it, or report
>> it?  
>
> It will depends on the individual situatoin. I assume, if the code you
> suggest works fine, I can change the Terms of the query as needed.
>
> To ignore it, you might do this:
>> 
>> foreach $innerkey ( keys %Firstfile ) {
>>     next unless exists $Secondfile{$innerkey};
>>     if ($Firstfile{$innerkey}{'Status'} ne $Secondfile{$innerkey}{'Status'}) { 
>>         print "$innerkey\t$Secondfile{$innerkey}{'Status'}\n" ;
>>     }
>> }
>> 
>> The key test being "exists $Secondfile{$innerkey}".  You can alter the
>> logic to report a missing record or do whatever else your situation
>> demands.
>> 
>
> I have implemented your code.
>
> The output is:
>
> 763::763        not validated
> 763     not validated
> 763     not validated

I am not sure what you are actually running anymore.

> The optimized output would be:
>
> 763     not validated

Right.  That's what I get with my loop and your test data.  Obviously I
removed the other output statements from the data reading loops.

<snip>
-- 
Ben.


------------------------------

Date: Thu, 29 Jan 2015 10:43:31 -0800 (PST)
From: F Massion <fmassion@web.de>
Subject: Re: Compare 2 Hash of Hashes using sub-hash values
Message-Id: <8cec18ab-bf53-4b49-b3a5-45b623425639@googlegroups.com>

Am Donnerstag, 29. Januar 2015 18:07:44 UTC+1 schrieb Ben Bacarisse:
> F Massion <fmassion@web.de> writes:
> 
> > Am Donnerstag, 29. Januar 2015 14:50:45 UTC+1 schrieb Ben Bacarisse:
> <snip>
> >> No.  The code is fine, but there are some rough edges.  The warnings are
> >> just that -- warnings.  You can turn then off (remove -w and "use
> >> warnings" from the script) or, better, you can fix them.  They refer to
> >> the fact that some bits of the data are missing.
> >> 
> >> What do you want to do, for example, when the second file has no
> >> corresponding record for an ID in the first?  Is that considered to be a
> >> different status or something else?  Do you want to ignore it, or report
> >> it?  
> >
> > It will depends on the individual situatoin. I assume, if the code you
> > suggest works fine, I can change the Terms of the query as needed.
> >
> > To ignore it, you might do this:
> >> 
> >> foreach $innerkey ( keys %Firstfile ) {
> >>     next unless exists $Secondfile{$innerkey};
> >>     if ($Firstfile{$innerkey}{'Status'} ne $Secondfile{$innerkey}{'Status'}) { 
> >>         print "$innerkey\t$Secondfile{$innerkey}{'Status'}\n" ;
> >>     }
> >> }
> >> 
> >> The key test being "exists $Secondfile{$innerkey}".  You can alter the
> >> logic to report a missing record or do whatever else your situation
> >> demands.
> >> 
> >
> > I have implemented your code.
> >
> > The output is:
> >
> > 763::763        not validated
> > 763     not validated
> > 763     not validated
> 
> I am not sure what you are actually running anymore.
> 
> > The optimized output would be:
> >
> > 763     not validated
> 
> Right.  That's what I get with my loop and your test data.  Obviously I
> removed the other output statements from the data reading loops.
> 

Thanks Ben, you are such a great help! 
This is the only remaining 'print' command and I get three times the same result. Where is the error?

foreach $innerkey ( keys %Firstfile ) {
   	next unless exists $Secondfile{$innerkey};
   	if ($Firstfile{$innerkey}{'Status'} ne $Secondfile{$innerkey}{'Status'}){ 
   foreach $outerkey ( keys %{$Firstfile{$innerkey}} ) {
   print "$innerkey\t$Secondfile{$innerkey}{'Status'}\n"; 
   }}
}


> <snip>
> -- 
> Ben.



------------------------------

Date: Thu, 29 Jan 2015 22:56:38 +0000
From: Ben Bacarisse <ben.usenet@bsb.me.uk>
Subject: Re: Compare 2 Hash of Hashes using sub-hash values
Message-Id: <877fw55h7d.fsf@bsb.me.uk>

F Massion <fmassion@web.de> writes:

> Am Donnerstag, 29. Januar 2015 18:07:44 UTC+1 schrieb Ben Bacarisse:
<snip>
>> > To ignore it, you might do this:
>> >> 
>> >> foreach $innerkey ( keys %Firstfile ) {
>> >>     next unless exists $Secondfile{$innerkey};
>> >>     if ($Firstfile{$innerkey}{'Status'} ne $Secondfile{$innerkey}{'Status'}) { 
>> >>         print "$innerkey\t$Secondfile{$innerkey}{'Status'}\n" ;
>> >>     }
>> >> }
<snip>
>> > The optimized output would be:
>> >
>> > 763     not validated
>> 
>> Right.  That's what I get with my loop and your test data.  Obviously I
>> removed the other output statements from the data reading loops.
>> 
>
> Thanks Ben, you are such a great help! 
> This is the only remaining 'print' command and I get three times the same result. Where is the error?
>
> foreach $innerkey ( keys %Firstfile ) {
>    	next unless exists $Secondfile{$innerkey};
>    	if ($Firstfile{$innerkey}{'Status'} ne $Secondfile{$innerkey}{'Status'}){ 
>    foreach $outerkey ( keys %{$Firstfile{$innerkey}} ) {
>    print "$innerkey\t$Secondfile{$innerkey}{'Status'}\n"; 
>    }}
> }

You have an extra for loop.  I've left the code I suggested in the
quoted material.

-- 
Ben.


------------------------------

Date: Thu, 29 Jan 2015 16:47:13 +0000
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Global, Local, Static
Message-Id: <87h9v9a60e.fsf@doppelsaurus.mobileactivedefense.com>

Robbie Hatley <see.my.sig@for.my.address> writes:
> On 1/27/2015 5:59 AM, Rainer Weikusat wrote:
>
>> > So....  any way of doing that [retaining variable values between
>> > calls] in Perl subroutines?
>>
>> Two, actually. A 'traditional' one, create a my variable in a scope only
>> the subroutine can access,

[...]

> #! /usr/bin/perl
> use v5.14;
> use strict;
> use warnings;
> use bignum;
> BEGIN {
>    my $i = 1;
>    printf("%80s\n", $i);
>    my $j = 1;
>    printf("%80s\n", $j);
>    sub Fibonacci {
>       my $k = $i + $j;
>       $i = $j;
>       $j = $k;
>       printf("%80s\n", $k);
>    }
> }
> for (1..350) {
>    Fibonacci;
> }

This is a special case of a more general Perl feature, namely, support
for lexical closures: A subroutine created in a certain scope will retain
access to any lexical variables accessible at the time it was
created. Silly example:

--------
sub make_whiner
{
    my $calls;

    return sub {
	++$calls;
	print("\t$calls times already, and they keep calling me!\n")
	    if $calls > 5;

	print("Your servant!\n");
    };
}

my ($whiner_a, $whiner_b) = (make_whiner(), make_whiner());

$whiner_a->() for 0 .. 4;

for (1 .. 10) {
    print("\na\n-\n");
    $whiner_a->();
    
    print("\nb\n-\n");
    $whiner_b->();
}
---------
    
This is extremely useful in a single-threaded, multi-tasking
environment, eg, some kind of 'server' structured around an event
processing loop[*].

[*] Someone who's a clueless 'imperative programmer' will extend this
into a(nother) general 'cooperative userspace threading' library, a
clueless 'functional programmer' will ultimatively come up with
'generalized continuations' ...

[...]

> local    =>  temporary value; revert when leave scope

The important thing about local is that it creates a lexcially scoped
binding for a dynamic symbol, eg,

---------
our $master;

sub servant
{
    print("My master is $master\n");
}

sub mr_melon
{
    local $master = "Mr Melon";
    servant();
}

sub general_duckhead
{
    local $master = "General Duckhead";
    servant();
}

mr_melon();
general_duckhead();
---------

This can also be used to create a (localized) name for an anonymous
object (by assigning the reference to that to the corresponding glob):

---------
our @list;

sub print_list
{
    local *list = $_[0];
    print("$_\n") for @list;
}

print_list([1,2,3,4]);
print_list([qw(a b c d)]);
---------


------------------------------

Date: Fri, 30 Jan 2015 16:44:26 -0600
From: John Black <jblack@nospam.com>
Subject: Re: Global, Local, Static
Message-Id: <MPG.2f35c52cca97a668989810@news.eternal-september.org>

In article <87h9v9a60e.fsf@doppelsaurus.mobileactivedefense.com>, 
rweikusat@mobileactivedefense.com says...
> > local    =>  temporary value; revert when leave scope
> 
> The important thing about local is that it creates a lexcially scoped
> binding for a dynamic symbol, eg,
> 
> ---------
> our $master;
> 
> sub servant
> {
>     print("My master is $master\n");
> }
> 
> sub mr_melon
> {
>     local $master = "Mr Melon";
>     servant();
> }
> 
> sub general_duckhead
> {
>     local $master = "General Duckhead";
>     servant();
> }
> 
> mr_melon();
> general_duckhead();
> ---------

I don't really get how this is different from the case where $master is global.  If I remove 
the 'our $master' and remove the words 'local', I get the exact same results.

> This can also be used to create a (localized) name for an anonymous
> object (by assigning the reference to that to the corresponding glob):
> 
> ---------
> our @list;
> 
> sub print_list
> {
>     local *list = $_[0];
>     print("$_\n") for @list;
> }
> 
> print_list([1,2,3,4]);
> print_list([qw(a b c d)]);

Again, I can easily pass a list to a subroutine and print it without need for 'our' or 
'local'.  I'm missing what they are buying me...

John Black


------------------------------

Date: Sat, 31 Jan 2015 10:48:12 +0100
From: "Peter J. Holzer" <hjp-usenet3@hjp.at>
Subject: Re: Global, Local, Static
Message-Id: <slrnmcp96s.jto.hjp-usenet3@hrunkner.hjp.at>

On 2015-01-30 22:44, John Black <jblack@nospam.com> wrote:
> In article <87h9v9a60e.fsf@doppelsaurus.mobileactivedefense.com>, 
> rweikusat@mobileactivedefense.com says...
>> > local    =>  temporary value; revert when leave scope
>> 
>> The important thing about local is that it creates a lexcially scoped
>> binding for a dynamic symbol, eg,

Hmm. I understand what you mean, but that's a rather confusing way of
describing it.


>> ---------
>> our $master;
>> 
>> sub servant
>> {
>>     print("My master is $master\n");
>> }
>> 
>> sub mr_melon
>> {
>>     local $master = "Mr Melon";
>>     servant();
>> }
>> 
>> sub general_duckhead
>> {
>>     local $master = "General Duckhead";
>>     servant();
>> }
>> 
>> mr_melon();
>> general_duckhead();
>> ---------
>
> I don't really get how this is different from the case where $master
> is global.  If I remove the 'our $master' and remove the words
> 'local', I get the exact same results.

Change the last two lines to:

$master = "nobody";
servant();
mr_melon();
servant();
general_duckhead();
servant();

and you will see the difference.

>
>> This can also be used to create a (localized) name for an anonymous
>> object (by assigning the reference to that to the corresponding glob):
>> 
>> ---------
>> our @list;
>> 
>> sub print_list
>> {
>>     local *list = $_[0];
>>     print("$_\n") for @list;
>> }
>> 
>> print_list([1,2,3,4]);
>> print_list([qw(a b c d)]);
>
> Again, I can easily pass a list to a subroutine and print it without
> need for 'our' or 'local'.  I'm missing what they are buying me...

“our” gives you global variables. Or rather it gives you a way to
*declare* a global variable, since global variables are the default in
Perl.

“local” gives you dynamically scoped local variables. Before “my” was
introduced, this was the only kind of local variables Perl had. So
historically, the question wasn't “why would we need ‘local’ when we
already have ‘my’” but the other way around - and the answer was of
course that you can *see* the the scope of a lexically scoped variable on
the screen, while you have to track calls to subroutines to determine
the dynamic scope. So normally you use lexical scoping with “my”.

Still, “local” is occasionally useful, for two reasons:

* Perl has a ton of built-in global variables, which you might want to
  temporarily override. For example perldoc perlvar recommends to
  override $/ like this:

        my $content = '';
        open my $fh, "<", "foo" or die $!;
        {
            local $/;
            $content = <$fh>;
        }
        close $fh;

  At the closing brace, $/ reverts to the value it had before, so you
  don't have to worry about some other code which might use $/.

* “local” works on any lvalue, not just variables. So for example if I
  want to temporarily prevent DBI from throwing exceptions, I can do
  something like this:

    # $dbh->{AutoRaise} might be true or false here.

    {
        $local $dbh->{AutoRaise} = 0;

        # do something which might otherwise raise an exception and just
        # check the return value
    }

    # AutoRaise has the previous value again.


        hp

-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) |                    | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


------------------------------

Date: Thu, 29 Jan 2015 17:44:06 +0200
From: George Mpouras <gravitalsun@hotmail.foo>
Subject: Re: Multi-field sorting (general case)
Message-Id: <madkf8$5sa$1@news.ntua.gr>

On 29/1/2015 03:50, James wrote:
> The goal is to sort by multiple fields. Got an idea from other post.
>
> my-prog 0 1n 2
> => first alphabetically field-0, then numerically field-1, then alphabetically field-3
> (1n means field-1 is numerically sorted)
> See code below.
>
> How can I generalize this for arbitrary N-fields? For example, for N=4 (0,1,2,3),
> my-prog 2 3n 0 1n

Your idea is very good


use strict;
use warnings;
my $Sorter = BuildSorter( qr/(\S+)/, 'cmp', '<=>', 'cmp');

for (sort $Sorter <DATA> ) {print}

sub BuildSorter
{
my $regex = shift;
my @code;
for (my $i=0; $i<@_;$i++) {
push @code, '($a=~/$regex/g)'. "[$i] $_[$i] ". '($b=~/$regex/g)['.$i.']'}
eval 'sub{'. join(' or ', @code). '}'
}
	

__DATA__
yy      7       perl
xx      101     hello
yy      33      world
xx      33      world
yy      101     hello
yy      7       hello



------------------------------

Date: Thu, 29 Jan 2015 19:22:35 +0000
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Multi-field sorting (general case)
Message-Id: <87a9119ytg.fsf@doppelsaurus.mobileactivedefense.com>

James <hslee911@yahoo.com> writes:
> The goal is to sort by multiple fields. Got an idea from other post.
>
> my-prog 0 1n 2
> => first alphabetically field-0, then numerically field-1, then alphabetically field-3
> (1n means field-1 is numerically sorted)
> See code below.
>
> How can I generalize this for arbitrary N-fields? For example, for N=4 (0,1,2,3),
> my-prog 2 3n 0 1n
> etc.

A naive approach similar to the one George posted, just with less
insistence pm doing things backwards for the sake of ... well ... doing
them backwards,

--------------
use constant CMPS_TMPL => 	'sub { my (@fa, @fb); @fa = $a =~ /(\S+)/g; @fb = $b =~ /(\S+)/g; %s; }'; 
use constant CMP_TMPL =>	'$fa[%d] %s $fb[%d]';

sub build_cmp
{
    my $spec = $_[0];
    my @cmp;

    for ($spec =~ /(\S+)/g) {
	/(\d+)(.*)/;
	push(@cmp, sprintf(CMP_TMPL, $1, $2 eq 'n' ? '<=>' : 'cmp', $1));
    }

    eval(sprintf(CMPS_TMPL, join(' or ', @cmp)));
}

my $cmp = build_cmp('0 1n 2');

print(sort $cmp <DATA>);

__DATA__
yy      7       perl
xx      101     hello
yy      33      world
xx      33      world
yy      101     hello
yy      7       hello
----------------

A drawback of this is that sort will run the comparison routine more
than once for every element of the input list and each time it runs, it
splits both string arguments into fields, despite they were already
splitted into field the last time sort was called with a particular
argument. This can be avoided by doing a so-called 'Schwartzian
transform' of the input data, eg,

-----------------
use constant CMP_TMPL =>	'$a->[1][%d] %s $b->[1][%d]';

sub build_cmp
{
    my @cmp;

    for ($_[0] =~ /(\S+)/g) {
	/(\d+)(.*)/;
	push(@cmp, sprintf(CMP_TMPL, $1, $2 eq 'n' ? '<=>' : 'cmp', $1));
    }

    eval(sprintf('sub { %s }', join(' or ', @cmp)));
}

sub field_sort($@)
{
    my $cmp = shift;

    map { $_->[0]; } sort $cmp (map { [$_, [/(\S+)/g]] } @_);
}

my $cmp = build_cmp('0 1n 2');
print(field_sort($cmp, <DATA>));

__DATA__
yy      7       perl
xx      101     hello
yy      33      world
xx      33      world
yy      101     hello
yy      7       hello
------------------


------------------------------

Date: Thu, 29 Jan 2015 23:24:07 +0100
From: Bjoern Hoehrmann <bjoern@hoehrmann.de>
Subject: Re: Multi-field sorting (general case)
Message-Id: <a2clca9q6gqsi52boe5u8vp5u836ai49mb@hive.bjoern.hoehrmann.de>

* James wrote in comp.lang.perl.misc:
>The goal is to sort by multiple fields. Got an idea from other post.
>
>my-prog 0 1n 2
>=> first alphabetically field-0, then numerically field-1, then alphabetically field-3
>(1n means field-1 is numerically sorted)
>See code below.
>
>How can I generalize this for arbitrary N-fields? For example, for N=4 (0,1,2,3),
>my-prog 2 3n 0 1n
>etc.

My http://search.cpan.org/perldoc?List::OrderBy module lets you write

  my @sorted = order_cmp_by { $fields[0] }
                    then_by { $fields[1] }
                then_cmp_by { $fields[3] } @unsorted;

The `then_` functions return an object that is ultimately used by the
`order_` functions that sorts the elements by first extracting a key
followed by a `sort` call that loops through all "levels" until it can
distinguish two elements.
-- 
Bjrn Hhrmann  mailto:bjoern@hoehrmann.de  http://bjoern.hoehrmann.de
D-10243 Berlin  PGP Pub. KeyID: 0xA4357E78  http://www.bjoernsworld.de
 Available for hire in Berlin (early 2015)   http://www.websitedev.de/ 


------------------------------

Date: Fri, 30 Jan 2015 15:39:55 +0000
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: relative speed of regex vs split for splitting on whitespace
Message-Id: <87oapgwa44.fsf@doppelsaurus.mobileactivedefense.com>

A past posting by George Mpouras claimed that using the regex engine to
split a line into fields would be much faster than using split (IIRC).
Since I wanted to know if this is actually true, I wrote the following
test program:

-------------
use Benchmark;

use constant MAX_TOKENS =>	5;
use constant MAX_TOKEN =>	10;
use constant MAX_WS =>		1;

use constant N_LINES =>		10;

sub make_token
{
    return 'A' x (rand(MAX_TOKEN) + 1);
}

sub make_sep
{
    return ' ' x (rand(MAX_WS) + 1);
}

sub make_line
{
    my $l;

    $l = make_token();
    $l .= make_sep().make_token() for 1 .. rand(MAX_TOKENS);

    return $l;
}

my @lines = map { make_line(); } 1 .. N_LINES;

#print(map {$_, "\n" } @lines);

timethese(-3,
	  {
	   split => sub {
	       my @a;
	       @a = split for @lines;
	   },

	   regex => sub {
	       my @a;
	       @a = /\S+/g for @lines;
	   }});
-------------

At least for me, the result of running it was that split was always
faster no matter what combination of line parameters I tried.


------------------------------

Date: Fri, 30 Jan 2015 22:23:04 +0200
From: George Mpouras <gravitalsun@hotmail.foo>
Subject: Re: relative speed of regex vs split for splitting on whitespace
Message-Id: <magp83$11g5$1@news.ntua.gr>

In general Rainer you are correct, the split is faster 99% of the cases. 
I had a misleading set of weird data that gave me this idea.

For example look at the following where the regex is much faster
--------------------------




#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use feature 'say';

#####################################
##### Generata some sample data #####
#####################################

my $lines         = 10;
my $fields_number = 23;
my $field_length  = 20;
my $separator     = '0' x300;
my @pool          = 'A'..'z';push @pool,'a'..'z'; push @pool,0..9;
push @pool,         split //, '\\`~!@#$%^&*?/;_-+=[]{}()"<|>';
my   @DATA;
for (1..$lines) {
my @fields;

	for (1..$fields_number)	{
	my $field;
	$field .= $pool[int rand scalar @pool] for 1..$field_length;
	push @fields, $field
	}

push @DATA, join($separator,@fields)
}

my @fields;
my $regex;
for (1..$fields_number-1) { $regex .= "(.*?)\Q$separator\E" }
$regex = qr/^$regex(.*?)$/;



###########################
# compare split and regex #
###########################

sub MethodSplit {
foreach (@DATA) {@fields = split /\Q$separator\E/, $_}
}

sub MethodRegex {
foreach (@DATA) {@fields = $_ =~$regex}
}

cmpthese(1000, {
MethodSplit => \&MethodSplit,
MethodRegex => \&MethodRegex
});




------------------------------

Date: Thu, 29 Jan 2015 23:45:27 -0800 (PST)
From: F Massion <fmassion@web.de>
Subject: SOLVED: Compare 2 Hash of Hashes using sub-hash values
Message-Id: <770efd91-11a3-4c6e-824b-50f793471623@googlegroups.com>

Am Donnerstag, 29. Januar 2015 23:56:43 UTC+1 schrieb Ben Bacarisse:
> F Massion <fmassion@web.de> writes:
>=20
> > Am Donnerstag, 29. Januar 2015 18:07:44 UTC+1 schrieb Ben Bacarisse:
> <snip>
> >> > To ignore it, you might do this:
> >> >>=20
> >> >> foreach $innerkey ( keys %Firstfile ) {
> >> >>     next unless exists $Secondfile{$innerkey};
> >> >>     if ($Firstfile{$innerkey}{'Status'} ne $Secondfile{$innerkey}{'=
Status'}) {=20
> >> >>         print "$innerkey\t$Secondfile{$innerkey}{'Status'}\n" ;
> >> >>     }
> >> >> }
> <snip>
> >> > The optimized output would be:
> >> >
> >> > 763     not validated
> >>=20
> >> Right.  That's what I get with my loop and your test data.  Obviously =
I
> >> removed the other output statements from the data reading loops.
> >>=20
> >
> > Thanks Ben, you are such a great help!=20
> > This is the only remaining 'print' command and I get three times the sa=
me result. Where is the error?
> >
> > foreach $innerkey ( keys %Firstfile ) {
> >    	next unless exists $Secondfile{$innerkey};
> >    	if ($Firstfile{$innerkey}{'Status'} ne $Secondfile{$innerkey}{'Stat=
us'}){=20
> >    foreach $outerkey ( keys %{$Firstfile{$innerkey}} ) {
> >    print "$innerkey\t$Secondfile{$innerkey}{'Status'}\n";=20
> >    }}
> > }
>=20
> You have an extra for loop.  I've left the code I suggested in the
> quoted material.
>=20
> --=20
> Ben.

Thanks alot Ben. You saved my day (I owe you a beer if you come to Germany;=
-).=20
The code works perfectly. It is posssible to compare 2 Hashes of Hashes whi=
ch have been created by importing two tab-delimited files with data in tabl=
es. The first column being the inner key of the hash of hashes.

My understanding problem at the beginning was that I didn't realize that I =
was working with Hash REFERENCES (It was written in huge billboard letters =
in some reference material, but somehow I oversaw it). With the proper synt=
ax to address referenced hashes values or keys it is possible to finetune t=
he comparison and the output criteria.

Francois Massion


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 4360
***************************************


home help back first fref pref prev next nref lref last post