[32201] in Perl-Users-Digest


Perl-Users Digest, Issue: 3466 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Aug 6 09:09:21 2011

Date: Sat, 6 Aug 2011 06:09:03 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Sat, 6 Aug 2011     Volume: 11 Number: 3466

Today's topics:
    Re: Choice of data structure <rweikusat@mssgmbh.com>
    Re: Delaying interpolation in a qr <bernie@fantasyfarm.com>
    Re: Delaying interpolation in a qr <bernie@fantasyfarm.com>
    Re: seeking advice on problem difficulty <ben.usenet@bsb.me.uk>
    Re: seeking advice on problem difficulty <rweikusat@mssgmbh.com>
    Re: seeking advice on problem difficulty <ben.usenet@bsb.me.uk>
    Re: Sorting hash %seen <ela@yantai.org>
    Re: Sorting hash %seen <rweikusat@mssgmbh.com>
    Re: Sorting hash %seen <rweikusat@mssgmbh.com>
    Re: Sorting hash %seen <jimsgibson@gmail.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Fri, 05 Aug 2011 16:11:14 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Choice of data structure
Message-Id: <874o1vyju5.fsf@sapphire.mobileactivedefense.com>

"ela" <ela@yantai.org> writes:
> "Rainer Weikusat" <rweikusat@mssgmbh.com> wrote in message 
> news:87ei11uzui.fsf@sapphire.mobileactivedefense.com...
>> "ela" <ela@yantai.org> writes:
>>> I've been working on this problem for 4 days and still cannot come out a
>>> good solution and would appreciate if you could comment on the problem.
>>>
>>> Given a table containing cells delimited by tab like this
>>
>> [ please see original for the indeed gory details ]
>>
>> Provided I understood the problem correctly, a possible solution could
>> look like this (this code has had very little testing): First, you
>> define your groups by associating array references containing the group
>> members with the 'group ID' with the help of a hash:
>>
>> $grp{1} = [1, 2];
>>
>> Then, you create a hash mapping the column name to the column value
>> for each ID and put these hashes into an id hash associated with the
>> ID:
>>
>> $id{1} = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC4' };
>> $id{2} = { F1 => 'SuperC1', F2 => 'C1',  F3 => 'subC3' };
>
> While revising the code, I find that I over-rely on hashes, and that
> complicates my problem. What made you decide to use an array for
> "group" but a hash for "id"?

'Groups' didn't have named columns.
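
For readers following along, the layout described in the quoted text can be sketched roughly as below. This is only an illustration built from the values quoted above; the helper name column_values is made up here and does not come from the thread.

```perl
use strict;
use warnings;

# Groups have no named columns, so a plain list of member IDs is enough.
my %grp = (1 => [1, 2]);

# Each ID has named columns, so a hash keyed by column name fits.
my %id = (
    1 => { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' },
    2 => { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' },
);

# Hypothetical helper: the values of one column for every member of a group.
sub column_values {
    my ($grpid, $col) = @_;
    return map { $id{$_}{$col} } @{ $grp{$grpid} };
}

my @f3 = column_values(1, 'F3');
print "@f3\n";    # prints "subC4 subC3"
```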



------------------------------

Date: Fri, 05 Aug 2011 10:46:59 -0400
From: Bernie Cosell <bernie@fantasyfarm.com>
Subject: Re: Delaying interpolation in a qr
Message-Id: <puun37lprtfdflpqkcdpur6gl09rt841el@library.airnews.net>

"Uri Guttman" <uri@StemSystems.com> wrote:

} >>>>> "BC" == Bernie Cosell <bernie@fantasyfarm.com> writes:
}   BC> I actually have that..:o).  Problem is that several of the qr's
}   BC> really need to have a variable interpolated.
} 
} but when? you haven't really given a time flow here and explained the
} need for such a delay. i really do expect a simpler solution once i
} understand the reasoning behind the delay. 

As I said [twice I think], it is mostly a tidiness, reusability thing.  I'd
like to have these complicated REs in one place and in addition where they
can be used by several parts of the program.  This works beautifully and
perfectly for qr's that are "static" [that is, with no need for run-time
values to be interpolated] and is a technique I've used for a while.  In
*this* app, I have a similar "block" of qr's at the head of my program, all
nicely commented, etc, but I ran into a problem: a few of them need a
run-time interpolation and that's what I haven't been able to figure out
how to do.

}   BC> my $pat = qr{(...)\ +(\d+)\ (\d\d:\d\d:\d\d).*rip=$targetip,} ;
} 
}   BC> to pick out the records from the logfile from a particular IP
}   BC> addr.  I don't actually know the IP addr I want until I've
}   BC> processed some other logfiles, and so I have the RE set up with
}   BC> the "placeholder" variable.  If I bury the RE down into the
}   BC> subroutine, of course it all works just fine:
} 
}   BC> sub scan
}   BC> {   my $targetip = $_[0] ;
}   BC>     my $pat = qr[that stuff above] ;
}   BC> # and now I can do =~ /$pat/ and it works just fine
}   BC> }
} 
}   BC> but I would *really* like to have the [several similar] REs
}   BC> together at the head of the program where it is easy to find them
}   BC> and tweak them and keep them in sync, etc.
} 
} you can do that. just keep the dynamic late part out of those qr's.

I don't understand.  The "dynamic late part" is in the *middle* of the qr
[the actual qr from which I edited the example above actually goes on for
quite a while after the ',' after $targetip].

} .. when
} the actual value is passed to the sub, make a new regex with the proper
} qr and the new value.

I'd be happy to do that, but I can't quite figure out how.  I have no
problem with my subroutine looking like:
    {   my $targetip = $_[0] ;
        my $fixedRE = qr{<...???...>} ;
            ....
and then using $fixedRE in the code.  That'd be just fine: so what would I
do to my "global" RE so that when I re-qr it in the subroutine the right
thing happens.  Ah, I think I see what you were thinking:

} .... another idea is to make a hash of anon subs
} which return a qr built with one of the sub args. then your code will
} get the desired anon sub, pass in the var and get back a qr you can
} use. but again, this is overkill.

Right, I thought of that: make my "global array" not of qr's but anon subs.
A bit overkill, I agree...

} .. if you know the value only just before
} you need to use it in a regex, just interpolate it then. if it is inside
} a larger regex, create a pre- and post qr for those parts.

Ah, so what you're suggesting is instead of having a block of qr's I have
an array of text strings, I guess and so I might have
    @REs = (['(...)\ +(\d+)\ (\d\d:\d\d:\d\d).*rip=',
                    ',moreoftheRE'],
            ['start of next RE', 'tail of next RE'],
                ...
           ) ;
And then in the subroutine, I might call it as:
    sub scan
    {   my ($whichRE, $ip) = @_ ;
        my $pattern = qr{$REs[$whichRE][0]$ip$REs[$whichRE][1]} ;
         ...
Is that what you were thinking?  A bit less clean than I'd hoped, but it
does indeed work.  THANKS!  Working code:
---------------------
#!/usr/bin/perl

my $vbl = 'abc' ;
my $pat1 = '[!]' ;
my $pat2 = '\d' ;
$vbl = 'def' ;
	
sub check
{   my $vbl = 'ghi' ;
    my $pat = qr{$pat1$vbl$pat2} ;
    warn "first\n" if "!abc1" =~ /$pat/ ;
    warn "second\n" if "!def2" =~ /$pat/ ;
    warn "local\n" if "!ghi3" =~ /$pat/ ;
}
check() ;
---------------------------------
And sure enough, it picks "local".  Too bad there isn't some kind of
late-binding facility in perl, but failing that the "splice the parts
together" looks like it'll work.  THANKS!
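
For comparison, the "hash of anon subs" idea mentioned earlier in this message might look roughly like the sketch below. The names %re_for and scan_with and the sample log lines are invented for illustration; \Q...\E is an addition here so that the dots in the address are matched literally.

```perl
use strict;
use warnings;

# All patterns still live in one place at the head of the program, but each
# entry is a sub that builds the qr once the run-time value is known.
my %re_for = (
    by_ip => sub {
        my ($ip) = @_;
        return qr{(...)\ +(\d+)\ (\d\d:\d\d:\d\d).*rip=\Q$ip\E,};
    },
);

# Hypothetical scanner: pick a pattern builder by name, then filter lines.
sub scan_with {
    my ($which, $ip, @lines) = @_;
    my $pat = $re_for{$which}->($ip);
    return grep { $_ =~ $pat } @lines;
}

# Made-up log lines in the rough shape the pattern expects.
my @sample_log = (
    'Aug  5 10:46:59 host imap-login: rip=10.0.0.1, lip=10.0.0.9',
    'Aug  5 10:47:02 host imap-login: rip=10.0.0.2, lip=10.0.0.9',
);
my @found = scan_with('by_ip', '10.0.0.1', @sample_log);
```

The cost compared with plain qr's is one extra call per use, which is the "overkill" both posters refer to.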

  /Bernie\
-- 
Bernie Cosell                     Fantasy Farm Fibers
bernie@fantasyfarm.com            Pearisburg, VA
    -->  Too many people, too few sheep  <--          


------------------------------

Date: Fri, 05 Aug 2011 11:04:46 -0400
From: Bernie Cosell <bernie@fantasyfarm.com>
Subject: Re: Delaying interpolation in a qr
Message-Id: <p91o3799fj630j08t9rdbjmhno6o0lq7gg@library.airnews.net>

Bernie Cosell <bernie@fantasyfarm.com> wrote:

A friend was poking around in the more abstruse corners of REdom for me and
found a brilliant solution to this mess!  It involves (??{ }).  The following
code does exactly what I wanted:

} ------------------------------
} #!/usr/bin/perl
} 
} # Test of interpolation in qr's
} 
} my $vbl = 'abc' ;
} my $pat = qr{!(??{$vbl})!} ;
} $vbl = 'def' ;
} 
} sub check
} {   $vbl = 'ghi' ;
}     warn "first\n" if "!abc!" =~ /$pat/ ;
}     warn "second\n" if "!def!" =~ /$pat/ ;
}     warn "local\n" if "!ghi!" =~ /$pat/ ;
} }
} check() ;
} ------------------------------------

Note that I had to 'un my' the vbl in the subroutine -- apparently that
construct ties the RE to the variable mentioned (it feels almost like a
closure), just as I wanted, and so just assigning changes what the RE sees.
So in my logscan example, what I can do is:

my $targetip ;         # Assign desired IPaddress to this vbl in sub
my $pat = qr{(...)\ +(\d+)\ (\d\d:\d\d:\d\d).*rip=(??{$targetip}),} ;
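
A fuller sketch of the logscan case follows, with two caveats that are assumptions of this sketch rather than part of the post: the string returned by the (??{ }) block is compiled as a pattern, so quotemeta is used to keep the dots in the address literal, and the log lines are made up. The perlre documentation also attaches warnings to this construct, so it is worth reading before relying on it.

```perl
use strict;
use warnings;

my $targetip;    # filled in later, once the address is known

# (??{ ... }) runs its code at match time and compiles the result as a
# pattern, so quotemeta keeps the dots in the address literal.
my $pat = qr{(...)\ +(\d+)\ (\d\d:\d\d:\d\d).*rip=(??{ quotemeta $targetip }),};

sub scan {
    my ($ip, @lines) = @_;
    $targetip = $ip;                 # the pattern sees this new value
    return grep { /$pat/ } @lines;
}

# Made-up log lines in the rough shape the pattern expects.
my @log = (
    'Aug  5 10:46:59 host imap-login: rip=10.0.0.1, lip=10.0.0.9',
    'Aug  5 10:47:02 host imap-login: rip=10.0.0.2, lip=10.0.0.9',
    'Aug  5 10:47:05 host imap-login: rip=10a0b0c1, lip=10.0.0.9',
);
my @hits = scan('10.0.0.1', @log);
```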

  /Bernie\
-- 
Bernie Cosell                     Fantasy Farm Fibers
bernie@fantasyfarm.com            Pearisburg, VA
    -->  Too many people, too few sheep  <--          


------------------------------

Date: Fri, 05 Aug 2011 14:41:33 +0100
From: Ben Bacarisse <ben.usenet@bsb.me.uk>
Subject: Re: seeking advice on problem difficulty
Message-Id: <0.9023075ec05c58f409a6.20110805144133BST.87y5z8vuuq.fsf@bsb.me.uk>

"ela" <ela@yantai.org> writes:

> "Ben Bacarisse" <ben.usenet@bsb.me.uk> wrote in message 
> news:0.1974a0af27d9bc0648a0.20110804140923BST.87ty9xtjb0.fsf@bsb.me.uk...
>> Functions are crucial to managing complexity.  I'd want a function
>> 'most_frequent' that can take an array of values and find the frequency
>> of the most common value among them.  It could return both that value
>> and the frequency.  Something like:
>
> I'd appreciate if I can learn more from you about the thinking philosophy. 
> As said previously, I only thought of a lot of "if"'s and "hash"'s and never 
> able to use function to wrap up some of the concepts. Would you mind telling 
> me by which cues trigger you to think about using function? 

There's no easy answer to that.  You must try to get into the habit of
dreaming.  You think: what function, if it were available, would make
the job easier?  The academic answer is that you think "top down" -- you
imagine the very highest level of the program before you have any idea
how to write it:

  table = read_table();
  for each group:
    print classify(group, table);

You know that classify will need both the group list and the full table
to do its job so you pass these as parameters. Then you break down
classify:

  classify(group, table):
    for each column in table
       top_item = most_common_item_in(column, table);
       freq = freq_of(top_item, column, table);
       if freq > threshold
          return top_item
    return 'inconsistent'

Here you go "ah, classify needs to know the threshold" so you revise the
parameter list.

When you write most_common_item_in and freq_of you will find that they do
very similar things and you may decide to combine them.  That's what I
did.

The trouble with this plan (and why I say this is the academic answer)
is that this breakdown interacts at all stages with the design of the
data structures that you will use.  The result is, in practice, a lot
more going back and forth between different ideas.  When you find
something is getting too messy to write, you should re-think your data
structures.  That takes a lot of experience though.  What is too messy?
Maybe this is a messy problem and there is no neat solution?

I had exactly this problem.  I told you that you probably don't want an
array of columns because I'd missed the duplicate ID problem.  The
trouble was that I did not go back and revise my design.  If I had, I'd
have seen that a simple change to the data structure and the way it is
input is all that is needed.  If we store arrays of items at each
position it is simple to extract and "flatten" these before counting the
frequencies.  I.e. reading the data becomes

  my @column;
  while (<>) {
      chomp;
      my @row = split;
      my $c = 0;
      push @{$column[$c++]->[$row[0]]}, $_ foreach @row;
  }

The slice @{$column[$col]}[@some_array_of_rows] is now an array of
array references so we need to flatten it.  A function to do that is

  sub flatten { map {@$_} @_ }

The classify function might then be

  sub classify
  {
      my ($group, $table, $threshold) = @_;
      for (my $col = 1; $col < $#$table; $col++) {
          my ($item, $freq) =
              most_frequent(flatten(@{$table->[$col]}[@$group]));
          return $item if $freq >= $threshold;
      }
      return 'inconsistent';
  }

I start at 1 because I have retained the (possibly multiple) IDs in the
table.  It makes sense to retain them because it is logically possible
to classify by the first column as much as by any other, even
though you don't do this in your case.

Making everything a parameter to classify is obviously the right thing
to do but it does make the key line rather fussy.  Still, three short
functions and some input code is all it took in the end.
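
The body of most_frequent is not shown in this message. A minimal sketch of what it could look like, assuming only what is said above (it takes a list and returns both the commonest value and its frequency), is given below; the tie-breaking rule is an arbitrary choice made here.

```perl
use strict;
use warnings;

# Hypothetical most_frequent: returns (value, count) for the commonest item,
# or (undef, 0) for an empty list.
sub most_frequent {
    my %count;
    $count{$_}++ for @_;
    my ($best, $freq) = (undef, 0);
    for my $item (sort keys %count) {    # sorted so ties resolve predictably
        ($best, $freq) = ($item, $count{$item}) if $count{$item} > $freq;
    }
    return ($best, $freq);
}

# flatten as given in the post: expand each array reference in place.
sub flatten { map { @$_ } @_ }

my ($item, $freq) = most_frequent(flatten(['C1'], ['C1', 'C2'], ['C3']));
print "$item seen $freq times\n";        # prints "C1 seen 2 times"
```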

-- 
Ben.


------------------------------

Date: Fri, 05 Aug 2011 16:45:00 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: seeking advice on problem difficulty
Message-Id: <87zkjnx3pf.fsf@sapphire.mobileactivedefense.com>

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> "ela" <ela@yantai.org> writes:
>> "Ben Bacarisse" <ben.usenet@bsb.me.uk> wrote in message 

[...]

>> I'd appreciate if I can learn more from you about the thinking philosophy. 
>> As said previously, I only thought of a lot of "if"'s and "hash"'s and never 
>> able to use function to wrap up some of the concepts. Would you mind telling 
>> me by which cues trigger you to think about using function? 
>
> There's no easy answer to that.  You must try to get into the habit of
> dreaming.  You think: what function, if it were available, would make
> the job easier?  The academic answer is that you think "top down" -- you
> imagine the very highest level of the program before you have any idea
> how to write it:
>
>   table = read_table();
>   for each group:
>     print classify(group, table);
>
> You know that classify will need both the group list and the full table
> to do its job so you pass these as parameters. Then you break down
> classify:
>
>   classify(group, table):
>     for each column in table
>        top_item = most_common_item_in(column, table);
>        freq = freq_of(top_item, column, table);
>        if freq > threshold
>           return top_item
>     return 'inconsistent'
>
> Here you go "ah, classify needs to know the threshold" so you revise the
> parameter list.
>
> When you write most_common_item_in and freq_of you will find that they do
> very similar things and you may decide to combine them.  That's what I
> did.
>
> The trouble with this plan (and why I say this is the academic answer)
> is that this breakdown interacts at all stages with the design of the
> data structures that you will use.

I assure you that this is not 'the academic answer' but a perfectly
workable design methodology which has essentially been ignored ever
since its invention in the last century, using whatever pretext seemed
to be most suitable for that. For as long as the intent is to create
working and easily maintainable code in order to solve problems (as
opposed to, say, "contribute your name to the Linux kernel changelog
for the sake of it being in there") 'stepwise refinement' is
definitely worth trying instead of just assuming that it cannot
possibly work and hence - thank god! - 'we' can continue with the
time-honoured procedure of 'hacking away at it until it's all pieces'.


------------------------------

Date: Fri, 05 Aug 2011 21:16:31 +0100
From: Ben Bacarisse <ben.usenet@bsb.me.uk>
Subject: Re: seeking advice on problem difficulty
Message-Id: <0.75690d7d6e6251e677ea.20110805211631BST.87liv7wr4w.fsf@bsb.me.uk>

Rainer Weikusat <rweikusat@mssgmbh.com> writes:

> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
>> "ela" <ela@yantai.org> writes:
>>> "Ben Bacarisse" <ben.usenet@bsb.me.uk> wrote in message 
>
> [...]
>
>>> I'd appreciate if I can learn more from you about the thinking philosophy. 
>>> As said previously, I only thought of a lot of "if"'s and "hash"'s and never 
>>> able to use function to wrap up some of the concepts. Would you mind telling 
>>> me by which cues trigger you to think about using function? 
>>
>> There's no easy answer to that.  You must try to get into the habit of
>> dreaming.  You think: what function, if it were available, would make
>> the job easier?  The academic answer is that you think "top down" -- you
>> imagine the very highest level of the program before you have any idea
>> how to write it:
>>
>>   table = read_table();
>>   for each group:
>>     print classify(group, table);
>>
>> You know that classify will need both the group list and the full table
>> to do its job so you pass these as parameters. Then you break down
>> classify:
>>
>>   classify(group, table):
>>     for each column in table
>>        top_item = most_common_item_in(column, table);
>>        freq = freq_of(top_item, column, table);
>>        if freq > threshold
>>           return top_item
>>     return 'inconsistent'
>>
>> Here you go "ah, classify needs to know the threshold" so you revise the
>> parameter list.
>>
>> When you write most_common_item_in and freq_of you will find that they do
>> very similar things and you may decide to combine them.  That's what I
>> did.
>>
>> The trouble with this plan (and why I say this is the academic answer)
>> is that this breakdown interacts at all stages with the design of the
>> data structures that you will use.
>
> I assure you that this is not 'the academic answer' but a perfectly
> workable design methodology

I did not say it was unworkable.  It's perfectly workable.  However, the
description I gave is "academic" in the sense that it is too simple.  At
least that is my experience.  One rarely finds novel algorithms that
way, so when a problem has an interesting algorithmic core, top-down
design will often miss some interesting solutions.  Also it does not
always lead to good data structures first time round.  I often find
myself backing up a few levels, re-jigging the data and setting off
again in a slightly different direction.

> which has essentially been ignored ever
> since its invention in the last century, using whatever pretext seemed
> to be most suitable for that.

Has it been ignored?  I was taught it and I taught it to others.  The
last time I knew about such things (about a decade ago) it was widely
taught in UK universities.

> For as long as the intent is to create
> working and easily maintainable code in order to solve problems (as
> opposed to, say, "contribute your name to the Linux kernel changelog
> for the sake of it being in there") 'stepwise refinement' is
> definitely worth trying instead of just assuming that it cannot
> possibly work and hence - thank god! - 'we' can continue with the
> time-honoured procedure of 'hacking away at it until it's all pieces'.

I hope you did not think I was suggesting that it could not possibly
work.  I explained my stepwise approach to the problem precisely because
it led to a simple and clean solution.  Maybe you took "academic" to
mean "impractical" -- I meant only "simplified for pedagogic reasons".

-- 
Ben.


------------------------------

Date: Fri, 5 Aug 2011 23:17:06 -0700
From: "ela" <ela@yantai.org>
Subject: Re: Sorting hash %seen
Message-Id: <j1gtt0$gsk$1@ijustice.itsc.cuhk.edu.hk>


"Tad McClellan" <tadmc@seesig.invalid> wrote in message 
news:slrnj3nneb.c9o.tadmc@tadbox.sbcglobal.net...
> ela <ela@yantai.org> wrote:
>
>> my @sorted_keys = sort { $seen{$b} <=> $seen{$a}}
>
>
> You have left off the list of things to be sorted.

Adding %seen solves the problem, though the warning "Use of uninitialized value 
in numeric comparison (<=>) at test.pl" still appears.

Due to time limitations, I shall focus on working out a draft solution first. 
Thank you very much for being patient in teaching me netiquette and a 
lot of other stuff over the years. 




------------------------------

Date: Fri, 05 Aug 2011 15:45:31 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Sorting hash %seen
Message-Id: <87d3gjyl10.fsf@sapphire.mobileactivedefense.com>

Justin C <justin.1104@purestblue.com> writes:
> On 2011-08-06, ela <ela@yantai.org> wrote:
>>
>> "Rainer Weikusat" <rweikusat@mssgmbh.com> wrote in message
>>
>>> Then, you'll need something similar to Ben's most_frequent routine
>>
>> I managed to take frequency by:
>>
>>     $seen{$_}++ for (map { $id{$_}{$col} } @{$grp{$grpid}});
>
> I've not tried it (and, TBH, I have trouble getting my head round it,
> map is still new to me) but that doesn't look right to me.
>
> AIUI, the $_ in $seen{$_}++ takes whatever is in $_ before you do the
> 'map', the $_ in the map function is, I believe, local to the map
> function and therefore not available outside the function.

Yes. But the for/ foreach loop which processes the 'output' of the
map expression successively binds $_ to each of the values returned by
that.
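
A small demonstration of the two bindings, using made-up data in the shape discussed in this thread:

```perl
use strict;
use warnings;

my %grp = (1 => [1, 2]);
my %id  = (
    1 => { F2 => 'C1' },
    2 => { F2 => 'C1' },
);
my ($grpid, $col) = (1, 'F2');

my %seen;
# Inside the map block, $_ is a group member (1, then 2).
# In the statement modifier, $_ is each value map returned ('C1', 'C1').
$seen{$_}++ for map { $id{$_}{$col} } @{ $grp{$grpid} };

print "$_ => $seen{$_}\n" for sort keys %seen;    # prints "C1 => 2"
```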


------------------------------

Date: Fri, 05 Aug 2011 15:54:00 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Sorting hash %seen
Message-Id: <878vr7ykmv.fsf@sapphire.mobileactivedefense.com>

Tad McClellan <tadmc@seesig.invalid> writes:
> Justin C <justin.1104@purestblue.com> wrote:
>> On 2011-08-06, ela <ela@yantai.org> wrote:
>>> "Rainer Weikusat" <rweikusat@mssgmbh.com> wrote in message
>>>
>>>> Then, you'll need something similar to Ben's most_frequent routine
>>>
>>> I managed to take frequency by:
>>>
>>>     $seen{$_}++ for (map { $id{$_}{$col} } @{$grp{$grpid}});
>>
>> I've not tried it (and, TBH, I have trouble getting my head round it,
>> map is still new to me) but that doesn't look right to me.
>>
>> AIUI, the $_ in $seen{$_}++ takes whatever is in $_ before you do the
>> 'map', 
>
> No, it takes on the values returned from the map.
>
> Rewrite it, and it may become more clear:
>
>     foreach (map { $id{$_}{$col} } @{$grp{$grpid}}) {
>         $seen{$_}++
>     }

Someone who is familiar with the semantics of a loop executing the
code contained in a block, but not with the semantics of the equivalent
statement modifier, will understand the 'code with the block' and won't
understand the 'code with statement modifier'.  Neither form has any
meaning that someone completely without knowledge of the defined
semantics of each construct can understand, and both constructs
necessarily have 'clearly' defined semantics because otherwise a
computer couldn't execute them.  IOW, this is not a problem of one
variant vs the other BUT of availability of or lack of knowledge on the
part of the person trying to understand the code.


------------------------------

Date: Fri, 05 Aug 2011 09:31:19 -0700
From: Jim Gibson <jimsgibson@gmail.com>
Subject: Re: Sorting hash %seen
Message-Id: <050820110931195121%jimsgibson@gmail.com>

In article <j1gtt0$gsk$1@ijustice.itsc.cuhk.edu.hk>, ela
<ela@yantai.org> wrote:

> "Tad McClellan" <tadmc@seesig.invalid> wrote in message 
> news:slrnj3nneb.c9o.tadmc@tadbox.sbcglobal.net...
> > ela <ela@yantai.org> wrote:
> >
> >> my @sorted_keys = sort { $seen{$b} <=> $seen{$a}}
> >
> >
> > You have left off the list of things to be sorted.
> 
> adding %seen solves the problem though warning "Use of uninitialized value 
> in numeric comparison (<=>) at test.pl" still exists.
> 
> Due to time limitation, I shall focus on working out a draft solution first 
> and thank you very much for being patient in teaching me netiquette and a 
> lot of other stuff these years. 

You don't want to sort %seen, as that has no meaning. You want to sort
the keys of %seen:

  my @sorted_keys = sort { $seen{$b} <=> $seen{$a}} keys %seen;
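
To see why sorting %seen itself produces the "uninitialized value" warning, a short sketch with made-up counts may help: a hash in list context flattens to alternating keys and values, so the sort block ends up looking the counts up as if they were keys.

```perl
use strict;
use warnings;

my %seen = (C1 => 3, C2 => 1, C3 => 2);    # made-up counts

# A hash in list context flattens to alternating keys and values.
my @flat = %seen;                           # six elements, not three

# Looking a count up as if it were a key finds nothing, which is
# where the "uninitialized value" warning comes from.
my $bogus = $seen{3};                       # undef: 3 is a value, not a key

# Sorting only the keys, most frequent first.
my @sorted_keys = sort { $seen{$b} <=> $seen{$a} } keys %seen;
print "@sorted_keys\n";                     # prints "C1 C3 C2"
```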

-- 
Jim Gibson


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3466
***************************************

