[24371] in Perl-Users-Digest
Perl-Users Digest, Issue: 6560 Volume: 10
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu May 13 18:10:48 2004
Date: Thu, 13 May 2004 15:10:10 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Thu, 13 May 2004 Volume: 10 Number: 6560
Today's topics:
Re: Using hashes to sort number sequences (Martin Foster)
Re: Using hashes to sort number sequences <usenet@morrow.me.uk>
Re: Using hashes to sort number sequences (Anno Siegel)
Re: Using hashes to sort number sequences (Anno Siegel)
Re: Using hashes to sort number sequences <usenet@morrow.me.uk>
Re: Using match variables ($1, $2 ...) as variables. (Kevin Collins)
Re: Using match variables ($1, $2 ...) as variables. <ThomasKratz@REMOVEwebCAPS.de>
Re: Wanted: Perl Developer <dha@panix.com>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: 13 May 2004 09:03:38 -0700
From: mdfoster44@netscape.net (Martin Foster)
Subject: Re: Using hashes to sort number sequences
Message-Id: <6a20f90a.0405130803.7e3567f4@posting.google.com>
Bob Walton <invalid-email@rochester.rr.com> wrote in message news:<40A2E65B.1020108@rochester.rr.com>...
> Martin Foster wrote:
>
> ...
> > I have two files: a.txt & b.txt
> >
> > a.txt=
> > 191_6_270328 T1 4 10 19 34 55 72 88 116 157 200 280 332 388 451 756 4
> > 0 5 0 4 0 6 2 6 2 8 0
> > 191_6_270328 T2 4 9 17 22 34 56 83 112 146 181 266 320 376 431 665 3 0
> ...
> > b.txt=
> > 191_6_9908682 T1 4 8 14 25 41 60 83 115 153 190 276 321 374 437 694 4
> > 0 4 0 4 0 6 0 4 0 8 0
> > 191_6_9908682 T2 4 10 19 30 44 64 92 122 155 198 285 338 394 446 739 4
> > 0 5 0 4 0 6 0 8 0 8 2
> ...
>
>
> > Each file contains in the first column an identifier, I call it $name.
> > The 2nd column contains an entry T1 or T2 or T3 ... until T6.
> > After these two columns each row contains a number sequence.
> >
> > What I would like to do is to read file a.txt, six lines at a time
> > (from T1 to T6)
> > and search for similar number sequences in file b.txt.
> > The number sequences in file b.txt must also be within each block of
> > six lines,
> > but they can be in any order.
>
>
> Why don't you just sort (using the Unix or maybe even the Win32 sort
> command) the two files, and then, using Perl, read and compare from the
> two sorted files? Or maybe the -u switch on Unix's sort could give you
> what you want in one go. Or maybe (if the data for matching lines is
> all the same), after the sorts, use diff to do the compare, and just
> process the output of diff with Perl? Or if there is something in the
> data which indicates if it from INFILE1 versus INFILE2, the files could
> be concatenated, sorted, and processed as one file (I don't think that
> last method would have any advantages).
>
I may need to tell you a little more about the data, I'm not sure a sort
would help me but maybe you have an idea.
Each $name tag is the name of a crystal structure. Each T1, T2, etc describes
an atom. For each structure there are six atoms. To identify if two crystal
structures are the same, one can compare the coordination sequences ( the number
sequences that follow the T1, T2, etc). For each structure all six sequences,
must completely match another six sequences of another structure, but they can
be in any order, ie T1, T2s may be called T3, T6 or whatever. The important
part is that each structure has six lines, which is why I want to read
them in separately. If I do a sort I will get matching lines of sequences
grouped together. For some structures, only one or two lines will match the
original structure and I will have to do careful counting throughout the
output to get what I want.
> That sort (punny, huh?) of method will avoid reading your $infile2 many
> hundreds of thousands of times, which will take almost forever.
>
Oh no, I hope not! My first attempt was to do this directly from the
MySQL database. I retrieved the data with queries for each structure.
That did take forever! So now I've put all the data into files.
I may need to rewrite the script to not reopen the b file again and again.
Maybe by passing in all the data to arrays and then shifting six lines of the
array into the hashes. I've got 512mb memory, how big can the arrays be?
I've got 29 columns and the a & b files have ~127,000 rows. I'm not sure.
> BTW, you would need to either close and reopen the file in your inner
> loop, or seek() it back to the beginning every time you go through the
> outer loop.
I will try this. Thanks.
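
The seek() suggestion can be sketched roughly as follows. This is only an illustration: the in-memory string stands in for b.txt, and the variable names are made up for the example, not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical in-memory "file" standing in for b.txt, so the sketch
# is self-contained.
my $data = "line 1\nline 2\nline 3\n";
open my $fh, '<', \$data or die "can't open in-memory file: $!";

my @passes;
for my $pass (1 .. 2) {
    # Rewind to the start before each pass; without this the second
    # pass would find the handle already at end of file.
    seek $fh, 0, 0 or die "can't seek: $!";
    my $count = 0;
    $count++ while <$fh>;
    push @passes, $count;
}
print "pass $_ read $passes[$_ - 1] lines\n" for 1 .. 2;
```

Rewinding avoids the cost of reopening the file, but every pass still rereads all of it, so it does not remove the underlying quadratic behaviour.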
> Also recognize that your while(<INFILE1>) and
> while(<INFILE2>) constructions will read a record from the corresponding
> file and place it into $_. You are discarding that data, so you are
> really reading data 7 records at a time, discarding the first of each
> chunk of 7.
I'm not quite sure what you mean by this. Do you suggest to use another
variable for $_ in the inner loop?
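
A small sketch of the effect described in the quoted paragraph, assuming a loop shaped like "while (<FH>) { read six more lines }". The data and variable names are invented for the illustration.

```perl
use strict;
use warnings;

# Fourteen made-up records in an in-memory file.
my $data = '';
$data .= "record $_\n" for 1 .. 14;
open my $fh, '<', \$data or die "can't open in-memory file: $!";

my (@discarded, @kept);
while (<$fh>) {
    # The loop condition has already read one record into $_ ...
    chomp;
    push @discarded, $_;
    # ... so reading six more in the body consumes seven per iteration.
    for my $n (1 .. 6) {
        my $line = <$fh>;
        last unless defined $line;
        chomp $line;
        push @kept, $line;
    }
}
print "record read by the while condition: $_\n" for @discarded;
```

One way around it is to treat the record in $_ as the first of the six and read only five more in the body; another is to do all six reads explicitly and not use while (<FH>) at all.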
>
> HTH.
>
>
> ...
>
>
> > Martin.
------------------------------
Date: Thu, 13 May 2004 16:47:34 +0000 (UTC)
From: Ben Morrow <usenet@morrow.me.uk>
Subject: Re: Using hashes to sort number sequences
Message-Id: <c808r6$n7v$1@wisteria.csv.warwick.ac.uk>
Quoth mdfoster44@netscape.net (Martin Foster):
> Bob Walton <invalid-email@rochester.rr.com> wrote in message news:<40A2E65B.1020108@rochester.rr.com>...
> > Martin Foster wrote:
> > ...
> > > I have two files: a.txt & b.txt
> > >
> > > a.txt=
> > > 191_6_270328 T1 4 10 19 34 55 72 88 116 157 200 280 332 388 451 756 4
> > > 0 5 0 4 0 6 2 6 2 8 0
> > > 191_6_270328 T2 4 9 17 22 34 56 83 112 146 181 266 320 376 431 665 3 0
> > ...
> > > b.txt=
> > > 191_6_9908682 T1 4 8 14 25 41 60 83 115 153 190 276 321 374 437 694 4
> > > 0 4 0 4 0 6 0 4 0 8 0
> > > 191_6_9908682 T2 4 10 19 30 44 64 92 122 155 198 285 338 394 446 739 4
> > > 0 5 0 4 0 6 0 8 0 8 2
> > ...
> > > Each file contains in the first column an identifier, I call it $name.
> > > The 2nd column contains an entry T1 or T2 or T3 ... until T6.
> > > After these two columns each row contains a number sequence.
> > >
> > > What I would like to do is to read file a.txt, six lines at a time
> > > (from T1 to T6)
> > > and search for similar number sequences in file b.txt.
> > > The number sequences in file b.txt must also be within each block of
> > > six lines,
> > > but they can be in any order.
>
> Each $name tag is the name of a crystal structure. Each T1, T2, etc describes
> an atom. For each structure there are six atoms. To identify if two crystal
> structures are the same, one can compare the coordination sequences ( the number
> sequences that follow the T1, T2, etc). For each structure all six sequences,
> must completely match another six sequences of another structure, but they can
> be in any order, ie T1, T2s may be called T3, T6 or whatever. The important
> part is that each structure has six lines, which is why I want to read
> them in separately. If I do a sort I will get matching lines of sequences
> grouped together. For some structures, only one or two lines will match the
> original structure and I will have to do careful counting throughout the
> output to get what I want.
>
> > That sort (punny, huh?) of method will avoid reading your $infile2 many
> > hundreds of thousands of times, which will take almost forever.
> >
> Oh no, I hope not! My first attempt was to do this directly from the
> MySQL database. I retrieved the data with queries for each structure.
> That did take forever! So now I've put all the data into files.
>
> I may need to rewrite the script to not reopen the b file again and again.
> Maybe by passing in all the data to arrays and then shifting six lines of the
> array into the hashes. I've got 512mb memory, how big can the arrays be?
> I've got 29 columns and the a & b files have ~127,000 rows. I'm not sure.
Try something like this (untested):
#!/usr/bin/perl -l

use strict;
use warnings;
use Symbol;

# the (*) means that the sub takes one parameter, which will be interpreted
# as a filehandle
sub read_crystal (*) {
    # this will find the correct FH in case this sub is called
    # from a different package (string and bareword FH names are
    # relative to the current package, with exceptions for STDIN etc.)
    my $FH = Symbol::qualify_to_ref $_[0], caller;

    my (@atoms, $crystal);
    for (1..6) {
        my $line = <$FH>;
        defined $line or die "read error: $!";
        $line or return;
        my ($tmp, undef, $atom) = split ' ', $line, 3;
        $crystal ||= $tmp;
        $crystal eq $tmp or die "bad file format: $tmp ne $crystal";
        push @atoms, $atom;
    }

    # this creates a canonical representation of a given crystal
    # (i.e. it will be the same regardless of the order the atoms
    # were given in the file)
    my $canon = join '|', sort @atoms;

    return ($crystal, $canon);
}

my %crystals;

{
    # lexical FHs like this are closed automatically when they
    # go out of scope
    open my $B, '<', 'b.txt' or die "can't open b.txt: $!";

    while (my ($crystal, $canon) = read_crystal $B) {
        $crystals{$canon} = $crystal;
    }
}

{
    open my $A, '<', 'a.txt' or die "can't open a.txt: $!";

    while (my ($crystal, $canon) = read_crystal $A) {
        if ($crystals{$canon}) {
            print "$crystal is the same as $crystals{$canon}";
        }
    }
}

__END__
If b.txt is too large, and you do run out of memory (or it is
unacceptably slow), you can speed things up by tying %crystals to a db
file:
use DB_File;
tie my %crystals, DB_File => 'b.db' or die "can't create b.db: $!";
This will also allow you to create the db from b.txt once and then check
several different files against it.
Ben
--
Joy and Woe are woven fine,
A Clothing for the Soul divine William Blake
Under every grief and pine 'Auguries of Innocence'
Runs a joy with silken twine. ben@morrow.me.uk
------------------------------
Date: 13 May 2004 17:13:44 GMT
From: anno4000@lublin.zrz.tu-berlin.de (Anno Siegel)
Subject: Re: Using hashes to sort number sequences
Message-Id: <c80ac8$9fm$1@mamenchi.zrz.TU-Berlin.DE>
Martin Foster <mdfoster44@netscape.net> wrote in comp.lang.perl.misc:
> Bob Walton <invalid-email@rochester.rr.com> wrote in message
> news:<40A2E65B.1020108@rochester.rr.com>...
> > Martin Foster wrote:
> >
> > ...
> > > I have two files: a.txt & b.txt
> > >
> > > a.txt=
> > > 191_6_270328 T1 4 10 19 34 55 72 88 116 157 200 280 332 388 451 756 4
> > > 0 5 0 4 0 6 2 6 2 8 0
> > > 191_6_270328 T2 4 9 17 22 34 56 83 112 146 181 266 320 376 431 665 3 0
> > ...
> > > b.txt=
> > > 191_6_9908682 T1 4 8 14 25 41 60 83 115 153 190 276 321 374 437 694 4
> > > 0 4 0 4 0 6 0 4 0 8 0
> > > 191_6_9908682 T2 4 10 19 30 44 64 92 122 155 198 285 338 394 446 739 4
> > > 0 5 0 4 0 6 0 8 0 8 2
> > ...
[...]
> > Why don't you just sort (using the Unix or maybe even the Win32 sort
[...]
> I may need to tell you a little more about the data, I'm not sure a sort
> would help me but maybe you have an idea.
>
> Each $name tag is the name of a crystal structure. Each T1, T2, etc describes
> an atom. For each structure there are six atoms. To identify if two crystal
> structures are the same, one can compare the coordination sequences ( the number
> sequences that follow the T1, T2, etc). For each structure all six sequences,
> must completely match another six sequences of another structure, but they can
> be in any order, ie T1, T2s may be called T3, T6 or whatever. The important
> part is that each structure has six lines, which is why I want to read
> them in separately. If I do a sort I will get matching lines of sequences
> grouped together. For some structures, only one or two lines will match the
> original structure and I will have to do careful counting throughout the
> output to get what I want.
If I get that right, there is a set of atoms (represented by sequences
of numbers), and a crystal (structure) is a sequence of six atoms. The
problem is to find the sequences that are permutations of each other.
If I got that entirely wrong, you can stop reading now.
Otherwise, the straightforward solution involves indeed sorting, but
not of the file as a whole, but of each set of six atoms. After sorting,
two permutations of the same atoms are equal (no matter how you sort).
This reduces the problem to finding the elements in a list that are
the same. Perl's standard solutions (involving a hash) apply.
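
A minimal sketch of that idea, with made-up data: three short "atoms" per structure instead of six long ones. The structure names and numbers are illustrative assumptions only.

```perl
use strict;
use warnings;

# Invented structures; each atom is written as its number sequence.
my %structures = (
    A1 => [ '4 10 19', '4 9 17', '4 8 14' ],
    B1 => [ '4 8 14', '4 10 19', '4 9 17' ],   # same atoms as A1, reordered
    B2 => [ '4 8 14', '4 10 19', '4 9 18' ],   # differs in one atom
);

# Group structure names by canonical key: the sorted atoms joined with a
# separator that cannot occur inside an atom.
my %by_key;
for my $name (sort keys %structures) {
    my $key = join '|', sort @{ $structures{$name} };
    push @{ $by_key{$key} }, $name;
}

# Keep only the keys shared by more than one structure.
my @groups = map { join ' ', @$_ } grep { @$_ > 1 } values %by_key;
@groups = sort @groups;
print "same structure: $_\n" for @groups;
```

A plain string sort is enough here because the order only has to be consistent, not numerically meaningful.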
In the actual case it may pay to re-encode the atoms with shorter
strings, which would save storage and might reduce sort time. I'm
not sure about the effect of key length on Perl's string sort. Uri?
How many different atoms are there? If they represent actual chemical
elements there can't be too many.
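
The re-encoding could look something like this sketch. The helper name atom_id and the sample sequences are hypothetical, and whether it actually saves time on the real data is untested.

```perl
use strict;
use warnings;

my %id_of;
my $next_id = 0;

# Return a small integer for each distinct sequence, assigning a new
# one the first time a sequence is seen.
sub atom_id {
    my ($sequence) = @_;
    $id_of{$sequence} = $next_id++ unless exists $id_of{$sequence};
    return $id_of{$sequence};
}

# Two invented crystals containing the same atoms in a different order.
my @crystal_1 = ('4 10 19 34', '4 9 17 22', '4 10 19 34');
my @crystal_2 = ('4 9 17 22', '4 10 19 34', '4 10 19 34');

# Sorting the IDs numerically gives a short canonical key per crystal.
my $key_1 = join ',', sort { $a <=> $b } map { atom_id($_) } @crystal_1;
my $key_2 = join ',', sort { $a <=> $b } map { atom_id($_) } @crystal_2;

print "key 1: $key_1\nkey 2: $key_2\n";
```

The lookup table itself still holds every distinct sequence once, so the saving depends on how often sequences repeat across structures.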
Before I go on further I'd like some feedback if this sounds plausible
at all.
Anno
------------------------------
Date: 13 May 2004 17:35:57 GMT
From: anno4000@lublin.zrz.tu-berlin.de (Anno Siegel)
Subject: Re: Using hashes to sort number sequences
Message-Id: <c80blt$a4b$1@mamenchi.zrz.TU-Berlin.DE>
Ben Morrow <usenet@morrow.me.uk> wrote in comp.lang.perl.misc:
>
> Quoth mdfoster44@netscape.net (Martin Foster):
> > Bob Walton <invalid-email@rochester.rr.com> wrote in message
> news:<40A2E65B.1020108@rochester.rr.com>...
> > > Martin Foster wrote:
[snip problem to get to the code]
> Try something like this (untested):
>
> #!/usr/bin/perl -l
>
> use strict;
> use warnings;
> use Symbol;
>
> # the (*) means that the sub takes one parameter, which will be interpreted
> # as a filehandle
> sub read_crystal (*) {
>     # this will find the correct FH in case this sub is called
>     # from a different package (string and bareword FH names are
>     # relative to the current package, with exceptions for STDIN etc.)
>     my $FH = Symbol::qualify_to_ref $_[0], caller;
>
>     my (@atoms, $crystal);
>     for (1..6) {
>         my $line = <$FH>;
>         defined $line or die "read error: $!";
>         $line or return;
This test won't work unless $line is chomp'ed. I'd append "...or return"
to the "split" line that comes next.
> my ($tmp, undef, $atom) = split ' ', $line, 3;
[snip rest, nothing to improve there]
That's the algorithm I also recommended in another post to this thread,
presented in beautiful Perl. That program was a good read.
Anno
------------------------------
Date: Thu, 13 May 2004 20:36:27 +0000 (UTC)
From: Ben Morrow <usenet@morrow.me.uk>
Subject: Re: Using hashes to sort number sequences
Message-Id: <c80m8b$2he$1@wisteria.csv.warwick.ac.uk>
Quoth anno4000@lublin.zrz.tu-berlin.de (Anno Siegel):
> Ben Morrow <usenet@morrow.me.uk> wrote in comp.lang.perl.misc:
> >
> > my $line = <$FH>;
> > defined $line or die "read error: $!";
> > $line or return;
>
> This test won't work unless $line is chomp'ed. I'd append "...or return"
> to the "split" line that comes next.
Arrgh, no, there's a more important bug... I was assuming <> returned
undef only on error, and false-but-defined on EOF. So, what I meant was:
undef $!;
my $line = <$FH>;
$! and die "read error: $!";
$line or return;
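
Putting both corrections together, the reading sub might end up roughly like the sketch below. Two things here are substitutions rather than anything posted in this thread: it tests defined() instead of truth, and it asks IO::Handle's error() method whether the read failed instead of inspecting $!. The test data is invented.

```perl
use strict;
use warnings;
use IO::Handle;   # provides the error() method on lexical filehandles

# Read one six-line crystal; returns (name, canonical key), or an empty
# list at a clean end of file.
sub read_crystal {
    my ($fh) = @_;
    my (@atoms, $crystal);
    for (1 .. 6) {
        my $line = <$fh>;
        unless (defined $line) {
            die "read error\n" if $fh->error;
            die "file ends in the middle of $crystal\n" if @atoms;
            return;    # empty list ends the caller's while loop
        }
        chomp $line;
        my ($name, undef, $atom) = split ' ', $line, 3;
        defined $atom or die "bad line: '$line'\n";
        $crystal = $name unless defined $crystal;
        $crystal eq $name or die "bad file format: $name ne $crystal\n";
        push @atoms, $atom;
    }
    return ($crystal, join '|', sort @atoms);
}

# Two made-up crystals with the same atoms in opposite order.
my $data = '';
$data .= sprintf "X1 T%d 4 %d 9\n", $_, $_     for 1 .. 6;
$data .= sprintf "X2 T%d 4 %d 9\n", $_, 7 - $_ for 1 .. 6;
open my $in, '<', \$data or die "can't open in-memory file: $!";

my @seen;
while (my ($crystal, $canon) = read_crystal($in)) {
    push @seen, [ $crystal, $canon ];
}
print "$_->[0]: $_->[1]\n" for @seen;
```

Dying on a file that stops partway through a crystal is an extra check added for the sketch; dropping it gives back the quieter behaviour of the posted code.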
Ben
--
$.=1;*g=sub{print@_};sub r($$\$){my($w,$x,$y)=@_;for(keys%$x){/main/&&next;*p=$
$x{$_};/(\w)::$/&&(r($w.$1,$x.$_,$y),next);$y eq\$p&&&g("$w$_")}};sub t{for(@_)
{$f&&($_||&g(" "));$f=1;r"","::",$_;$_&&&g(chr(0012))}};t # ben@morrow.me.uk
$J::u::s::t, $a::n::o::t::h::e::r, $P::e::r::l, $h::a::c::k::e::r, $.
------------------------------
Date: Thu, 13 May 2004 18:27:19 GMT
From: spamtotrash@toomuchfiction.com (Kevin Collins)
Subject: Re: Using match variables ($1, $2 ...) as variables.
Message-Id: <slrnca7fg7.9gp.spamtotrash@doom.unix-guy.com>
In article <c7us1e$1m6r$1@agate.berkeley.edu>, Ilya Zakharevich wrote:
> [A complimentary Cc of this posting was sent to
> Ravi Parimi
><parimi@none.nowhere.com>], who wrote in article
><Pine.GSO.4.58.0405101802580.18191@shellfish.ece.arizona.edu>:
>
>> Usage of ${$i} is incorrect and doesnt make sense.
>
> Could you elaborate on this point?
>
> >perl -wle "'abcde' =~ /(.)c(.)/ or die; $v=2; print $$v"
> d
And ${$v} is another way of writing $$v... see 'perldoc perlref', specifically
item #2 under section "Using References".
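
As a small sketch of the equivalence, here is the one-liner expanded into a script. Under "use strict" the symbolic reference has to be allowed explicitly; the array-based alternative at the end is an addition for comparison, not something proposed in the thread.

```perl
use strict;
use warnings;

'abcde' =~ /(.)c(.)/ or die "no match";
my $v = 2;

my ($plain, $braced);
{
    # Symbolic references are forbidden under "use strict 'refs'",
    # so relax it for this block only.
    no strict 'refs';
    $plain  = $$v;      # the variable named "2", i.e. $2
    $braced = ${$v};    # the same thing with explicit braces
}

# Alternative without symbolic references: keep the captures in a list
# and index it (capture N is element N - 1).
my @captures = 'abcde' =~ /(.)c(.)/;
my $by_index = $captures[ $v - 1 ];

print "$plain $braced $by_index\n";
```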
Kevin
------------------------------
Date: Thu, 13 May 2004 21:03:06 +0200
From: Thomas Kratz <ThomasKratz@REMOVEwebCAPS.de>
Subject: Re: Using match variables ($1, $2 ...) as variables.
Message-Id: <40a3c7dd.0@juno.wiesbaden.netsurf.de>
Kevin Collins wrote:
> In article <c7us1e$1m6r$1@agate.berkeley.edu>, Ilya Zakharevich wrote:
>
>>[A complimentary Cc of this posting was sent to
>>Ravi Parimi
>><parimi@none.nowhere.com>], who wrote in article
>><Pine.GSO.4.58.0405101802580.18191@shellfish.ece.arizona.edu>:
>>
>>
>>>Usage of ${$i} is incorrect and doesnt make sense.
>>
>>Could you elaborate on this point?
>>
>> >perl -wle "'abcde' =~ /(.)c(.)/ or die; $v=2; print $$v"
>> d
>
>
> And ${$v} is another way of writing $$v... see 'perldoc perlref', specifically
> item #2 under section "Using References".
You are missing the point. Ilya just wanted to know how the usage of ${$i}
is incorrect or doesn't make sense. And he provided a syntactically
correct example, that does make some sense (for arbitrary values of sense).
Thomas
--
open STDIN,"<&DATA";$=+=14;$%=50;while($_=(seek( #J~.> a>n~>>e~.......>r.
STDIN,$:*$=+$,+$%,0),getc)){/\./&&last;/\w| /&&( #.u.t.^..oP..r.>h>a~.e..
print,$_=$~);/~/&&++$:;/\^/&&--$:;/>/&&++$,;/</ #.>s^~h<t< ..~. ...c.^..
&&--$,;$:%=4;$,%=23;$~=$_;++$i==1?++$,:_;}__END__#....>>e>r^..>l^...>k^..
------------------------------
Date: Thu, 13 May 2004 18:49:13 +0000 (UTC)
From: "David H. Adler" <dha@panix.com>
Subject: Re: Wanted: Perl Developer
Message-Id: <slrnca7gp9.fg9.dha@panix2.panix.com>
In article <58a37fec.0405011202.6f710c62@posting.google.com>, Stephen
O'Brien wrote:
> I need a perl developer for the following job:
You have posted a job posting or a resume in a technical group.
Longstanding Usenet tradition dictates that such postings go into
groups with names that contain "jobs", like "misc.jobs.offered", not
technical discussion groups like the ones to which you posted.
Had you read and understood the Usenet user manual posted frequently to
"news.announce.newusers", you might have already known this. :) (If
n.a.n is quieter than it should be, the relevant FAQs are available at
http://www.faqs.org/faqs/by-newsgroup/news/news.announce.newusers.html)
Another good source of information on how Usenet functions is
news.newusers.questions (information from which is also available at
http://www.geocities.com/nnqweb/).
Please do not explain your posting by saying "but I saw other job
postings here". Just because one person jumps off a bridge, doesn't
mean everyone does. Those postings are also in error, and I've
probably already notified them as well.
If you have questions about this policy, take it up with the news
administrators in the newsgroup news.admin.misc.
http://jobs.perl.org may be of more use to you
Yours for a better usenet,
dha
--
David H. Adler - <dha@panix.com> - http://www.panix.com/~dha/
We went on holiday by mistake - Withnail
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc. For subscription or unsubscription requests, send
#the single line:
#
# subscribe perl-users
#or:
# unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.
NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V10 Issue 6560
***************************************