[32453] in Perl-Users-Digest


Perl-Users Digest, Issue: 3720 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Jun 19 16:14:24 2012

Date: Tue, 19 Jun 2012 13:14:11 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Tue, 19 Jun 2012     Volume: 11 Number: 3720

Today's topics:
        question concerning pipes and large strings <mathematisch@gmail.com>
    Re: question concerning pipes and large strings <glex_no-spam@qwest-spam-no.invalid>
    Re: question concerning pipes and large strings <mathematisch@gmail.com>
    Re: question concerning pipes and large strings <ben@morrow.me.uk>
    Re: question concerning pipes and large strings <rweikusat@mssgmbh.com>
    Re: question concerning pipes and large strings <rweikusat@mssgmbh.com>
    Re: question concerning pipes and large strings <ben@morrow.me.uk>
    Re: Regular Expression <cal@example.invalid>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Tue, 19 Jun 2012 08:50:18 -0700 (PDT)
From: math math <mathematisch@gmail.com>
Subject: question concerning pipes and large strings
Message-Id: <04d58cc5-7643-4279-ae54-f1428235048a@googlegroups.com>

Hi,

I have a file with two tab delimited fields. First field is an ID, the
second field is a large string (up to hundreds of millions of
characters). The file may have many lines.

I would like to sort the file on the first (ID) field and after this
sorting, merge the second fields (i.e. remove the new lines), so that I
get a single line with many hundreds of lines that are in the same order
appended to each other as their alphabetically sorted IDs.

Is there a way to do that in PERL without reading the whole file into
memory?

Thanks


------------------------------

Date: Tue, 19 Jun 2012 11:47:28 -0500
From: "J. Gleixner" <glex_no-spam@qwest-spam-no.invalid>
Subject: Re: question concerning pipes and large strings
Message-Id: <4fe0ad20$0$75671$815e3792@news.qwest.net>

On 06/19/12 10:50, math math wrote:
> Hi,
>
> I have a file with two tab delimited fields. First field is an ID, the second field is a large string (up to hundreds of millions of characters). The file may have many lines.
>
> I would like to sort the file on the first (ID) field and after this sorting, merge the second fields (i.e. remove the new lines), so that I get a single line with many hundreds of lines that are in the same order appended to each other as their alphabetically sorted IDs.
>
> Is there a way to do that in PERL without reading the whole file into memory?
>
> Thanks


What have you tried?

Probably sorting it first would make it much easier:

man sort

These should help with the rest:

perldoc -f open
perldoc -f chomp



------------------------------

Date: Tue, 19 Jun 2012 08:54:23 -0700 (PDT)
From: math math <mathematisch@gmail.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <01ff22d1-9912-4be6-84ff-c4ac6c5d8863@googlegroups.com>

On Tuesday, June 19, 2012 4:50:18 PM UTC+1, math math wrote:
> Hi,
>
> I have a file with two tab delimited fields. First field is an ID, the
> second field is a large string (up to hundreds of millions of
> characters). The file may have many lines.
>
> I would like to sort the file on the first (ID) field and after this
> sorting, merge the second fields (i.e. remove the new lines), so that
> I get a single line with many hundreds of lines that are in the same
> order appended to each other as their alphabetically sorted IDs.
>
> Is there a way to do that in PERL without reading the whole file into
> memory?
>
> Thanks

So my first attempt was using a pipe:

open(my $fhd_pipe, "sort -d -k1,1 $my_input_file |" );
open(my $fhd_out, ">test.txt" ) or die $!;
while(<$fhd_pipe>) {
        chomp;
        my (undef, $sequence) = split("\t", $_);
        print $fhd_out $sequence;
 }
close $fhd_out;
close $fhd_pipe;


But with this approach, I cannot check the return value of the first "open"
command, which is quite annoying (what if something goes wrong?)
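
A minimal sketch of how both checks can be had, under a few assumptions:
the list form of a pipe open returns false when the child cannot be
started, and close on a pipe handle returns false when the command exits
with a non-zero status (the wait status is then in $?). The sub name
merge_sorted and its two file-name arguments are made up for this
illustration; the sort flags are the ones from the attempt above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not from the original post: write the second field
# of every line of $in_file, in sort(1) order, to $out_file.
sub merge_sorted {
    my ($in_file, $out_file) = @_;

    # List-form pipe open: no shell is involved, and a false return
    # value means the child could not be started.
    open(my $pipe, '-|', 'sort', '-d', '-k1,1', $in_file)
        or die "cannot start sort: $!";
    open(my $out, '>', $out_file)
        or die "cannot write $out_file: $!";

    while (my $line = <$pipe>) {
        chomp $line;
        my (undef, $sequence) = split(/\t/, $line, 2);
        print {$out} $sequence if defined $sequence;
    }

    close($out) or die "close $out_file: $!";

    # close() on a pipe waits for the child. If sort exited with a
    # non-zero status, close returns false, $! is 0 and $? holds the
    # wait status.
    close($pipe)
        or die $! ? "close pipe: $!" : "sort exited with status " . ($? >> 8);
}

merge_sorted(@ARGV) if @ARGV == 2;
```

This still reads one complete line into memory at a time, and sort(1)
itself has to process the whole file, so it only addresses the error
checking, not the memory question.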


------------------------------

Date: Tue, 19 Jun 2012 18:55:00 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: question concerning pipes and large strings
Message-Id: <k0r6b9-ken1.ln1@anubis.morrow.me.uk>


Quoth math math <mathematisch@gmail.com>:
> 
> I have a file with two tab delimited fields. First field is an ID, the
> second field is a large string (up to hundreds of millions of
> characters). The file may have many lines.

How large is the file altogether? A few hundred megabytes, or
substantially larger? It may well not be worth trying to avoid doing
this in memory.

> I would like to sort the file on the first (ID) field and after this
> sorting, merge the second fields (i.e. remove the new lines), so that
> I get a single line with many hundreds of lines that are in the same
> order appended to each other as their alphabetically sorted IDs.

I don't entirely understand what you mean here. If you began with a file
that looked like

    002 TWO
    001 ONE
    003 THREE

what would you expect to end up with? Perhaps you want

    ONE TWO THREE

with the IDs removed?

> Is there a way to do that in PERL without reading the whole file into
> memory? 

I see other people have recommended sort(1); I would *not* recommend
that, in this case. sort(1) will deal with large files by spilling out
to temporary files on disk, but there's no need for that here.

I would do this in two passes. Start by reading through the file a block
at a time, and finding all the ID fields. (I am assuming these are small
enough that you aren't worried about keeping the whole list in memory).

For each ID, remember the start and finish offset of the second field
(using tell, or by keeping a running count of where you are in the file,
whichever's easier). Put these in a hash keyed by ID.

Then, you can sort the list of IDs, and, for each ID, seek to the right
place in the old file and pull out the data item for the new file
directly.

Ben



------------------------------

Date: Tue, 19 Jun 2012 19:14:09 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <87k3z3np8u.fsf@sapphire.mobileactivedefense.com>

math math <mathematisch@gmail.com> writes:
> I have a file with two tab delimited fields. First field is an ID,
> the second field is a large string (up to hundreds of millions of
> characters). The file may have many lines.
>
> I would like to sort the file on the first (ID) field and after this
> sorting, merge the second fields (i.e. remove the new lines), so
> that I get a single line with many hundreds of lines that are in the
> same order appended to each other as their alphabetically sorted
> IDs.
>
> Is there a way to do that in PERL without reading the whole file into memory? 

Provided it is Ok to keep all IDs in memory, something like the code
below could be used. The basic idea is to parse the complete input
file once line-by-line and create an array of 'ID records'. Each ID
record is a reference to a two-element array. The first member is the
ID itself, the second is the (stream-) position where the data part of
this line starts (sub get_ids). This array is then sorted by
ID. Afterwards, the code goes through the sorted array, creating the
output lines by starting a new output line whenever a new ID occurs
and concatenating the data parts by seeking to the recorded position
associated with an ID record, reading 'the remainder of the input line' and
printing it (sub generate_output).

NB: For brevity, this omits all error handling. It is assumed that
invalid input lines don't occur. Replacing the read_id_data call with
inline code would be an obvious performance improvement. The tab
character separating the ID from the data part is treated as part of
the ID.

NB^2: A probably much faster way to do this would be to go through the
file line-by-line, use a hash to record 'seen' IDs, create a per-ID
output file whenever a new ID is encountered, write the data part of
each input line to the output file for its ID, add a trailing \n to
all output files once the input file has been completely processed and
merge the output files together in a final processing step.

------------------------
#!/usr/bin/perl
#

sub get_ids
{
    my ($in, $ids) = @_;
    my ($line, $the_id, $pos);

    $pos = tell($in);
    while ($line = <$in>) {
	$line =~ /^([^\t]+\t)/ and $the_id = $1;
	push(@$ids, [$the_id, $pos + length($the_id)]);

	$pos = tell($in);
    }
}

sub read_id_data
{
    my ($fh, $id) = @_;
    my $data;

    seek($fh, $id->[1], 0);
    $data = <$fh>;
    chomp($data);	# unlike chop, leaves an unterminated last line intact

    return $data;
}

sub generate_output
{
    my ($fh, $ids) = @_;
    my ($last_id, $n);

    $last_id = $ids->[0][0];
    print($last_id, read_id_data($fh, $ids->[0]));
    
    while (++$n < @$ids) {
	if ($ids->[$n][0] ne $last_id) {
	    $last_id  = $ids->[$n][0];
	    print("\n", $last_id);
	}

	print(read_id_data($fh, $ids->[$n]));
    }

    print("\n");
}
    
{
    my ($fh, @ids);

    open($fh, '<', $ARGV[0]) // die("open: $ARGV[0]: $!");
    
    get_ids($fh, \@ids);
    @ids = sort { $a->[0] cmp $b->[0] } @ids;
    generate_output($fh, \@ids);
}
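
The per-ID output file idea from the NB^2 paragraph could look roughly
like the sketch below. It is only an illustration under stated
assumptions: the sub name merge_by_id, the temporary directory layout and
the 64 KiB copy size are invented here, and the trailing \n is written
during the merge step instead of being appended to each per-ID file.
Unlike the script above, the tab is not kept as part of the ID but
written back when the output line is started.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: split $in_file into one temporary file per ID,
# then write "ID<TAB>concatenated data\n" per ID, in sorted ID order,
# to the file handle $out.
sub merge_by_id {
    my ($in_file, $out) = @_;
    my $dir = tempdir(CLEANUP => 1);
    my %file_for;
    my $n = 0;

    open(my $in, '<', $in_file) or die "open: $in_file: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($id, $data) = split(/\t/, $line, 2);
        next unless defined $data;          # skip lines without a tab
        $file_for{$id} //= sprintf('%s/%d', $dir, $n++);

        # Reopen in append mode for every line so the number of
        # simultaneously open handles stays bounded.
        open(my $part, '>>', $file_for{$id}) or die "open: $file_for{$id}: $!";
        print {$part} $data;
        close($part) or die "close: $file_for{$id}: $!";
    }
    close($in);

    for my $id (sort keys %file_for) {
        print {$out} $id, "\t";
        open(my $part, '<', $file_for{$id}) or die "open: $file_for{$id}: $!";
        local $/ = \65536;                  # copy in 64 KiB records
        while (defined(my $chunk = <$part>)) {
            print {$out} $chunk;
        }
        close($part);
        print {$out} "\n";
    }
}

merge_by_id($ARGV[0], \*STDOUT) if @ARGV;
```

The first pass still reads one input line at a time, so a single very
long line is held in memory while it is written to its per-ID file.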


------------------------------

Date: Tue, 19 Jun 2012 19:33:34 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <87fw9rnoch.fsf@sapphire.mobileactivedefense.com>

Ben Morrow <ben@morrow.me.uk> writes:
> Quoth math math <mathematisch@gmail.com>:
>> I have a file with two tab delimited fields. First field is an ID, the
>> second field is a large string (up to hundreds of millions of
>> characters). The file may have many lines.

[...]


>> I would like to sort the file on the first (ID) field and after this
>> sorting, merge the second fields (i.e. remove the new lines), so that
>> I get a single line with many hundreds of lines that are in the same
>> order appended to each other as their alphabetically sorted IDs.

[...]

> I would do this in two passes. Start by reading through the file a block
> at a time, and finding all the ID fields. (I am assuming these are small
> enough that you aren't worried about keeping the whole list in
> memory).

If the file is large and entries with identical ID are not somehow
clustered, this will result in a lot of (useless) I/O.

> For each ID, remember the start and finish offset of the second field
> (using tell, or by keeping a running count of where you are in the file,
> whichever's easier). Put these in a hash keyed by ID.
>
> Then, you can sort the list of IDs, and, for each ID, seek to the right
> place in the old file and pull out the data item for the new file
> directly.

Minus some simplifications, this is a much better idea than building
an ID array in the way I suggested (all caveats still apply).

-----------------------
#!/usr/bin/perl
#

sub get_ids
{
    my ($in, $ids) = @_;
    my ($line, $the_id, $pos);

    $pos = tell($in);
    while ($line = <$in>) {
	$line =~ /^([^\t]+\t)/ and $the_id = $1;
	push(@{$ids->{$the_id}}, $pos + length($the_id));

	$pos = tell($in);
    }
}

sub read_data_at
{
    my ($fh, $pos) = @_;
    my $data;

    seek($fh, $pos, 0);
    $data = <$fh>;
    chomp($data);	# unlike chop, leaves an unterminated last line intact

    return $data;
}
    
{
    my ($fh, %ids);

    open($fh, '<', $ARGV[0]) // die("open: $ARGV[0]: $!");
    
    get_ids($fh, \%ids);
    
    for (sort(keys(%ids))) {
	print($_);
	
	print(read_data_at($fh, $_)) for @{$ids{$_}};
	print("\n");
    }
}


------------------------------

Date: Tue, 19 Jun 2012 19:33:04 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: question concerning pipes and large strings
Message-Id: <08t6b9-qun1.ln1@anubis.morrow.me.uk>


Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> math math <mathematisch@gmail.com> writes:
> > I have a file with two tab delimited fields. First field is an ID,
> > the second field is a large string (up to hundreds of millions of
> > characters). The file may have many lines.
> 
> Provided it is Ok to keep all IDs in memory, something like the code
> below could be used. The basic idea is to parse the complete input
> file once line-by-line and create an array of 'ID records'.

This assumes it's OK to read one complete line into memory. Since a line
may be several hundred megabytes, it would probably be safer (though
somewhat more awkward) to read the file blockwise.

Ben
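
A rough sketch of what blockwise reading could look like, combining it
with the start/finish offsets suggested earlier in the thread. Everything
named here is an assumption made for the illustration: the block size,
the subs index_file, copy_range and merge_sorted, and the choice to drop
the IDs and emit one single output line. As in the other scripts in this
thread, invalid input lines (for instance lines without a tab) are
assumed not to occur.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $BLOCK = 1 << 16;    # assumed block size, tune as needed

# Pass 1: return { ID => [ [start, end], ... ] }, the byte ranges of the
# data fields (end is exclusive), holding only one block in memory.
sub index_file {
    my ($fh) = @_;
    my (%ranges, $buf);
    my ($id, $in_id, $start, $base) = ('', 1, 0, 0);

    while (my $got = read($fh, $buf, $BLOCK)) {
        my $p = 0;
        while ($p < $got) {
            if ($in_id) {
                # Collect ID bytes until the separating tab shows up.
                my $tab = index($buf, "\t", $p);
                if ($tab < 0) { $id .= substr($buf, $p); last }
                $id .= substr($buf, $p, $tab - $p);
                ($in_id, $start, $p) = (0, $base + $tab + 1, $tab + 1);
            }
            else {
                # Skip over data bytes until the end of the line.
                my $nl = index($buf, "\n", $p);
                last if $nl < 0;
                push @{ $ranges{$id} }, [$start, $base + $nl];
                ($in_id, $id, $p) = (1, '', $nl + 1);
            }
        }
        $base += $got;
    }
    # A last line without a trailing newline ends at end of file.
    push @{ $ranges{$id} }, [$start, $base] unless $in_id;
    return \%ranges;
}

# Pass 2 helper: copy the bytes from $from up to (not including) $to.
sub copy_range {
    my ($in, $out, $from, $to) = @_;
    seek($in, $from, 0) or die "seek: $!";
    while ($from < $to) {
        my $want = $to - $from < $BLOCK ? $to - $from : $BLOCK;
        my $got = read($in, my $buf, $want) or die "short read";
        print {$out} $buf;
        $from += $got;
    }
}

sub merge_sorted {
    my ($path, $out) = @_;
    open(my $in, '<:raw', $path) or die "open: $path: $!";
    my $ranges = index_file($in);
    for my $id (sort keys %$ranges) {
        copy_range($in, $out, @$_) for @{ $ranges->{$id} };
    }
    print {$out} "\n";
}

merge_sorted($ARGV[0], \*STDOUT) if @ARGV;
```

Only the IDs and two offsets per input line are kept in memory; the data
fields themselves are never held in more than one block at a time.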



------------------------------

Date: Tue, 19 Jun 2012 11:14:37 -0600
From: Cal Dershowitz <cal@example.invalid>
Subject: Re: Regular Expression
Message-Id: <KYCdnWfdubXgLn3SnZ2dnUVZ_vKdnZ2d@supernews.com>

On 06/17/2012 10:43 PM, Jürgen Exner wrote:
> Cal Dershowitz<cal@example.invalid>  wrote:
>> It's too hard for me, jue.  At the risk of sounding glib about valuable
>> information, if I really don't get the ? character in regex'es,
>
> Well, that is yet another totally different can of worms. And again it
> has different meaning depending upon where it is used.
>
> jue

The ? is just one example from a literature in which there's more that
most people don't know than know, but I can look up what any given
metacharacter means.

Harder for me is putting it together in a way that perl.exe can do 
something useful with.  So I find myself with a decent idea about how to 
solve my own problems, but I have to go take down some legal tender 
instead of work it out.

     print "winner is $winner\n";
     my ($int) =~ m/[^_]*(\d+)/;
     print "int is $int\n";

My point was gonna be that maybe one way to solve my question downthread 
was to use character classes, but I can't quite put it all together.
-- 
Cal
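
For what it is worth, the snippet above appears to go wrong because
"my ($int) =~ m/.../" binds the pattern to the freshly declared, still
undefined $int instead of assigning the captured digits to it. A small
sketch of the list-context form follows; the value of $winner is a
made-up stand-in, since the real one is not shown in the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input; the actual contents of $winner are not in the post.
my $winner = 'team_42';

# In list context a successful match returns its captured groups, which
# are then assigned to the list on the left-hand side.
my ($greedy) = $winner =~ m/[^_]*(\d+)/;   # [^_]* also swallows digits
my ($int)    = $winner =~ m/(\d+)/;        # first run of digits

print "greedy is $greedy\n";   # prints "greedy is 2"
print "int is $int\n";         # prints "int is 42"
```

The first pattern keeps only the last digit because the greedy [^_]*
matches digits as well and gives back just enough for \d+ to succeed.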


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3720
***************************************
