[32454] in Perl-Users-Digest


Perl-Users Digest, Issue: 3721 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Wed Jun 20 14:09:18 2012

Date: Wed, 20 Jun 2012 11:09:07 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Wed, 20 Jun 2012     Volume: 11 Number: 3721

Today's topics:
    Re: an effective script for grabbing and putting images <jimsgibson@gmail.com>
    Re: an effective script for grabbing and putting images <cal@example.invalid>
    Re: Making CGI.pm forget <xhoster@gmail.com>
    Re: Making CGI.pm forget <struebig@uni-mainz.de>
    Re: question concerning pipes and large strings <ben@morrow.me.uk>
    Re: question concerning pipes and large strings <xhoster@gmail.com>
    Re: question concerning pipes and large strings <rweikusat@mssgmbh.com>
    Re: question concerning pipes and large strings <mathematisch@gmail.com>
    Re: question concerning pipes and large strings <mathematisch@gmail.com>
    Re: question concerning pipes and large strings <rweikusat@mssgmbh.com>
    Re: question concerning pipes and large strings <mathematisch@gmail.com>
    Re: question concerning pipes and large strings <rweikusat@mssgmbh.com>
        RegEx III (WAS: an effective script for grabbing and pu <jurgenex@hotmail.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Wed, 20 Jun 2012 10:10:06 -0700
From: Jim Gibson <jimsgibson@gmail.com>
Subject: Re: an effective script for grabbing and putting images from or to a website
Message-Id: <200620121010063336%jimsgibson@gmail.com>

In article <TKydncSD_q2Z_XzSnZ2dnUVZ_uidnZ2d@supernews.com>, Cal
Dershowitz <cal@example.invalid> wrote:

> On 06/19/2012 12:29 PM, Ben Morrow wrote:
> >
> > Quoth Cal Dershowitz<cal@example.invalid>:
> 

> >
> > I suspect that you don't want the inner for loop at all; that is, you
> > want
> >
> >      for my $name (@files) {
> >          my ($ext) = $name =~ /([^.]*)$/;
> >
> >          my @matching = grep /\.$ext$/, @list;
> >          @matching = grep /image_\d+/, @matching;
> >          @matching = sort @matching;
> >
> >          my $winner = pop @matching;
> >      }
> 
> What is the precise role of the second dollar sign above in the first 
> grep call?

It anchors the match to the end of the string.
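
A minimal sketch of the difference the trailing dollar sign makes. The
file names are made up for illustration; only the two patterns come
from the thread.

```perl
use strict;
use warnings;

# Hypothetical file names, for illustration only.
my @list = ('image_1.png', 'image_2.png.bak', 'notes.png.txt', 'image_3.jpg');
my $ext  = 'png';

# Without the trailing $, ".png" may appear anywhere in the name.
my @unanchored = grep { /\.$ext/ } @list;

# With the trailing $, ".png" must be the last thing in the name.
my @anchored = grep { /\.$ext$/ } @list;

print "unanchored: @unanchored\n";
print "anchored:   @anchored\n";
```

Note that $ also matches just before a trailing newline, which is
harmless for file names that have already been chomped.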


> [snip]
> >
> >      @matching = map /image_(\d+)/, @matching;
> >
> > This will leave @matching containing just a list of numbers, so then you
> > can say
> >
> >      my $newnum = $winner + 1;
> >      my $newfile = "image_$newnum.$ext";
> >
> > to build a new filename.
> 
> This was all good till I hit 11.  See below.

[snip]
 
> [Lots of output:  I'll try to leave enough so that people can see my 
> problem when I hit ten in this script]

[snip]

> name is image_10.png

I notice that this file name does not include a path ... (see below)

> ext is png
> matching is 2 3 4 5 6 7 8 9 10
> newfile is image_10.png
> #  commenting on output HERE
> Cannot open Local file image_10.png: No such file or directory
>   at upload14.pl line 59
> put failed No such file or directory

There is no file image_10.png in the current directory. What is your
default directory when you execute the script?


> $ cat upload14.pl
> #!/usr/bin/perl -w
> use strict;
> use 5.010;
> use Net::FTP;
> my $domain   = '';
> my $username = '';
> my $password = '';
> my $ftp      = Net::FTP->new( $domain, Debug => 1, Passive => 1 )
>    or die "Can't connect: $@\n";
> $ftp->login( $username, $password ) or die "Couldn't login\n";
> $ftp->binary();
> 
> # get files from remote root that end in html:
> my @remote_files = $ftp->ls();
> # print "remote files are: @remote_files\n";
> my @matching = map /lh_(\d+)\.html/, @remote_files;
> print "matching is @matching\n";
> push( @matching, 1 );
> @matching = sort @matching;

By default, sort compares its arguments as strings, so '10' and '11'
sort before '2'. With 10 in the list, pop then returns 9 and the next
number comes out as 10 again, which is likely why your algorithm breaks
down at that point. Try this:

@matching = sort { $a <=> $b } @matching;

which will do a numerical sort.
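
A short sketch comparing the two sorts on the kind of list the script
builds. The list contents are taken from the "matching is" output plus
the pushed 1; the variable names are illustrative.

```perl
use strict;
use warnings;

my @nums = ( 2, 3, 4, 5, 6, 7, 8, 9, 10, 1 );

my @as_strings = sort @nums;                  # compares as strings
my @as_numbers = sort { $a <=> $b } @nums;    # compares as numbers

print "string sort:  @as_strings\n";    # 1 10 2 3 4 5 6 7 8 9
print "numeric sort: @as_numbers\n";    # 1 2 3 4 5 6 7 8 9 10

# pop takes the last element, so the "winner" differs:
print "string winner:  $as_strings[-1]\n";    # 9
print "numeric winner: $as_numbers[-1]\n";    # 10
```

If only the largest number is needed, max from the core List::Util
module would avoid the sort altogether.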

> my $winner    = pop @matching;
> my $newnum1   = $winner + 1;
> my $html_file = "lh_$newnum1.html";
> print "html file is  $html_file\n";
> 
> # create file for html stubouts
> open FH, ">$html_file";

Obligatory advice:

1. You should use lexical variables instead of globals for file handles.
2. You should use the three-argument version of open.
3. You should always check the return value of system calls.

open( my $fh, '>', $html_file )
    or die("Can't open $html_file for writing: $!");
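
Putting the three points together, a sketch of how the whole stub
writer might look. The scratch directory and the file name lh_1.html
are assumptions made so the example can run anywhere; the HTML lines
are the ones from the posted script.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# A scratch directory stands in for the real working directory.
my $dir       = tempdir( CLEANUP => 1 );
my $html_file = "$dir/lh_1.html";

open( my $fh, '>', $html_file )
    or die "Can't open $html_file for writing: $!";

# A heredoc replaces the run of single-line print statements.
print {$fh} <<'HTML';
<html>
<head>
<title>Lutherhaven Renovation</title>
</head>
<body bgcolor=white>
<h1>My First Heading</h1>
HTML

# close can fail too (for example on a full disk), so check it.
close($fh) or die "Can't close $html_file: $!";
```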

> print FH "<html>\n";
> print FH "<head>\n";
> print FH "<title>Lutherhaven Renovation</title>\n";
> print FH "</head>\n";
> print FH "<body bgcolor=white>\n";
> print FH "<h1>My First Heading</h1>\n";
> # more of this will be populated when I work the kinks out.
> close FH;
> $ftp->put($html_file) or die "put failed $@\n";
> 
> # get files from Desktop/images/
> my $path  = '/home/dan/Desktop/upload_luther/';

 ... (continued from above) but this path contains a directory path.

> my @files = <$path*>;
> 
> # get ls from remote image directory
> $ftp->cwd('/images/') or die "cwd failed $@\n";
> my @list = $ftp->ls();
> 
> # main control
> for my $name (@files) {
>      print "name is $name\n";

Why doesn't this file name contain a path?

>      my ($ext) = $name =~ /([^.]*)$/;
>      print "ext is $ext\n";
> 
>      @matching = map /image_(\d+)\.$ext$/, @list;
>      print "matching is @matching\n";
>      push( @matching, 1 );
>      @matching = sort @matching;

You also need a numerical sort here.

>      $winner   = pop @matching;
>      my $newnum    = $winner + 1;
>      my $new_file2 = "image_$newnum.$ext";
>      print "newfile is $new_file2\n";
>      $ftp->put( $name, $new_file2 ) or die "put failed $!\n";
>      push( @list, $new_file2 );
> 
> }

-- 
Jim Gibson


------------------------------

Date: Tue, 19 Jun 2012 23:30:42 -0600
From: Cal Dershowitz <cal@example.invalid>
Subject: Re: an effective script for grabbing and putting images from or to a website
Message-Id: <TKydncSD_q2Z_XzSnZ2dnUVZ_uidnZ2d@supernews.com>

On 06/19/2012 12:29 PM, Ben Morrow wrote:
>
> Quoth Cal Dershowitz<cal@example.invalid>:

> Regex captures always return exactly what was in the original string.
> This even applies to things like
>
>      "xXxxX" =~ /(x*)/i
>
> where you might expect the capture to be lower-cased.

I'm trying to get my head around this.  I read with as much free time as 
I have.  I had the Camel book at Arby's today, but somehow there's no 
substitute for "talking" about them.

> I don't think this can be what you mean. You are running over @list
> multiple times (the for loop goes over the list once, and then each time
> round the grep runs over the whole list again), and you aren't using
> $ext anywhere.

I think you're right here.  I think the mapping abrogates the necessity 
of the inner loop.
>
> I suspect that you don't want the inner for loop at all; that is, you
> want
>
>      for my $name (@files) {
>          my ($ext) = $name =~ /([^.]*)$/;
>
>          my @matching = grep /\.$ext$/, @list;
>          @matching = grep /image_\d+/, @matching;
>          @matching = sort @matching;
>
>          my $winner = pop @matching;
>      }

What is the precise role of the second dollar sign above in the first 
grep call?

[snip]
>
>      @matching = map /image_(\d+)/, @matching;
>
> This will leave @matching containing just a list of numbers, so then you
> can say
>
>      my $newnum = $winner + 1;
>      my $newfile = "image_$newnum.$ext";
>
> to build a new filename.

This was all good till I hit 11.  See below.
>
>> Also, when I make the call to $ftp->ls(); I get back the dot and the
>> double dot.  Is there a nifty way to weed those out like I did with the
>> statement before it?
>
> Not as such. They're not difficult to get rid of, though: your existing
> logic to match extensions will get rid of them already.

ok, so there's a lot of rough corners yet on this, but the whole 
scenario really seems to fall flat when I hit the number 10.

[Lots of output:  I'll try to leave enough so that people can see my 
problem when I hit ten in this script]
Net::FTP>>> Net::FTP(2.77)
 ...
Net::FTP=GLOB(0x8e91d28)<<< 226 Transfer complete
matching is 1 2 3 4 5 6 7 8 9 10
html file is  lh_10.html
 ...
name is /home/dan/Desktop/upload_luther/400px-Solar_System_size_to_scale.svg.png
ext is png
matching is 2 3 4 5 6 7 8 9 10
newfile is image_10.png
 ...
ext is jpg
matching is 1 2 3 4 5 6 7 8 9 10
newfile is image_10.jpg
[snip]
name is image_10.png
ext is png
matching is 2 3 4 5 6 7 8 9 10
newfile is image_10.png
#  commenting on output HERE
Cannot open Local file image_10.png: No such file or directory
  at upload14.pl line 59
put failed No such file or directory
$ cat upload14.pl
#!/usr/bin/perl -w
use strict;
use 5.010;
use Net::FTP;
my $domain   = '';
my $username = '';
my $password = '';
my $ftp      = Net::FTP->new( $domain, Debug => 1, Passive => 1 )
   or die "Can't connect: $@\n";
$ftp->login( $username, $password ) or die "Couldn't login\n";
$ftp->binary();

# get files from remote root that end in html:
my @remote_files = $ftp->ls();
# print "remote files are: @remote_files\n";
my @matching = map /lh_(\d+)\.html/, @remote_files;
print "matching is @matching\n";
push( @matching, 1 );
@matching = sort @matching;
my $winner    = pop @matching;
my $newnum1   = $winner + 1;
my $html_file = "lh_$newnum1.html";
print "html file is  $html_file\n";

# create file for html stubouts
open FH, ">$html_file";
print FH "<html>\n";
print FH "<head>\n";
print FH "<title>Lutherhaven Renovation</title>\n";
print FH "</head>\n";
print FH "<body bgcolor=white>\n";
print FH "<h1>My First Heading</h1>\n";
# more of this will be populated when I work the kinks out.
close FH;
$ftp->put($html_file) or die "put failed $@\n";

# get files from Desktop/images/
my $path  = '/home/dan/Desktop/upload_luther/';
my @files = <$path*>;

# get ls from remote image directory
$ftp->cwd('/images/') or die "cwd failed $@\n";
my @list = $ftp->ls();

# main control
for my $name (@files) {
     print "name is $name\n";
     my ($ext) = $name =~ /([^.]*)$/;
     print "ext is $ext\n";

     @matching = map /image_(\d+)\.$ext$/, @list;
     print "matching is @matching\n";
     push( @matching, 1 );
     @matching = sort @matching;
     $winner   = pop @matching;
     my $newnum    = $winner + 1;
     my $new_file2 = "image_$newnum.$ext";
     print "newfile is $new_file2\n";
     $ftp->put( $name, $new_file2 ) or die "put failed $!\n";
     push( @list, $new_file2 );

}

$

This isn't a work of art, but it's definitely a work of work.  I thought 
I might be home free until I hit ten.  What gives with that?

I can't for the life of me see why perl thinks it needs to find a local 
file named image_10.png at the HERE mark in the output.

Also, I think the appropriate html for an image includes its height and 
width.  I know that's something ImageMagick can report, but does someone 
know a slick way to get such data using perl syntax?

Thanks for your comment,
-- 
Cal



------------------------------

Date: Tue, 19 Jun 2012 21:31:18 -0700
From: Xho Jingleheimerschmidt <xhoster@gmail.com>
Subject: Re: Making CGI.pm forget
Message-Id: <4fe151d4$0$24189$ed362ca5@nr5-q3a.newsreader.com>

On 06/19/2012 08:51 AM, Bernie Cosell wrote:
> One aspect of CGI.pm that I've never liked is that it tries to 'remember'
> form variables from one page to the next --- and worse makes it hard to
> change a value, preferring to ignore your attempts to change the value and
> keep the old one unless you go to some bother.

Is reading the documentation "some bother"?

        -nosticky
            By default the CGI module implements a state-preserving
            behavior called "sticky" fields.  The way this works is
            that if you are regenerating a form, the methods that
            generate the form field values will interrogate param()
            to see if similarly-named parameters are present in the
            query string. If they find a like-named parameter, they
            will use it to set their default values.

            Sometimes this isn't what you want.  The -nosticky pragma
            prevents this behavior.  You can also selectively change
            the sticky behavior in each element that you generate.

END DOC

Xho


------------------------------

Date: Wed, 20 Jun 2012 11:25:59 +0200
From: "J. Strübig" <struebig@uni-mainz.de>
Subject: Re: Making CGI.pm forget
Message-Id: <jrs4v7$iqe$1@infosys-01.zdv.uni-mainz.de>



On 19.06.2012 17:51, Bernie Cosell wrote:
> One aspect of CGI.pm that I've never liked is that it tries to 'remember'
> form variables from one page to the next --- and worse makes it hard to
> change a value, 

That's not true. It's very easy to change this, and the docs say so clearly:

Another note: The default values that you specify for the forms are only
used the first time the script is invoked (when there is no query
string). On subsequent invocations of the script (when there is a query
string), the former values are used even if they are blank.

If you want to change the value of a field from its previous value, you
have two choices:

(1) call the param() method to set it.

(2) use the -override (alias -force) parameter (a new feature in version
2.15). This forces the default value to be used, regardless of the
previous value:

http://search.cpan.org/~markstos/CGI.pm-3.59/lib/CGI.pm#CREATING_FILL-OUT_FORMS:
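
The two choices quoted above can be sketched as follows. This is an
illustration under assumptions, not code from the thread: the field
name "colour" and its values are made up, and because CGI.pm is no
longer bundled with perl 5.22 and later, the sketch loads it at run
time and does nothing if it is missing.

```perl
use strict;
use warnings;

# CGI.pm is not bundled with perl 5.22 and later, so load it at run time
# and skip the demonstration if it is not installed.
my $have_cgi = eval { require CGI; 1 };

my ( $sticky, $forced, $via_param ) = ( '', '', '' );
if ($have_cgi) {
    # Simulate a resubmitted form whose field "colour" came back as "red".
    my $q = CGI->new('colour=red');

    # Default behaviour: the submitted value wins over -default.
    $sticky = $q->textfield( -name => 'colour', -default => 'blue' );

    # Choice (2): -override forces the default to be used.
    $forced = $q->textfield(
        -name     => 'colour',
        -default  => 'blue',
        -override => 1,
    );

    # Choice (1): set the value with param() before generating the field.
    $q->param( -name => 'colour', -value => 'green' );
    $via_param = $q->textfield( -name => 'colour', -default => 'blue' );

    print "$sticky\n$forced\n$via_param\n";
}
else {
    print "CGI.pm is not installed; nothing to demonstrate\n";
}
```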


------------------------------

Date: Tue, 19 Jun 2012 21:26:31 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: question concerning pipes and large strings
Message-Id: <ns37b9-clp1.ln1@anubis.morrow.me.uk>


Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> Ben Morrow <ben@morrow.me.uk> writes:
> > Quoth math math <mathematisch@gmail.com>:
> >> I have a file with two tab delimited fields. First field is an ID, the
> >> second field is a large string (up to hundreds of millions of
> >> characters). The file may have many lines.
> 
> [...]
> 
> 
> >> I would like to sort the file on the first (ID) field and after this
> >> sorting, merge the second fields (i.e. remove the new lines), so that
> >> I get a single line with many hundreds of lines that are in the same
> >> order appended to each other as their alphabetically sorted IDs.
> 
> [...]
> 
> > I would do this in two passes. Start by reading through the file a block
> > at a time, and finding all the ID fields. (I am assuming these are small
> > enough that you aren't worried about keeping the whole list in
> > memory).
> 
> If the file is large and entries with identical ID are not somehow
> clustered, this will result in a lot of (useless) I/O.

How so?

> > For each ID, remember the start and finish offset of the second field
> > (using tell, or by keeping a running count of where you are in the file,
> > whichever's easier). Put these in a hash keyed by ID.
> >
> > Then, you can sort the list of IDs, and, for each ID, seek to the right
> > place in the old file and pull out the data item for the new file
> > directly.
> 
> Minus some simplifications, this is a much better idea than building
> an ID array in the way I suggested (all caveats still apply).
> 
> -----------------------
> #!/usr/bin/perl
> #
> 
> sub get_ids
> {
>     my ($in, $ids) = @_;
>     my ($line, $the_id, $pos);
> 
>     $pos = tell($in);
>     while ($line = <$in>) {

You're still reading an entire line into memory.

> 	$line =~ /^([^\t]+\t)/ and $the_id = $1;
> 	push(@{$ids->{$the_id}}, $pos + length($the_id));

Ah, you are assuming IDs might be duplicated. I was assuming they were
unique, and just needed sorting. The OP will have to clarify this.

Ben



------------------------------

Date: Tue, 19 Jun 2012 21:42:40 -0700
From: Xho Jingleheimerschmidt <xhoster@gmail.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <4fe1547a$0$20184$ed362ca5@nr5-q3a.newsreader.com>

On 06/19/2012 08:54 AM, math math wrote:
> On Tuesday, June 19, 2012 4:50:18 PM UTC+1, math math wrote:
>> Hi,
>>
>> I have a file with two tab delimited fields. First field is an ID, the second field is a large string (up to hundreds of millions of characters). The file may have many lines.
>>
>> I would like to sort the file on the first (ID) field and after this sorting, merge the second fields (i.e. remove the new lines), so that I get a single line with many hundreds of lines that are in the same order appended to each other as their alphabetically sorted IDs.
>>
>> Is there a way to do that in PERL without reading the whole file into memory?
>>
>> Thanks
>
> So my first attempt was using a pipe:
>
> open(my $fhd_pipe, "sort -d -k1,1 $my_input_file |" );
> open(my $fhd_out, ">test.txt" ) or die $!;
> while(<$fhd_pipe>) {
>          chomp $sequence;
>          my (undef, $sequence) = split("\t", $_);
>          print $fhd_out $sequence;
>   }
> close $fhd_out;
> close $fhd_pipe;
>
>
> But with this approach, I cannot check the return value of the first "open" command

You can (and should) but it will only tell you if the shell failed to 
start, not if the sort failed to find the named input file.


>, which is quite annoying (what if something goes wrong?)

You can still check the return value of the "close $fhd_pipe", which 
should fail if an error in the sort occurred.  Also, either the sort or 
the shell should send a message to stderr.  Whether you are likely to 
notice that depends on unspecified details.
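
A sketch of that check, assuming a Unix-like system with sort(1) on the
PATH. The sample data, the temporary file and the sub name
sorted_payload are made up for illustration. The list form of the pipe
open is a swapped-in variation on the posted code: it bypasses the
shell, so the file name needs no quoting.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small tab-delimited input file (made-up data) to sort.
my ( $tmp, $input_file ) = tempfile( UNLINK => 1 );
print {$tmp} "002\tTWO\n001\tONE\n003\tTHREE\n";
close($tmp) or die "close $input_file: $!";

# Run sort(1) on $file and return the second fields joined together.
# Dies if the pipe cannot be opened or if sort reports failure.
sub sorted_payload {
    my ($file) = @_;

    open( my $pipe, '-|', 'sort', '-k1,1', $file )
        or die "Can't start sort: $!";

    my $out = '';
    while ( my $line = <$pipe> ) {
        chomp $line;
        my ( undef, $sequence ) = split /\t/, $line, 2;
        $out .= $sequence;
    }

    # close() on a pipe waits for the child; it returns false if the
    # child exited non-zero, with the wait status left in $?.
    close($pipe)
        or die( $! ? "Error closing sort pipe: $!"
                   : 'sort exited with status ' . ( $? >> 8 ) );
    return $out;
}

print sorted_payload($input_file), "\n";
```

Calling sorted_payload on a file that does not exist makes sort exit
with a non-zero status, so the close check dies instead of silently
producing empty output.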

Xho


------------------------------

Date: Wed, 20 Jun 2012 14:56:56 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <87wr32ccif.fsf@sapphire.mobileactivedefense.com>

Ben Morrow <ben@morrow.me.uk> writes:
> Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
>> Ben Morrow <ben@morrow.me.uk> writes:
>> > Quoth math math <mathematisch@gmail.com>:
>> >> I have a file with two tab delimited fields. First field is an ID, the
>> >> second field is a large string (up to hundreds of millions of
>> >> characters). The file may have many lines.
>> 
>> [...]
>> 
>> 
>> >> I would like to sort the file on the first (ID) field and after this
>> >> sorting, merge the second fields (i.e. remove the new lines), so that
>> >> I get a single line with many hundreds of lines that are in the same
>> >> order appended to each other as their alphabetically sorted IDs.
>> 
>> [...]
>> 
>> > I would do this in two passes. Start by reading through the file a block
>> > at a time, and finding all the ID fields. (I am assuming these are small
>> > enough that you aren't worried about keeping the whole list in
>> > memory).
>> 
>> If the file is large and entries with identical ID are not somehow
>> clustered, this will result in a lot of (useless) I/O.
>
> How so?

Because the complete contents of the file won't fit into the kernel
page cache and this means whenever a block of data needs to be read
which presently isn't in the page cache, one of the existing blocks
needs to be evicted and the data read 'from disk'. This can happen
numerous times when randomly seeking back and forth within the file.

The same phenomenon will occur at the perl buffering level except that
it will likely be much more severe (in terms of system-call overhead)
because the perl-level buffer will likely (this may be wrong) be much
smaller than the kernel page cache.

[...]

>> Minus some simplifications, this is a much better idea than building
>> an ID array in the way I suggested (all caveats still apply).
>> 
>> -----------------------
>> #!/usr/bin/perl
>> #
>> 
>> sub get_ids
>> {
>>     my ($in, $ids) = @_;
>>     my ($line, $the_id, $pos);
>> 
>>     $pos = tell($in);
>>     while ($line = <$in>) {
>
> You're still reading an entire line into memory.

Can you imagine that I know that? Actually, that just about everyone
reading your text will know that?

If you want an opinion on this: If a single line of input is too large
to be kept in memory, Perl is decidedly the wrong choice for solving
this problem. 


>> 	$line =~ /^([^\t]+\t)/ and $the_id = $1;
>> 	push(@{$ids->{$the_id}}, $pos + length($the_id));
>
> Ah, you are assuming IDs might be duplicated. I was assuming they were
> unique, and just needed sorting. The OP will have to clarify this.

The idea I got from the text of the OP was that he wanted to turn
multiple entries for a given ID into a single line while continuing to
have multiple entries for different IDs. After rereading his text, I
think that was probably wrong. A simple implementation of the
'concatenate everything in ID-sorted order' with a sensible I/O
strategy:

----------------
#!/usr/bin/perl
#

use Errno qw(EMFILE ENFILE);

{
    my ($in, $open, $id, %ids, @open, $input);

    while ($input = <STDIN>) {
	($id) = $input =~ /^([^\t]+)\t/;
	chomp($input);

	until (defined(open($out, '+>', $id))) {
	    die("open: $id: $!")
		unless ($! == EMFILE || $! == ENFILE) && @open;

	    $ids{pop(@open)} = undef;
	}
	
	print $out (substr($input, length($id) + 1));

	$ids{$id} = $out;
	push(@open, $id);
	$out = undef;
    }

    for (sort(keys(%ids))) {
	$in = $ids{$_};
	
	if ($in) {
	    seek($in, 0, 0);
	} else {
	    until (defined(open($in, '<', $_))) {
		die("open/2: $_: $!")
		    unless $! == EMFILE || $! == ENFILE;

		1 while $open = pop(@open) and !$ids{$open};
		$open || die("nothing to close");

		$ids{$open} = undef;
	    }
	}

	print(<$in>);
	$in = $ids{$_} = undef;
	unlink($_);
    }

    print("\n");
}


------------------------------

Date: Wed, 20 Jun 2012 07:08:50 -0700 (PDT)
From: math math <mathematisch@gmail.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <0766d8ed-da23-4c47-9b62-134dcdb7f3f8@googlegroups.com>

On Tuesday, June 19, 2012 6:55:00 PM UTC+1, Ben Morrow wrote:
> Quoth math math <mathematisch@gmail.com>:
> > 
> > I have a file with two tab delimited fields. First field is an ID, the
> > second field is a large string (up to hundreds of millions of
> > characters). The file may have many lines.
> 
> How large is the file altogether? A few hundred megabytes, or
> substantially larger? It may well not be worth trying to avoid doing
> this in memory.
> 
> > I would like to sort the file on the first (ID) field and after this
> > sorting, merge the second fields (i.e. remove the new lines), so that
> > I get a single line with many hundreds of lines that are in the same
> > order appended to each other as their alphabetically sorted IDs.
> 
> I don't entirely understand what you mean here. If you began with a file
> that looked like
> 
>     002 TWO
>     001 ONE
>     003 THREE
> 
> what would you expect to end up with? Perhaps you want
> 
>     ONE TWO THREE

That's correct.

> 
> with the IDs removed?
> 
> > Is there a way to do that in PERL without reading the whole file into
> > memory? 
> 
> I see other people have recommended sort(1); I would *not* recommend
> that, in this case. sort(1) will deal with large files by spilling out
> to temporary files on disk, but there's no need for that here.
> 

Would sort(1) still create large files if the sort field is only on the ID field (i.e. sort -k1,1)?  

The ID fields are very short, up to 10 chars.

> I would do this in two passes. Start by reading through the file a block
> at a time, and finding all the ID fields. (I am assuming these are small
> enough that you aren't worried about keeping the whole list in memory).
> 
> For each ID, remember the start and finish offset of the second field
> (using tell, or by keeping a running count of where you are in the file,
> whichever's easier). Put these in a hash keyed by ID.
> 
> Then, you can sort the list of IDs, and, for each ID, seek to the right
> place in the old file and pull out the data item for the new file
> directly.
> 
> Ben
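
The two-pass idea quoted above might be sketched like this. It is an
illustration under stated assumptions, not Ben's code: IDs are taken to
be unique, the sample data and the sub names index_payloads and
copy_sorted are made up, and the first pass uses readline for brevity,
which still holds one whole line in memory at a time. A block-wise scan
for tabs and newlines would be needed to avoid that.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Made-up sample data: "ID<TAB>payload" per line.
my ( $tmp, $input_file ) = tempfile( UNLINK => 1 );
print {$tmp} "002\tTWO\n001\tONE\n003\tTHREE\n";
close($tmp) or die "close $input_file: $!";

# Pass 1: record where each payload starts and how long it is.
sub index_payloads {
    my ($in) = @_;
    my %span;    # ID => [ start offset, length in bytes ]
    my $pos = tell($in);
    while ( my $line = <$in> ) {
        my $next = tell($in);
        chomp $line;
        my $tab = index( $line, "\t" );
        if ( $tab >= 0 ) {
            my $id = substr( $line, 0, $tab );
            $span{$id} = [ $pos + $tab + 1, length($line) - $tab - 1 ];
        }
        $pos = $next;
    }
    return \%span;
}

# Pass 2: visit the IDs in sorted order and copy each payload in chunks.
sub copy_sorted {
    my ( $in, $out, $span, $chunk ) = @_;
    for my $id ( sort keys %$span ) {
        my ( $start, $left ) = @{ $span->{$id} };
        seek( $in, $start, 0 ) or die "seek: $!";
        while ( $left > 0 ) {
            my $want = $left < $chunk ? $left : $chunk;
            my $got  = read( $in, my $buf, $want );
            die "read: $!" unless defined $got;
            last if $got == 0;
            print {$out} $buf;
            $left -= $got;
        }
    }
}

# :raw keeps offsets and lengths in bytes.
open( my $in, '<:raw', $input_file ) or die "Can't open $input_file: $!";
my $result = '';
open( my $out, '>', \$result ) or die "Can't open in-memory handle: $!";
copy_sorted( $in, $out, index_payloads($in), 4 );
close($out);
print "$result\n";
```

The in-memory output handle and the tiny chunk size of 4 are only there
to keep the demonstration small; a real run would write to a file and
use a much larger chunk.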



------------------------------

Date: Wed, 20 Jun 2012 07:12:05 -0700 (PDT)
From: math math <mathematisch@gmail.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <31b93e20-c043-4df2-ab80-488ec534f5e3@googlegroups.com>

On Wednesday, June 20, 2012 2:56:56 PM UTC+1, Rainer Weikusat wrote:
> Ben Morrow <ben@morrow.me.uk> writes:
> > Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> >> Ben Morrow <ben@morrow.me.uk> writes:
> >> > Quoth math math <mathematisch@gmail.com>:
> >> >> I have a file with two tab delimited fields. First field is an ID, the
> >> >> second field is a large string (up to hundreds of millions of
> >> >> characters). The file may have many lines.
> >> 
> >> [...]
> >> 
> >> 
> >> >> I would like to sort the file on the first (ID) field and after this
> >> >> sorting, merge the second fields (i.e. remove the new lines), so that
> >> >> I get a single line with many hundreds of lines that are in the same
> >> >> order appended to each other as their alphabetically sorted IDs.
> >> 
> >> [...]
> >> 
> >> > I would do this in two passes. Start by reading through the file a block
> >> > at a time, and finding all the ID fields. (I am assuming these are small
> >> > enough that you aren't worried about keeping the whole list in
> >> > memory).
> >> 
> >> If the file is large and entries with identical ID are not somehow
> >> clustered, this will result in a lot of (useless) I/O.
> >
> > How so?
> 
> Because the complete contents of the file won't fit into the kernel
> page cache and this means whenever a block of data needs to be read
> which presently isn't in the page cache, one of the existing blocks
> needs to be evicted and the data read 'from disk'. This can happen
> numerous times when randomly seeking back and forth within the file.
> 
> The same phenomenon will occur at the perl buffering level except that
> it will likely be much more severe (in terms of system-call overhead)
> because the perl-level buffer will likely (this may be wrong) be much
> smaller than the kernel page cache.
> 
> [...]
> 
> >> Minus some simplifications, this is a much better idea than building
> >> an ID array in the way I suggested (all caveats still apply).
> >> 
> >> -----------------------
> >> #!/usr/bin/perl
> >> #
> >> 
> >> sub get_ids
> >> {
> >>     my ($in, $ids) = @_;
> >>     my ($line, $the_id, $pos);
> >> 
> >>     $pos = tell($in);
> >>     while ($line = <$in>) {
> >
> > You're still reading an entire line into memory.
> 
> Can you imagine that I know that? Actually, that just about everyone
> reading your text will know that?
> 
> If you want an opinion on this: If a single line of input is too large
> to be kept in memory, Perl is decidedly the wrong choice for solving
> this problem. 
> 
> 
> >> 	$line =~ /^([^\t]+\t)/ and $the_id = $1;
> >> 	push(@{$ids->{$the_id}}, $pos + length($the_id));
> >
> > Ah, you are assuming IDs might be duplicated. I was assuming they were
> > unique, and just needed sorting. The OP will have to clarify this.
> 
> The idea I got from the text of the OP was that he wanted to turn
> multiple entries for a given ID into a single line while continuing to
> have multiple entries for different IDs. After rereading his text, I
> think that was probably wrong. A simple implementation of the
> 'concatenate everything in ID-sorted order' with a sensible I/O
> strategy:
> 

Hm, I don't understand right away what's really going on below in the script, I will try to decipher it.

> ----------------
> #!/usr/bin/perl
> #
> 
> use Errno qw(EMFILE ENFILE);
> 
> {
>     my ($in, $open, $id, %ids, @open, $input);
> 
>     while ($input = <STDIN>) {
> 	($id) = $input =~ /^([^\t]+)\t/;
> 	chomp($input);
> 
> 	until (defined(open($out, '+>', $id))) {
> 	    die("open: $id: $!")
> 		unless ($! == EMFILE || $! == ENFILE) && @open;
> 
> 	    $ids{pop(@open)} = undef;
> 	}
> 	
> 	print $out (substr($input, length($id) + 1));
> 
> 	$ids{$id} = $out;
> 	push(@open, $id);
> 	$out = undef;
>     }
> 
>     for (sort(keys(%ids))) {
> 	$in = $ids{$_};
> 	
> 	if ($in) {
> 	    seek($in, 0, 0);
> 	} else {
> 	    until (defined(open($in, '<', $_))) {
> 		die("open/2: $_: $!")
> 		    unless $! == EMFILE || $! == ENFILE;
> 
> 		1 while $open = pop(@open) and !$ids{$open};
> 		$open || die("nothing to close");
> 
> 		$ids{$open} = undef;
> 	    }
> 	}
> 
> 	print(<$in>);
> 	$in = $ids{$_} = undef;
> 	unlink($_);
>     }
> 
>     print("\n");
> }



------------------------------

Date: Wed, 20 Jun 2012 15:23:57 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <87k3z2cb9e.fsf@sapphire.mobileactivedefense.com>

math math <mathematisch@gmail.com> writes:

[...]

>> A simple implementation of the
>> 'concatenate everything in ID-sorted order' with a sensible I/O
>> strategy:
>> 
>
> Hm, I don't understand right away what's really going on below in the script, I will try to decipher it.

That's what I think to be the best idea here (this may be wrong): It
splits the input file into files each containing the data part of a
single line while recording the IDs. Afterwards, it sorts the IDs and
concatenates the temporary files into an output file in that order.
There's some additional logic in order to keep as many of the
temporary files open until the output stage as the OS supports.


------------------------------

Date: Wed, 20 Jun 2012 07:17:23 -0700 (PDT)
From: math math <mathematisch@gmail.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <6f8ac5de-e3ae-4b99-8c2e-bdb3488e4bcd@googlegroups.com>

On Tuesday, June 19, 2012 5:47:28 PM UTC+1, J. Gleixner wrote:
> On 06/19/12 10:50, math math wrote:
> > Hi,
> >
> > I have a file with two tab delimited fields. First field is an ID, the
> > second field is a large string (up to hundreds of millions of
> > characters). The file may have many lines.
> >
> > I would like to sort the file on the first (ID) field and after this
> > sorting, merge the second fields (i.e. remove the new lines), so that
> > I get a single line with many hundreds of lines that are in the same
> > order appended to each other as their alphabetically sorted IDs.
> >
> > Is there a way to do that in PERL without reading the whole file into
> > memory?
> >
> > Thanks
>
>
> What have you tried?
>

Sorry, I should have pasted my trial code first. I did so right after my
initial post, but it got here a bit late. Please see my second post for
details (there is an ordering mistake in the snippet: "chomp" should be
below the line with the "split").

> Probably sorting it first would make it much easier:
>
> man sort
>
Indeed, I tried sort first, it works, it is more of a scalability question
really.

> These should help with the rest:
>
> perldoc -f open
> perldoc -f chomp

Thanks.


------------------------------

Date: Wed, 20 Jun 2012 16:29:56 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <87395qc87f.fsf@sapphire.mobileactivedefense.com>

math math <mathematisch@gmail.com> writes:
>> On 06/19/12 10:50, math math wrote:
>> > Hi,
>> >
>>> I have a file with two tab delimited fields. First field is an
>>> ID, the second field is a large string (up to hundreds of
>>> millions of characters). The file may have many lines.
>>>
>>> I would like to sort the file on the first (ID) field and after
>>> this sorting, merge the second fields (i.e. remove the new lines),
>>> so that I get a single line with many hundreds of lines that are
>>> in the same order appended to each other as their alphabetically
>>> sorted IDs.

[...]

>> Probably sorting it first would make it much easier:
>> 
>> man sort
>> 
> Indeed, I tried sort first, it works, it is more of a scalability
> question really.

This is a really bad idea because sort will reorder the complete input
lines, including the data part, possibly/probably multiple times for
each input line, and this means a lot of copying of data which doesn't
need to be copied since only the IDs are supposed to be sorted.


------------------------------

Date: Tue, 19 Jun 2012 17:51:58 -0700
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: RegEx III (WAS: an effective script for grabbing and putting images from or to a website)
Message-Id: <6c72u71roq85b3h2mhihknk9jl1brhhnqp@4ax.com>

Cal Dershowitz <cal@example.invalid> wrote:
>One thing I didn't appreciate in such character 
>classes was that the order was being preserved, so you rely on it not 
>coming back "gpj."

You are thinking about REs the wrong way. It is not the class or the RE
that is "coming back". REs match (part of) a string. And the capture
always contains that part of the original string that was matched, no
matter how or which RE matched it. 
Therefore your mental model of "the character class is returning
something" is very misleading and you should dump it asap.
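
A small sketch of the point. The first match is Ben Morrow's example
quoted earlier in the thread; the second uses a made-up character class
and file name to show that the capture is the matched slice of the
input, in the input's own order.

```perl
use strict;
use warnings;

# Case-insensitive match: the capture keeps the original casing.
my ($capture) = 'xXxxX' =~ /(x*)/i;
print "capture: $capture\n";    # xXxxX

# A character class says which characters may match, not their order;
# the order comes from the string being matched.
my ($ext) = 'photo.jpg' =~ /\.([gjp]+)$/;
print "ext: $ext\n";            # jpg
```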

jue


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3721
***************************************

