[32455] in Perl-Users-Digest


Perl-Users Digest, Issue: 3722 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Jun 21 14:09:21 2012

Date: Thu, 21 Jun 2012 11:09:06 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 21 Jun 2012     Volume: 11 Number: 3722

Today's topics:
    Re: an effective script for grabbing and putting images <ben@morrow.me.uk>
        Losing $1 and $2 variables in my search and replace exp laredotornado@zipmail.com
    Re: Losing $1 and $2 variables in my search and replace <rweikusat@mssgmbh.com>
    Re: question concerning pipes and large strings <ben@morrow.me.uk>
    Re: question concerning pipes and large strings <ben@morrow.me.uk>
    Re: question concerning pipes and large strings <ben@morrow.me.uk>
    Re: question concerning pipes and large strings <rweikusat@mssgmbh.com>
    Re: question concerning pipes and large strings <rweikusat@mssgmbh.com>
    Re: question concerning pipes and large strings <ben@morrow.me.uk>
    Re: question concerning pipes and large strings <rweikusat@mssgmbh.com>
    Re: question concerning pipes and large strings <rweikusat@mssgmbh.com>
    Re: question concerning pipes and large strings <rweikusat@mssgmbh.com>
    Re: question concerning pipes and large strings <m@rtij.nl.invlalid>
    Re: question concerning pipes and large strings <rweikusat@mssgmbh.com>
    Re: question concerning pipes and large strings <rweikusat@mssgmbh.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Wed, 20 Jun 2012 19:40:53 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: an effective script for grabbing and putting images from or to a website
Message-Id: <l2i9b9-p8m2.ln1@anubis.morrow.me.uk>


Quoth Cal Dershowitz <cal@example.invalid>:
> On 06/19/2012 12:29 PM, Ben Morrow wrote:
> >
> >      for my $name (@files) {
> >          my ($ext) = $name =~ /([^.]*)$/;
> >
> >          my @matching = grep /\.$ext$/, @list;
> >          @matching = grep /image_\d+/, @matching;
> >          @matching = sort @matching;
> >
> >          my $winner = pop @matching;
> >      }
> 
> What is the precise roll of the second dollar sign above in the first 
> grep call?

(role? (rôlé?))

The same as the dollar in the first pattern match: it ensures the
pattern only matches at the end of the string. It is in principle a
little confusing that dollar is used for both end-of-string and to
introduce a variable, but in practice perl almost never misinterprets
you.

> >      @matching = map /image_(\d+)/, @matching;
> >
> > This will leave @matching containing just a list of numbers, so then you
> > can say
> >
> >      my $newnum = $winner + 1;
> >      my $newfile = "image_$newnum.$ext";
> >
> > to build a new filename.
> 
> This was all good till I hit 11.  See below.
<snip>
> name is image_10.png
> ext is png
> matching is 2 3 4 5 6 7 8 9 10
> newfile is image_10.png
> #  commenting on output HERE
> Cannot open Local file image_10.png: No such file or directory
>   at upload14.pl line 59
> put failed No such file or directory

Does image_10.png exist locally? What happens if you print out @files
before the loop? I can't see any reason for it to be in the list if it
doesn't exist locally, either.
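
One possible cause of the image_10.png collision, offered as a guess since
only part of the output is shown: a plain sort compares strings, so 10
sorts before 2 and pop returns 9. Below is a minimal sketch of a numeric
sort. The helper name next_number is made up for illustration and is not
part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the next free image number from a list of
# numbers already in use. The comparison is numeric (<=>), not the
# default string comparison. The extra 0 covers an empty list.
sub next_number {
    my @used   = @_;
    my @sorted = sort { $a <=> $b } (@used, 0);
    return $sorted[-1] + 1;
}

# With the default sort, 1 .. 10 orders as 1, 10, 2, ..., 9, so the
# last element is 9 and the "new" number would collide with 10.
my @string_sorted = sort(1 .. 10);
print "string sort ends with $string_sorted[-1]\n";
print "next number is ", next_number(2 .. 10), "\n";
```

If this is the problem, changing the sort in the loop to
sort { $a <=> $b } should be enough.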

<snip>
> # get files from Desktop/images/
> my $path  = '/home/dan/Desktop/upload_luther/';
> my @files = <$path*>;
> 
> # get ls from remote image directory
> $ftp->cwd('/images/') or die "cwd failed $@\n";
> my @list = $ftp->ls();
> 
> # main control
> for my $name (@files) {
>      print "name is $name\n";
>      my ($ext) = $name =~ /([^.]*)$/;
>      print "ext is $ext\n";
> 
>      @matching = map /image_(\d+)\.$ext$/, @list;
>      print "matching is @matching\n";
>      push( @matching, 1 );
>      @matching = sort @matching;
>      $winner   = pop @matching;
>      my $newnum    = $winner + 1;
>      my $new_file2 = "image_$newnum.$ext";
>      print "newfile is $new_file2\n";
>      $ftp->put( $name, $new_file2 ) or die "put failed $!\n";
>      push( @list, $new_file2 );
> 
> }
<snip>
> 
> Also, I think the appropriate html for an image includes its height and 
> width.  I know that's a trick the Imagemagick does, but does someone 
> know a slick way to get such data using perl syntax?

I would use Image::Size, which you will need to install from CPAN. If
you haven't got set up with CPAN yet, I would recommend using cpanminus
rather than CPAN.pm; download the file http://cpanmin.us, save it
somewhere as 'cpanm', and then run

    perl cpanm App::cpanminus

You can then delete the downloaded copy. Assuming you are using your
system perl, you will need to do this as root, so it can write to the
system perl library. Then run (also as root)

    cpanm Image::Size

which will install Image::Size and its dependencies.

(Obviously you need to think *VERY* *CAREFULLY* before running commands
suggested by some random person on Usenet as root. You may want to read
http://search.cpan.org/~miyagawa/App-cpanminus-1.5014/lib/App/cpanminus.pm
before you start. I take no responsibility if it breaks your system,
kills your cat, &c. &c.)

Alternatively, if you are using an OS with a package management system,
it may be better to install perl modules using that where possible. In
principle cpanm knows how to use local::lib to install modules under
your home directory, but I've never used that feature so I don't know
how well it works.

Ben



------------------------------

Date: Thu, 21 Jun 2012 08:49:23 -0700 (PDT)
From: laredotornado@zipmail.com
Subject: Losing $1 and $2 variables in my search and replace expression
Message-Id: <842ff129-3da5-4e74-be0e-7db5eef5b512@googlegroups.com>

Hi,

I'm using Perl 5.12.3 on Mac 10.7.4.  I have a file with lines that look like

AK=Alaska
AL-Alabama
 ...

and when I run this search and replace expression

perl -pi -e "s/(..)=(.*)/INSERT INTO cb_states (ABBREV, NAME) VALUES ('$1', '$2');/g" states.properties

The values of "$1" and "$2" result in empty strings in my file.  That is, the output in the file is

INSERT INTO cb_states (ABBREV, NAME) VALUES ('', '');
INSERT INTO cb_states (ABBREV, NAME) VALUES ('', '');

What is the correct way to write the command line expression above to properly insert the matched values?  Thanks,- Dave


------------------------------

Date: Thu, 21 Jun 2012 17:43:32 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Losing $1 and $2 variables in my search and replace expression
Message-Id: <87ehp8hayz.fsf@sapphire.mobileactivedefense.com>

laredotornado@zipmail.com writes:
> I'm using Perl 5.12.3 on Mac 10.7.4.  I have a file with lines that look like
>
> AK=Alaska
> AL-Alabama
> ...
>
> and when I run this search and replace expression
>
> perl -pi -e "s/(..)=(.*)/INSERT INTO cb_states (ABBREV, NAME) VALUES ('$1', '$2');/g" states.properties
>
> The values of "$1" and "$2" result in empty strings in my file.  That is, the output in the file is
>
> INSERT INTO cb_states (ABBREV, NAME) VALUES ('', '');
> INSERT INTO cb_states (ABBREV, NAME) VALUES ('', '');
>
> What is the correct way to write the command line expression above
> to properly insert the matched values?

Since the script is a double-quoted string, the shell will do variable
interpolations on it before passing it to perl, substituting its $1
and $2 into the expression. Since you want perl to do the
substitution, you need to stop the shell from doing so. One way:

"s/(..)=(.*)/INSERT INTO cb_states (ABBREV, NAME) VALUES ('\$1', '\$2');/g"

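Another way, sketched here as an assumption rather than a recommendation
from the thread, is to keep the substitution in a script file, where the
shell never sees $1 and $2 at all. The sub name to_insert is hypothetical.

```perl
use strict;
use warnings;

# Turn one "XX=Name" line into an INSERT statement. Lines that do not
# match the pattern (for example one using '-' instead of '=') are
# returned unchanged.
sub to_insert {
    my ($line) = @_;
    $line =~ s/(..)=(.*)/INSERT INTO cb_states (ABBREV, NAME) VALUES ('$1', '$2');/;
    return $line;
}

print to_insert("AK=Alaska"),  "\n";
print to_insert("AL-Alabama"), "\n";
```

Note that the second sample line uses '-' rather than '=', so the pattern
leaves it alone.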

------------------------------

Date: Wed, 20 Jun 2012 20:34:53 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: question concerning pipes and large strings
Message-Id: <t7l9b9-stm2.ln1@anubis.morrow.me.uk>


Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> Ben Morrow <ben@morrow.me.uk> writes:
> > Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> >> Ben Morrow <ben@morrow.me.uk> writes:
> >> 
> >> > I would do this in two passes. Start by reading through the file a block
> >> > at a time, and finding all the ID fields. (I am assuming these are small
> >> > enough that you aren't worried about keeping the whole list in
> >> > memory).
> >> 
> >> If the file is large and entries with identical ID are not somehow
> >> clustered, this will result in a lot of (useless) I/O.
> >
> > How so?
> 
> Because the complete contents of the file won't fit into the kernel
> page cache and this means whenever a block of data needs to be read
> which presently isn't in the page cache, one of the existing blocks
> needs to be evicted and the data read 'from disk'. This can happen
> numerous times when randomly seeking back and forth within the file.

The part you quoted reads through the entire file exactly once, in
order. There is no seeking. I agree there will be a lot of seeking
later, once the list of IDs is sorted, but I don't see that can be
avoided unless the file fits into memory.

> The same phenomenon will occur at the perl buffering level except that
> it will likely be much more severe (in terms of system-call overhead)
> because the perl-level buffer will likely (this may be wrong) be much
> smaller than the kernel page cache.

That depends on the filesystem and OS, of course, but yes, in general I
would assume this would be the case. I'm not sure it matters: in a
process as IO-bound as this, the system call overhead is likely to be
irrelevant.

> >> #!/usr/bin/perl
> >> #
> >> 
> >> sub get_ids
> >> {
> >>     my ($in, $ids) = @_;
> >>     my ($line, $the_id, $pos);
> >> 
> >>     $pos = tell($in);
> >>     while ($line = <$in>) {
> >
> > You're still reading an entire line into memory.
> 
> Can you imagine that I know that? Actually, that just about everyone
> reading your text will know that?
> 
> If you want an opinion on this: If a single line of input is too large
> to be kept in memory, Perl is decidedly the wrong choice for solving
> this problem. 

I strongly disagree. IMHO a file with lines that long simply isn't
meaningfully a 'text file' any more, and so needs to be handled like a
binary file: read in blocks, and remember byte positions. Perl is
perfectly capable of handling binary data.

A simple loop that reads a fixed-size block at a time, searches that
block for tab and newline characters, and remembers their positions is
pretty straightforward to write; the only difficult bit is dealing with
the case where an ID crosses a block boundary.

    local $/ = \10240;
    local $_ = "\n";
    my $pos = -1;
    my %id;

    while (my $line = <$file>) {
        $_ .= $line;

        # There may be off-by-one errors here; I haven't tested it.
        my $last;
        while (/\n([^\t]+)\t/gc) {
            $last and $id{$last}[1] = $pos + $-[1] - 2;
            $id{$1}[0] = $pos + $+[1] + 1;
            $last = $1;
        }
        $last and $id{$last}[1] = $pos + pos;

        s/.*(\G)[^\n]*//s;
        $pos += $+[0];
    }

That's not very neat: I don't really like $last, but I haven't thought
terribly hard about how I might get rid of it. It also assumes the file
format is strictly correct, with exactly one tab per line and a newline
at the end of the file.
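
For anyone unfamiliar with the record-size form of $/ used above, here is
a small self-contained sketch of just that part. It reads from an
in-memory filehandle for demonstration only, and block_lengths is a name
invented for this example.

```perl
use strict;
use warnings;

# Read a filehandle in fixed-size records and return the length of each
# record. Setting $/ to a reference to an integer makes readline return
# at most that many bytes per call instead of a line.
sub block_lengths {
    my ($fh, $size) = @_;
    local $/ = \$size;
    my @lengths;
    while (defined(my $block = <$fh>)) {
        push @lengths, length $block;
    }
    return @lengths;
}

my $data = "0123456789";
open(my $fh, '<', \$data) or die "open: $!";
print join(',', block_lengths($fh, 4)), "\n";
```

The last record is simply shorter when the input is not a multiple of the
block size.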

> >> 	$line =~ /^([^\t]+\t)/ and $the_id = $1;
> >> 	push(@{$ids->{$the_id}}, $pos + length($the_id));
> >
> > Ah, you are assuming IDs might be duplicated. I was assuming they were
> > unique, and just needed sorting. The OP will have to clarify this.
> 
> The idea I got from the text of the OP was that he wanted to turn
> multiple entries for a given ID into a single line while continuing to
> have multiple entries for different IDs. After rereading his text, I
> think that was probably wrong. A simple implementation of the
> 'concatenate everything in ID-sorted order' with a sensible I/O
> strategy:
> 
> ----------------
> #!/usr/bin/perl
> #
> 
> use Errno qw(EMFILE ENFILE);
> 
> {
>     my ($in, $open, $id, %ids, @open, $input);
> 
>     while ($input = <STDIN>) {
> 	($id) = $input =~ /^([^\t]+)\t/;
> 	chop($input);
> 
> 	until (defined(open($out, '+>', $id))) {
> 	    die("open: $in: $!")
> 		unless ($! == EMFILE || $! == ENFILE) && @open;

Copying the data out to temporary files and then reading it back in
again is *bound* to be more IO than seeking in the original file.

Ben



------------------------------

Date: Wed, 20 Jun 2012 20:41:56 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: question concerning pipes and large strings
Message-Id: <4ll9b9-stm2.ln1@anubis.morrow.me.uk>


Quoth math math <mathematisch@gmail.com>:
> On Tuesday, June 19, 2012 6:55:00 PM UTC+1, Ben Morrow wrote:
> > 
> > I see other people have recommended sort(1); I would *not* recommend
> > that, in this case. sort(1) will deal with large files by spilling out
> > to temporary files on disk, but there's no need for that here.
> 
> Would sort(1) still create large files if the sort field is only on the
> ID field (i.e. sort -k1,1)?  

I believe so, yes, but you would have to try it and/or read the source
for your sort(1) to be sure.

Ben



------------------------------

Date: Wed, 20 Jun 2012 21:11:01 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: question concerning pipes and large strings
Message-Id: <lbn9b9-r5n2.ln1@anubis.morrow.me.uk>


Quoth Ben Morrow <ben@morrow.me.uk>:
> 
> The part you quoted reads through the entire file exactly once, in
> order. There is no seeking. I agree there will be a lot of seeking
> later, once the list of IDs is sorted, but I don't see that can be
> avoided unless the file fits into memory.

Correction: if you were really determined to avoid excess IO, it would
be possible to work out where in the destination file each source line
should go, by summing the lengths of the preceding lines, and then read
through the source file sequentially again, seeking in the destination
file instead.
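
A rough sketch of the offset calculation, assuming IDs are unique and
that the first pass has already recorded each ID with the length of its
data. The [id, length] record layout and the name dest_offsets are
assumptions made for this example.

```perl
use strict;
use warnings;

# Each record is [id, data_length]. Returns a hash reference mapping id
# to the byte offset where its data should start in the destination
# file: the sum of the data lengths of all records that sort before it.
sub dest_offsets {
    my @records = @_;
    my %offset;
    my $total = 0;
    for my $rec (sort { $a->[0] cmp $b->[0] } @records) {
        $offset{ $rec->[0] } = $total;
        $total += $rec->[1];
    }
    return \%offset;
}

my $off = dest_offsets(['b', 5], ['a', 3], ['c', 2]);
print "$_ => $off->{$_}\n" for sort keys %$off;
```

The second pass would then read the source in order and seek to
$off->{$id} in the destination before writing each data field.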

Ben



------------------------------

Date: Wed, 20 Jun 2012 21:20:39 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <87k3z120rs.fsf@sapphire.mobileactivedefense.com>

Ben Morrow <ben@morrow.me.uk> writes:
> Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
>> Ben Morrow <ben@morrow.me.uk> writes:
>> > Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
>> >> Ben Morrow <ben@morrow.me.uk> writes:
>> >>> I would do this in two passes. Start by reading through the file a block
>> >>> at a time, and finding all the ID fields. (I am assuming these are small
>> >>> enough that you aren't worried about keeping the whole list in
>> >>> memory).
>> >> 
>> >> If the file is large and entries with identical ID are not somehow
>> >> clustered, this will result in a lot of (useless) I/O.
>> >
>> > How so?
>> 
>> Because the complete contents of the file won't fit into the kernel
>> page cache and this means whenever a block of data needs to be read
>> which presently isn't in the page cache, one of the existing blocks
>> needs to be evicted and the data read 'from disk'. This can happen
>> numerous times when randomly seeking back and forth within the file.
>
> The part you quoted reads through the entire file exactly once, in
> order. There is no seeking. I agree there will be a lot of seeking
> later,

Could you perhaps enlighten me what the point of this remark is
supposed to be? Since I wrote about seeking and seeking is a necessary
part of the complete solution, it should be blatantly obvious that I
was referring to that.

[...]

>> The same phenomenon will occur at the perl buffering level except that
>> it will likely be much more severe (in terms of system-call overhead)
>> because the perl-level buffer will likely (this may be wrong) be much
>> smaller than the kernel page cache.
>
> That depends on the filesystem and OS, of course, but yes, in general I
> would assume this would be the case. I'm not sure it matters: in a
> process as IO-bound as this, the system call overhead is likely to be
> irrelevant.

Let's assume for the sake of example that 'all of the I/O' takes 30
days. The kilotons of useless system calls take an hour. This is about
0.14% of the first time span. Would you want to wait an additional
hour for this task to complete just because it is only 0.14%? Or would
you rather want to avoid this hour as well? Can you imagine that
someone who just wants to use the code could want to avoid this hour?
Or that someone could want to do something more useful with this hour
of CPU time than 'burn it with useless system calls'?
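
The 0.14% figure is just one hour as a fraction of 30 days. Here is a
tiny check of the arithmetic, with overhead_percent being a name made up
for the example:

```perl
use strict;
use warnings;

# Overhead as a percentage of the total run time, with the overhead
# given in hours and the total in days.
sub overhead_percent {
    my ($overhead_hours, $total_days) = @_;
    return 100 * $overhead_hours / ($total_days * 24);
}

printf "%.2f%%\n", overhead_percent(1, 30);
```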

>> >> #!/usr/bin/perl
>> >> #
>> >> 
>> >> sub get_ids
>> >> {
>> >>     my ($in, $ids) = @_;
>> >>     my ($line, $the_id, $pos);
>> >> 
>> >>     $pos = tell($in);
>> >>     while ($line = <$in>) {
>> >
>> > You're still reading an entire line into memory.
>> 
>> Can you imagine that I know that? Actually, that just about everyone
>> reading your text will know that?
>> 
>> If you want an opinion on this: If a single line of input is too large
>> to be kept in memory, Perl is decidedly the wrong choice for solving
>> this problem. 
>
> I strongly disagree. IMHO a file with lines that long simply isn't
> meaningfully a 'text file' any more, and so needs to be handled like a
> binary file: read in blocks, and remember byte positions. Perl is
> perfectly capable of handling binary data.

As opposed to something sensible such as memory-mapping the input
file (or that part of it which fits into the process address space),
the overhead is going to be grotesque and when 'low-level buffer
control' has to be done in any case, this overhead is not justified.

[...]

>> ----------------
>> #!/usr/bin/perl
>> #
>> 
>> use Errno qw(EMFILE ENFILE);
>> 
>> {
>>     my ($in, $open, $id, %ids, @open, $input);
>> 
>>     while ($input = <STDIN>) {
>> 	($id) = $input =~ /^([^\t]+)\t/;
>> 	chop($input);
>> 
>> 	until (defined(open($out, '+>', $id))) {
>> 	    die("open: $in: $!")
>> 		unless ($! == EMFILE || $! == ENFILE) && @open;
>
> Copying the data out to temporary files and then reading it back in
> again is *bound* to be more IO than seeking in the original file.

No, it is not bound to be more I/O (neither physical nor 'system call
I/O') because everything is done strictly sequentially and will thus
interact nicely with any intermediate buffering layers: There's never
a need to 'flush the buffer, read a different block into it, flush the
buffer again, reread the original block'. Whether this helps with the
issue at hand would be a different question: It certainly will if the
buffer is larger than any individual data item. This might not be the
case here and consequently, both approaches should be tested to
determine which one is better.


------------------------------

Date: Wed, 20 Jun 2012 23:38:17 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <87395p1uee.fsf@sapphire.mobileactivedefense.com>

Rainer Weikusat <rweikusat@mssgmbh.com> writes:

[...]

>>> If you want an opinion on this: If a single line of input is too large
>>> to be kept in memory, Perl is decidedly the wrong choice for solving
>>> this problem. 
>>
>> I strongly disagree. IMHO a file with lines that long simply isn't
>> meaningfully a 'text file' any more, and so needs to be handled like a
>> binary file: read in blocks, and remember byte positions. Perl is
>> perfectly capable of handling binary data.
>
> As opposed to something sensible such as memory-mapping the input
> file (or that part of it which fits into the process address space),
> the overhead is going to be grotesque

For sake of completeness and because I felt like doing it: Here's a
seek-based variant based on reading data in 'blocks' (of an arbitrary
size > 0) which actually works (according to my limited testing, still
without error handling).

----------------------------
#!/usr/bin/perl
#

use constant BLOCK =>	4096;

sub read_block
{
    my ($block, $rc);

    $rc = sysread($_[0], $block, $_[1] // BLOCK);
    $rc // die("sysread: $!");
    $_[1] && $rc != $_[1] && die("short read");

    return $rc ? $block : undef;
}

sub get_ids
{
    my ($in, $ids) = @_;
    my ($block, $id, $want, $bpos, $sbpos, $fpos);

    $want = "\t";
    while ($block = read_block($in)) {
	$sbpos = $bpos = 0;

	{
	    $bpos = index($block, $want, $sbpos);

	    if ($want eq "\t") {
		if ($bpos != -1) {
		    $id .= substr($block, $sbpos, $bpos - $sbpos);
		    push(@$ids, [$id, $fpos + ++$bpos]);
		    $id = '';

		    $want = "\n";
		    $sbpos = $bpos;
		    redo if $sbpos < length($block);
		} else {
		    $id .= substr($block, $sbpos);
		}

		last;
	    }

	    if ($want eq "\n") {
		last if $bpos == -1;

		push(@{$ids->[$#$ids]}, $fpos + $bpos);

		$want = "\t";
		$sbpos = $bpos + 1;
		redo if $sbpos < length($block);
	    }
	}
	
	$fpos += length($block);
    }	
}	

sub print_id_data
{
    my ($fh, $id) = @_;
    my ($blocks, $len);

    seek($fh, $id->[1], 0);
    $len = $id->[2] - $id->[1];

    $blocks = int($len / BLOCK);
    print(read_block($fh, BLOCK)) while ($blocks--);

    $len %= BLOCK;
    print(read_block($fh, $len)) if $len;
}
    
{
    my ($fh, @ids);

    open($fh, '<', $ARGV[0]) // die("open: $ARGV[0]: $!");
    
    get_ids($fh, \@ids);
    print_id_data($fh, $_) for sort { $a->[0] cmp $b->[0] } @ids;
}


------------------------------

Date: Wed, 20 Jun 2012 23:49:03 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: question concerning pipes and large strings
Message-Id: <vj0ab9-ci4.ln1@anubis.morrow.me.uk>


Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> Ben Morrow <ben@morrow.me.uk> writes:
> > Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> 
> >> The same phenomenon will occur at the perl buffering level except that
> >> it will likely be much more severe (in terms of system-call overhead)
> >> because the perl-level buffer will likely (this may be wrong) be much
> >> smaller than the kernel page cache.
> >
> > That depends on the filesystem and OS, of course, but yes, in general I
> > would assume this would be the case. I'm not sure it matters: in a
> > process as IO-bound as this, the system call overhead is likely to be
> > irrelevant.
> 
> Let's assume for the sake of example that 'all of the I/O' takes 30
> days. The kilotons of useless system calls take an hour.

'IO-bound' means that for most of its runtime, the process will be
sitting there doing exactly nothing, waiting for IO to complete. The
syscall overhead will eat into the time spent doing nothing, but will
(probably) not increase the total runtime at all.

And no, I wouldn't care if a process which ran for 30 days overran by an
hour. There are more important things to worry about; that level of
inefficiency is almost certainly in the noise in any case. (That is, I
would expect runtimes of 30 days to vary by more than an hour from run to
run, for no predictable reason.)

> > Copying the data out to temporary files and then reading it back in
> > again is *bound* to be more IO than seeking in the original file.
> 
> No, it is not bound to be more I/O (neither physical nor 'system call
> I/O') because everything is done strictly sequential and will thus
> interact nicely with any intermediate buffering layers: There's never
> a need to 'flush the buffer, read a different block into it, flush the
> buffer again, reread the original block'.

Copying a line into a second file means the system has to keep two
copies of that line in its buffer cache, one for each file. This greatly
increases the chance that lines will start to get thrown out of the
cache. No, you don't have to reread a block you've already read; you do,
however, have to read a textually-identical block from a different file,
which comes to the same thing. Not to mention that writes are generally
more expensive than reads, and you would be writing all the data out
twice rather than once.

Ben



------------------------------

Date: Thu, 21 Jun 2012 01:50:47 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <878vfhbi8o.fsf@sapphire.mobileactivedefense.com>

Ben Morrow <ben@morrow.me.uk> writes:
> Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
>> Ben Morrow <ben@morrow.me.uk> writes:
>> > Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
>> 
>> >> The same phenomenon will occur at the perl buffering level except that
>> >> it will likely be much more severe (in terms of system-call overhead)
>> >> because the perl-level buffer will likely (this may be wrong) be much
>> >> smaller than the kernel page cache.
>> >
>> > That depends on the filesystem and OS, of course, but yes, in general I
>> > would assume this would be the case. I'm not sure it matters: in a
>> > process as IO-bound as this, the system call overhead is likely to be
>> > irrelevant.
>> 
>> Let's assume for the sake of example that 'all of the I/O' takes 30
>> days. The kilotons of useless system calls take an hour.
>
> 'IO-bound' means that for most of its runtime, the process will be
> sitting there doing exactly nothing, waiting for IO to complete.

Really? Who would have guessed that ...


------------------------------

Date: Thu, 21 Jun 2012 02:43:48 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <87ehp9phgr.fsf@sapphire.mobileactivedefense.com>

Rainer Weikusat <rweikusat@mssgmbh.com> writes:

[...]

>     seek($fh, $id->[1], 0);

This should be sysseek (works here because no buffered input is ever
done on this filehandle).
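
A minimal sketch of the pairing, using a temporary file purely for
demonstration. The sub name read_at and the sample record are assumptions
for this example, not part of the posted script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read $len bytes starting at byte $offset using only unbuffered calls,
# so the position set by sysseek is the one sysread actually uses.
sub read_at {
    my ($fh, $offset, $len) = @_;
    my $buf;
    defined(sysseek($fh, $offset, 0)) or die "sysseek: $!";
    my $rc = sysread($fh, $buf, $len);
    defined($rc) or die "sysread: $!";
    return $buf;
}

my ($fh, $name) = tempfile(UNLINK => 1);
defined(syswrite($fh, "id1\tpayload\n")) or die "syswrite: $!";
print read_at($fh, 4, 7), "\n";
```

Mixing seek with sysread only works as long as nothing has filled the
perl-level buffer for that handle, which is the caveat noted above.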


------------------------------

Date: Thu, 21 Jun 2012 14:01:55 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <87pq8s95to.fsf@sapphire.mobileactivedefense.com>

Rainer Weikusat <rweikusat@mssgmbh.com> writes:

[...]

>>> ----------------
>>> #!/usr/bin/perl
>>> #
>>> 
>>> use Errno qw(EMFILE ENFILE);
>>> 
>>> {
>>>     my ($in, $open, $id, %ids, @open, $input);
>>> 
>>>     while ($input = <STDIN>) {
>>> 	($id) = $input =~ /^([^\t]+)\t/;
>>> 	chop($input);
>>> 
>>> 	until (defined(open($out, '+>', $id))) {
>>> 	    die("open: $in: $!")
>>> 		unless ($! == EMFILE || $! == ENFILE) && @open;
>>
>> Copying the data out to temporary files and then reading it back in
>> again is *bound* to be more IO than seeking in the original file.

For a small test I made, this approach is indeed hopeless but not
because of the additional data I/O but because of the overhead of
creating, deleting, opening etc 'a really large number of files'. But
my 'data parts' were tiny compared to the 'many millions' of the OP
and things might still be different for this case.


------------------------------

Date: Thu, 21 Jun 2012 16:11:25 +0200
From: Martijn Lievaart <m@rtij.nl.invlalid>
Subject: Re: question concerning pipes and large strings
Message-Id: <dlmbb9-l6g.ln1@news.rtij.nl>

On Wed, 20 Jun 2012 16:29:56 +0100, Rainer Weikusat wrote:

> math math <mathematisch@gmail.com> writes:
>>> On 06/19/12 10:50, math math wrote:
>>> > Hi,
>>> >
>>>> I have a file with two tab delimited fields. First field is an ID,
>>>> the second field is a large string (up to hundreds of millions of
>>>> characters). The file may have many lines.
>>>>
>>>> I would like to sort the file on the first (ID) field and after this
>>>> sorting, merge the second fields (i.e. remove the new lines),
>>>> so that I get a single line with many hundreds of lines that are in
>>>> the same order appended to each other as their alphabetically sorted
>>>> IDs.
> 
> [...]
> 
>>> Probably sorting it first would make it much easier:
>>> 
>>> man sort
>>> 
>> Indeed, I tried sort first, it works, it is more of a scalability
>> question really.
> 
> This is a really bad idea because sort will reorder the complete input
> lines, including the data part, possible/ probably multiple times for
> each input line, and this means a lot of copying of data which doesn't
> need to be copied since only the IDs are supposed to be sorted.

As GNU sort is rather optimized, I would benchmark this before making 
blanket statements like this.

Also, we don't know if efficiency is relevant. If it runs only once a 
month, at night, the OP probably does not care if it takes a few hours as 
opposed to a few minutes.

That said, the requirements are rather unique, so there is also a good 
chance that sort will handle these files abysmally, chewing up all 
memory and disk-I/O and effectively bringing the machine to its knees.

So to the OP: Does sort -k1,1 run in acceptable time for you? If so, 
there is the first part of your answer, the second part is now rather 
trivial (or if you still run out of memory, more trivial than the 
original problem).
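
A sketch of what that second part might look like once the lines are
already sorted. It still reads one whole line into memory at a time, so
it only applies if that is acceptable. The name merge_sorted and the
in-memory handles are for illustration only.

```perl
use strict;
use warnings;

# Copy the data field of each "ID<TAB>data" line from $in to $out,
# dropping the ID and the newline so the data fields end up
# concatenated in input order.
sub merge_sorted {
    my ($in, $out) = @_;
    while (defined(my $line = <$in>)) {
        chomp $line;
        my (undef, $data) = split /\t/, $line, 2;
        print {$out} $data if defined $data;
    }
}

my $sorted = "a\tXX\nb\tYY\n";
my $merged = '';
open(my $in,  '<', \$sorted) or die "open: $!";
open(my $out, '>', \$merged) or die "open: $!";
merge_sorted($in, $out);
close $out;
print "$merged\n";
```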

(And if you do run out of memory, ask yourself, do I need a "works 
always" solution, or does just adding more memory solve your immediate 
problem)

HTH,
M4


------------------------------

Date: Thu, 21 Jun 2012 15:34:34 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <87mx3w3f9h.fsf@sapphire.mobileactivedefense.com>

Martijn Lievaart <m@rtij.nl.invlalid> writes:
> On Wed, 20 Jun 2012 16:29:56 +0100, Rainer Weikusat wrote:

[...]

>>>> man sort
>>>> 
>>> Indeed, I tried sort first, it works, it is more of a scalability
>>> question really.
>> 
>> This is a really bad idea because sort will reorder the complete input
>> lines, including the data part, possible/ probably multiple times for
>> each input line, and this means a lot of copying of data which doesn't
>> need to be copied since only the IDs are supposed to be sorted.
>
> As GNU sort is rather optimized, I would benchmark this before making 
> blanket statements like this.

'Rather optimized' usually means the code is seriously convoluted
because it used to run faster on some piece of ancient hardware in
1997 for a single test case because of that. And no matter how
'optimized', a sort program needs to sort its input. Which involves
reordering it. Completely. In case of files which are too large for
the memory of a modern computer, this involves a real lot of copying
data around.

I suggest that you make some benchmarks before making blanket
statements like the one above.

> Also, we don't know if efficiency is relevant. If it runs only once a 
> month, at night, the OP probably does not care if it takes a few hours as 
> opposed to a few minutes.

Efficiency is always relevant except in a single case: The guy who has
to write the code is so busy with getting it to work at all that the
mere thought of having to try to make it work sensibly scares the shit
out of him and he tries to pass this competence-deficit as 'secret
advantage' when posing for others. Usually, this will also always
involve a dedicated computer for testing and often, the people who are
going to use the code are not in the position to complain to the
person who wrote it, IOW, run-time efficiency doesn't matter because
it is someone else's problem.

Congratulate yourself on the happy situation you happen to be in. Stop
assuming that it is 'the universal situation'. Things might look
rather different if code is written for in-house use and supposed to
run on a computer which also provides VPN services for customers
coming from fifty different companies.


------------------------------

Date: Thu, 21 Jun 2012 17:36:51 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: question concerning pipes and large strings
Message-Id: <87lijghba4.fsf@sapphire.mobileactivedefense.com>

Rainer Weikusat <rweikusat@mssgmbh.com> writes:
> Martijn Lievaart <m@rtij.nl.invlalid> writes:
>> On Wed, 20 Jun 2012 16:29:56 +0100, Rainer Weikusat wrote:
>
> [...]
>
>>>>> man sort
>>>>> 
>>>> Indeed, I tried sort first, it works, it is more of a scalability
>>>> question really.
>>> 
>>> This is a really bad idea because sort will reorder the complete input
>>> lines, including the data part, possible/ probably multiple times for
>>> each input line, and this means a lot of copying of data which doesn't
>>> need to be copied since only the IDs are supposed to be sorted.
>>
>> As GNU sort is rather optimized, I would benchmark this before making 
>> blanket statements like this.
>
> 'Rather optimized' usually means the code is seriously convoluted
> because it used to run faster on some piece of ancient hardware in
> 1997 for a single test case because of that. And no matter how
> 'optimized', a sort program needs to sort its input. Which involves
> reordering it. Completely. In case of files which are too large for
> the memory of a modern computer, this involves a real lot of copying
> data around.
>
> I suggest that you make some benchmarks before making blanket
> statements like the one above.

On some random computer I just used for that, sorting a 1080M file
(4000000 lines) with sort using the first column as key and sending
output to /dev/null (average from three runs) comes out at 40.9s
wallclock/ 6.4 user/ 3.2sys. Using one of the Perl scripts I posted
(with unused code removed) to extract the ID from each input line and
sort the list of IDs takes 9.5w/ 7.8u/ 0.7s. I didn't check if sort
used temporary files for this but it doesn't really matter because
sort is guaranteed to lose out if the file is only large enough. In
this case, the volume of data it needed to deal with was 1,132,259,700
bytes while the Perl script only needed to sort 28,000,000 bytes. And
the average line length was only 283 bytes in this case, not the 'many
millions' of the original problem.
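
The idea of sorting only the IDs can be sketched roughly as follows. This
is not the script that was benchmarked, just a small illustration with an
invented sub name (index_ids) and an in-memory file. It keeps one
[id, data_start, line_end] entry per line and sorts those entries, never
the data itself.

```perl
use strict;
use warnings;

# Build an index of [id, data_start, line_end] byte offsets, one entry
# per "ID<TAB>data" line, and return it sorted by ID. line_end is the
# offset just past the line's newline.
sub index_ids {
    my ($fh) = @_;
    my @index;
    my $start = tell($fh);
    while (defined(my $line = <$fh>)) {
        my $end = tell($fh);
        if ($line =~ /^([^\t]+)\t/) {
            push @index, [$1, $start + length($1) + 1, $end];
        }
        $start = $end;
    }
    return sort { $a->[0] cmp $b->[0] } @index;
}

my $data = "b\tYY\na\tXXX\n";
open(my $fh, '<', \$data) or die "open: $!";
print join(' ', @$_), "\n" for index_ids($fh);
```

The data can then be fetched by seeking to each recorded offset in sorted
order.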



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3722
***************************************

