
Perl-Users Digest, Issue: 2814 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Feb 11 14:09:33 2010

Date: Thu, 11 Feb 2010 11:09:13 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 11 Feb 2010     Volume: 11 Number: 2814

Today's topics:
    Re: capturing multiple patterns per line <source@netcom.com>
    Re: capturing multiple patterns per line <hjp-usenet2@hjp.at>
    Re: capturing multiple patterns per line sln@netherlands.com
    Re: capturing multiple patterns per line <jl_post@hotmail.com>
    Re: comparing lists <xhoster@gmail.com>
    Re: comparing lists <hjp-usenet2@hjp.at>
    Re: comparing lists <cartercc@gmail.com>
    Re: comparing lists <cartercc@gmail.com>
    Re: comparing lists <hjp-usenet2@hjp.at>
    Re: comparing lists <cartercc@gmail.com>
    Re: How to do variable-width look-behind? <derykus@gmail.com>
    Re: How to do variable-width look-behind? <jl_post@hotmail.com>
        Is a merge interval function available? <pengyu.ut@gmail.com>
    Re: Is a merge interval function available? <ben@morrow.me.uk>
    Re: look up very large table sln@netherlands.com
    Re: look up very large table <xhoster@gmail.com>
    Re: look up very large table <xhoster@gmail.com>
    Re: look up very large table <jurgenex@hotmail.com>
    Re: shebang and ubuntu <Phred@example.invalid>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Wed, 10 Feb 2010 18:03:24 -0800
From: David Harmon <source@netcom.com>
Subject: Re: capturing multiple patterns per line
Message-Id: <7YudnX8xe4Tt-O7WnZ2dnUVZ_jhi4p2d@earthlink.com>

On Fri, 5 Feb 2010 09:12:54 -0800 (PST) in comp.lang.perl.misc, ccc31807
<cartercc@gmail.com> wrote,
>On Feb 5, 11:58 am, Willem <wil...@stack.nl> wrote:
>>   while (<DATA>)
>>   {
>>     push @urls, /<a.*?href="(.*?)"/gi;
>>   }
>
>Yes, yes, yes, you are entirely right. I thought that the non-greedy
>modifier might do the trick, but

Instead of .*? I think [^>]*? would be more accurate.
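For example, a quick demonstration of the difference (the sample markup
below is invented, not from the thread):

```perl
use strict;
use warnings;

# With .*? the match can run past the end of the <a> tag and pick up
# an href from a later, unrelated tag; [^>]*? keeps the match inside
# a single tag.
my $line = '<a name="top">anchor</a> and <b href="oops">';

my @loose = $line =~ /<a.*?href="(.*?)"/gi;     # wrongly grabs "oops"
my @tight = $line =~ /<a[^>]*?href="(.*?)"/gi;  # no match at all

print "loose: @loose\n";
print "tight: @tight\n";
```

(As hp points out downthread, [^>]*? has its own failure mode when a
quoted value legitimately contains ">".)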



------------------------------

Date: Thu, 11 Feb 2010 13:21:41 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: capturing multiple patterns per line
Message-Id: <slrnhn7tim.10m.hjp-usenet2@hrunkner.hjp.at>

On 2010-02-11 02:03, David Harmon <source@netcom.com> wrote:
> On Fri, 5 Feb 2010 09:12:54 -0800 (PST) in comp.lang.perl.misc, ccc31807
><cartercc@gmail.com> wrote,
>>On Feb 5, 11:58 am, Willem <wil...@stack.nl> wrote:
>>>   while (<DATA>)
>>>   {
>>>     push @urls, /<a.*?href="(.*?)"/gi;
>>>   }
>>
>>Yes, yes, yes, you are entirely right. I thought that the non-greedy
>>modifier might do the trick, but
>
> Instead of .*? I think [^>]*? would be more accurate.

Nope. ">" is allowed in a double-quoted parameter value.

	hp


------------------------------

Date: Thu, 11 Feb 2010 08:34:23 -0800
From: sln@netherlands.com
Subject: Re: capturing multiple patterns per line
Message-Id: <cj98n51fn2v41nv2fcvt4n76gcjst36lr4@4ax.com>

On Thu, 11 Feb 2010 13:21:41 +0100, "Peter J. Holzer" <hjp-usenet2@hjp.at> wrote:

>On 2010-02-11 02:03, David Harmon <source@netcom.com> wrote:
>> On Fri, 5 Feb 2010 09:12:54 -0800 (PST) in comp.lang.perl.misc, ccc31807
>><cartercc@gmail.com> wrote,
>>>On Feb 5, 11:58 am, Willem <wil...@stack.nl> wrote:
>>>>   while (<DATA>)
>>>>   {
>>>>     push @urls, /<a.*?href="(.*?)"/gi;
>>>>   }
>>>
>>>Yes, yes, yes, you are entirely right. I thought that the non-greedy
>>>modifier might do the trick, but
>>
>> Instead of .*? I think [^>]*? would be more accurate.
>
>Nope. ">" is allowed in a double-quoted parameter value.
>
>	hp

In single quotes as well.
Yes, > is allowed in a double/single-quoted attval.
It's also allowed in content surrounded by quotes.

So, CC's regex will match: <a/>href=" > "
Clearly, a guard must be in place to thwart this.
[^>]*? is a good candidate but where do you put it?

CC's regex will also match: <aBBB  Zhref="some stuff"
So, it's not really a good regex for this.

However, you can use [^>]*? to flesh out the tag-att/val form.
There are 5 or 6 sub-pattern forms in an expression.
At least 1 complete form for tag-att/val's is needed.

A complete sub-pattern (form), that will parse any tag-att/val
markup is this:
  <(?:($Name)(\s+(?:(?:".*?")|(?:'.*?')|(?:[^>]*?))+)\s*(\/?))>

Where the tag name, ".*?", '.*?', and [^>]*? consume all valid text
between <>. Easier said than done. After this, further parsing of the
capture groups is necessary to separate data and detect errors.

The form above can be combined with the secondary parsing when there
is specific information available, like CC's <a href= .. data.
Still, a complete form is needed.

As a side note, XML is stricter than HTML when it comes to quoting
values in att/val pairs. HTML is not so strict and allows for unquoted
vals and standalone unquoted attributes as well.
The form above accommodates both; strictures can be enforced later,
and the bottom line is that the *form* integrity is maintained in the
stream and does not overflow into invalid territory.

So, CC's regex could be made into a combined, modified form, though
still inadequate because it is a standalone form where other forms
are missing that could negate the results.

Yes, you were right about the ">", but without [^>]*? in a couple
of places, it won't work:

  /<a\s+[^>]*?(?<=\s)href\s*=\s*["'](.+?)["'][^>]*?\s*\/?>/
  or
  /<a\s+[^>]*?(?<=\s)href\s*=\s*(".+?"|'.+?')[^>]*?\s*\/?>/  # quotes captured
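For illustration, the first pattern above could be applied in a /g
capture loop like this (the sample markup is invented; per the caveats
above, this still only handles quoted values):

```perl
use strict;
use warnings;

# A minimal sketch applying the first pattern above in a global-match
# loop over a string of invented markup.
my $html = '<p><a class="x" href="http://example.com/a">A</a> '
         . "<A HREF='http://example.com/b'>B</A></p>";

my @urls;
while ($html =~ /<a\s+[^>]*?(?<=\s)href\s*=\s*["'](.+?)["'][^>]*?\s*\/?>/gi) {
    push @urls, $1;
}
print "$_\n" for @urls;
```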

-sln







------------------------------

Date: Thu, 11 Feb 2010 09:07:44 -0800 (PST)
From: "jl_post@hotmail.com" <jl_post@hotmail.com>
Subject: Re: capturing multiple patterns per line
Message-Id: <a57dcba7-5db0-4b2f-9cad-76c38dd8b347@m35g2000prh.googlegroups.com>

On Feb 5, 9:17 am, ccc31807 <carte...@gmail.com> wrote:
> Suppose I am parsing a file line by line, and I want to push to an
> array all substrings on that line that match a pattern. For example,
> consider the listing below. @urls SHOULD contain this: @urls =
> (http://google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
> Instead, it contains only the last value. Using the g modifier doesn't
> help.

   (My apologies if someone has already answered to your
satisfaction.)

   Try using the /g modifier, changing "if" to "while", and changing
"<a.*href" to just "href" (since "a" and "href" are not guaranteed to
occur together on the same line).  So your script would look like:

-------listing---------------
use strict;
use warnings;
my @urls;
while (<DATA>)
{
   while (/href="([^"]+)/g) { push @urls, $1; }
}

print join "\n", @urls;
__DATA__
<html>\n
<body>\n
<h1>My Favorite Sites</h1>\n
<p>\n
My favorite sites are <a href="http://google.com">Google</a>, <a
href="http://yahoo.com">Yahoo</a>, <a href="http://amazon.com">Amazon</a>,
and <a href="http://ebay.com">Ebay</a>.\n
</p>\n
</body>\n
</html>\n
-------end of listing---------------

   (I added a call to join() in the print() statement to make the
output a little easier to read.)

   Running this modified program, I get as output:

http://google.com
http://yahoo.com
http://amazon.com
http://ebay.com

This is what you want, right?

   (And consider using the /i modifier, as HTML tags are not required
to be lower-case.)

   Hope this helps,

   -- Jean-Luc


------------------------------

Date: Wed, 10 Feb 2010 20:21:59 -0800
From: Xho Jingleheimerschmidt <xhoster@gmail.com>
Subject: Re: comparing lists
Message-Id: <4b739c6a$0$25660$ed362ca5@nr5-q3a.newsreader.com>

ccc31807 wrote:
> 
> During the next several weeks, I've been tasked with taking three data
> files, comparing the keys of each file, and if the keys are identical,
> processing the file but if not, printing out a list of differences,

With what information?  Just the name of the key that fails to appear in 
all files, or do you have to identify which one or two out of three it 
appears in?

> which in effect means printing out the different keys. The keys are
> all seven digit integers. (Each file is to be generated by a different
> query of the same database.)

Since you already got it in a database, how about something like:

select key, count(1) from (union of all three queries) group by key 
having count(1) != 3;

> Okay, I could use diff for this, but I'd like to do it
> programmatically. Using brute force, I could generate three files with
> just the keys and compare them line by line, but I'd like not to do
> this for several reasons, but mostly because the data files are pretty
> much guaranteed to be identical and we don't expect there to be any
> differences.

That reason doesn't make much sense.  The fact that the files are pretty 
much guaranteed to be identical can be used to argue against *any* 
proposed method, not just the line-by-line method.

> I'm thinking about hashing the keys in the three files and comparing
> the key digests, with the assumption that identical hashes means
> identical files.

I don't know of any hashing functions that have both a very low chance 
of collision, and are indifferent to the order in which the strings are 
added into it.  And if you have to sort the keys so they are in the same 
order, then you might as well do the line by line thing.

Xho


------------------------------

Date: Thu, 11 Feb 2010 13:20:27 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: comparing lists
Message-Id: <slrnhn7tgc.10m.hjp-usenet2@hrunkner.hjp.at>

On 2010-02-10 14:36, ccc31807 <cartercc@gmail.com> wrote:

[comparing three files]

> Okay, I could use diff for this, but I'd like to do it
> programmatically.

diff isn't a program?

> Using brute force, I could generate three files with
> just the keys and compare them line by line, but I'd like not to do
> this for several reasons, but mostly because the data files are pretty
> much guaranteed to be identical and we don't expect there to be any
> differences.

If the files are "pretty much guaranteed to be identical" you could just
compute a hash for each file and compare the hashes. If they are the
same, you are done. Only if they aren't (which is "pretty much
guaranteed" not to happen) do you need to worry about finding the
differences.
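A minimal sketch of that whole-file approach, using Digest::SHA (the
module choice and sample data are mine, for illustration; here the
sample files are created on the fly so the sketch is self-contained):

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);
use File::Temp qw(tempfile);

# Fingerprint each file with a strong hash and compare fingerprints;
# only hunt for the actual differences when the fingerprints disagree.
my @paths;
for my $data ("1234567\n7654321\n", "1234567\n7654321\n", "1234567\n9999999\n") {
    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} $data;
    close $fh or die "close: $!";
    push @paths, $path;
}

sub file_digest {
    my ($path) = @_;
    open my $fh, '<', $path or die "$path: $!";
    local $/;                       # slurp the whole file
    return sha256_hex(scalar <$fh>);
}

my %seen = map { file_digest($_) => 1 } @paths;
print keys(%seen) == 1
    ? "files are identical\n"
    : "files differ; fall back to a key-by-key comparison\n";
```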

	hp



------------------------------

Date: Thu, 11 Feb 2010 06:50:45 -0800 (PST)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: comparing lists
Message-Id: <f933ec88-7322-48d9-9f1e-f7b9e4d3b3df@q27g2000yqn.googlegroups.com>

On Feb 11, 7:20 am, "Peter J. Holzer" <hjp-usen...@hjp.at> wrote:
> > Okay, I could use diff for this, but I'd like to do it
> > programmatically.
>
> diff isn't a program?

I process the (main) file with a Perl script, and I don't want to do
in two steps what I can do in one, that is, including a function in
the existing script to compare the three files.

> If the files are "pretty much guaranteed to be identical" you could just
> compute a hash for each file and compare the hashes. If they are the
> same, you are done. Only if they aren't (which is "pretty much
> guaranteed" not to happen) do you need to worry about finding the
> differences.

As it turns out, with a couple of days' experience and several
attempts, I wound up creating three hashes, one for each file, with
the IDs as keys and the name of the file as the values. I iterate
through the 'main' hash, and if the hash element exists in all three
hashes I delete it. I then print the hashes. It's kinda' crude, but it
was easy to do, doesn't take long, and gives me what I need.
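In code, that approach might look roughly like this (the IDs and file
names here are invented):

```perl
use strict;
use warnings;

# One hash per file, keyed by ID, with the file name as the value;
# delete every ID found in all three, then print whatever is left.
my %file1 = map { $_ => 'file1' } qw(1000001 1000002 1000003);
my %file2 = map { $_ => 'file2' } qw(1000001 1000002);
my %file3 = map { $_ => 'file3' } qw(1000001 1000002 1000003);

for my $id (keys %file1) {
    if (exists $file2{$id} and exists $file3{$id}) {
        delete $_->{$id} for \%file1, \%file2, \%file3;
    }
}

# Whatever survives is missing from at least one of the files:
for my $h (\%file1, \%file2, \%file3) {
    print "$_ (seen in $h->{$_}) is missing elsewhere\n" for sort keys %$h;
}
```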

Thanks, CC.


------------------------------

Date: Thu, 11 Feb 2010 06:58:29 -0800 (PST)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: comparing lists
Message-Id: <acc7131a-8279-49f9-9693-f789d8caf37e@o3g2000yqb.googlegroups.com>

On Feb 10, 11:21 pm, Xho Jingleheimerschmidt <xhos...@gmail.com>
wrote:
> With what information?  Just the name of the key that fails to appear in
> all files, or do you have to identify which one or two out of three it
> appears in?

Just the key.

> Since you already got it in a database, how about something like:

Unfortunately, this is a non-SQL, non-relational, non-first-normal-
form flat file database (IBM's UniData) over a WAN connection, and
it's a lot more practical to glob the data and process it locally.


> > this for several reasons, but mostly because the data files are pretty
> > much guaranteed to be identical and we don't expect there to be any
> > differences.
>
> That reason doesn't make much sense.  The fact that the files are pretty
> much guaranteed to be identical can be used to argue against *any*
> proposed method, not just the line-by-line method.

See my reply to PJH. The 'official' query is highly impractical for my
unit, and we have written two other queries to replace it. We just
want to make sure that the data derived from all three queries is the
same before we make any changes.

> I don't know of any hashing functions that have both a very low chance
> of collision, and are indifferent to the order in which the strings are
> added into it.  And if you have to sort the keys so they are in the same
> order, then you might as well do the line by line thing.

Obviously, the keys would have to be in order. As it turns out, the
size of the files is much less than I anticipated, so O(n) works just
fine.

CC.


------------------------------

Date: Thu, 11 Feb 2010 17:46:11 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: comparing lists
Message-Id: <slrnhn8d2l.cp8.hjp-usenet2@hrunkner.hjp.at>

On 2010-02-11 14:50, ccc31807 <cartercc@gmail.com> wrote:
> On Feb 11, 7:20 am, "Peter J. Holzer" <hjp-usen...@hjp.at> wrote:
>> If the files are "pretty much guaranteed to be identical" you could just
>> compute a hash for each file and compare the hashes. If they are the
>> same, you are done. Only if they aren't (which is "pretty much
>> guaranteed" not to happen) do you need to worry about finding the
>> differences.
>
> As it turns out, with a couple of days experience and several
> attempts, I would up creating three hashes,

I just realized that my use of the word "hash" was ambiguous: I meant
result of a strong hash-function such as SHA-1, not a Perl hash.

	hp


------------------------------

Date: Thu, 11 Feb 2010 10:04:41 -0800 (PST)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: comparing lists
Message-Id: <907f3aeb-1cb5-46e0-a61e-0f615c6788de@g28g2000yqh.googlegroups.com>

On Feb 11, 11:46 am, "Peter J. Holzer" <hjp-usen...@hjp.at> wrote:
> I just realized that my use of the word "hash" was ambiguous: I meant
> result of a strong hash-function such as SHA-1, not a Perl hash.

That's okay. I figured out what you meant.

CC.


------------------------------

Date: Thu, 11 Feb 2010 00:55:31 -0800 (PST)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: How to do variable-width look-behind?
Message-Id: <c013751a-dfb6-4e36-bf5c-8b6aec04cfec@a5g2000prg.googlegroups.com>

On Feb 9, 3:23 pm, "jl_p...@hotmail.com" <jl_p...@hotmail.com> wrote:
> Hi,
>
>    I have a Perl script that processes multi-line input.  The problem
> is, sometimes this input has newlines stuck in arbitrary places (such
> as right in the middle of a valid token).  This makes the input out-of-
> spec, but I have no control over this, so I want to correct it if I
> can.  What's more, sometimes this newline breaks a token in two,
> where the first half still looks like a valid token while the other
> does not, and vice-versa.
>
>    I'm trying to modify my Perl script so that it reviews every
> newline and sees if it should be discarded.  The logic I want to use is
> to throw out every newline UNLESS it is flanked (on both sides) by
> valid tokens.  I would like to be able to do something like this:
>
>    # Create a regular expression that matches tokens
>    # like "N50E40", "N50 E40", "N5000 E4000",
>    # "50N40E", "50N 40E", and "5000N4000E":
>    my $tokenRegExp = qr/\b(?:[NS]\d+\s*[EW]\d+|\d+[NS]\s*\d+[EW])\b/;
>
>    # Remove newlines that are not surrounded by valid tokens:
>    $input =~ s/(?<!$tokenRegExp)\n(?=$tokenRegExp)//g;  # no token before
>    $input =~ s/(?<=$tokenRegExp)\n(?!$tokenRegExp)//g;  # no token after
>    $input =~ s/(?<!$tokenRegExp)\n(?!$tokenRegExp)//g;  # no tokens
>
>    The problem is that the look-behind assertions (both positive
> and negative) only work for fixed-width expressions, according to
> "perldoc perlre".  Unfortunately, it would be so useful for me to be
> able to match a string with a variable look-behind, that I'm hoping
> there's a logical work-around to this limitation.
>
>    Is there any way for me to work around this limitation?

IIUC you could use 5.10's more efficient counterparts for $`, $', $&:

while ( $input =~ m/ \n /gpx ) {  # note /p switch: perldoc perlre

      my( $pre, $post ) = ( ${^PREMATCH}, ${^POSTMATCH} );

      unless ( $pre  =~ / $tokenRegExp $/x   and
               $post =~ / ^ $tokenRegExp /x  )
      {
	      substr($input, pos($input)-1, 1, "" );
      }
}

--
Charles DeRykus



------------------------------

Date: Thu, 11 Feb 2010 08:17:51 -0800 (PST)
From: "jl_post@hotmail.com" <jl_post@hotmail.com>
Subject: Re: How to do variable-width look-behind?
Message-Id: <f246f1f1-57f6-49bb-8c9a-1fb8e5f5a42c@g28g2000prb.googlegroups.com>

On Feb 9, 4:23 pm, "jl_p...@hotmail.com" <jl_p...@hotmail.com> wrote:
>
> I would like to be able to do something like this:
>
>    # Remove newlines that are not surrounded by valid tokens:
>    $input =~ s/(?<!$tokenRegExp)\n(?=$tokenRegExp)//g;
>    $input =~ s/(?<=$tokenRegExp)\n(?!$tokenRegExp)//g;
>    $input =~ s/(?<!$tokenRegExp)\n(?!$tokenRegExp)//g;
>
>    The problem is that the look-behind assertions (both positive
> and negative) only work for fixed-width expressions, according to
> "perldoc perlre".


Ben Morrow replied:
> you could use the usual solution for positive look-behind:
>
>     s/($tokenRegExp)\n(?=$tokenRegExp)/$1\0/g;
>     s/\n//g;
>     s/\0/\n/g;

   Ah.  Thanks for showing me.  Yes, that would work for my purposes,
as null bytes (and a whole lot of other characters) are guaranteed to
not appear in my $text.

   Before I had read your reply, I came up with a solution of my own.
It's not as simple as yours, but it did appear to work correctly:
First, I split the $text into an array of lines, and then looped
through every pair of lines.  If a line does not end in the token
or its next line does not begin with the token, then I chomp() that
line.  Then I set $text to the lines.

   Here's essentially what I did:

   $text = do {
      my @lines = split m/(?<=\n)/, $text;

      foreach my $i (0 .. $#lines-1)
      {
         my ($current, $next) = @lines[$i, $i+1];

         # Skip removing newline if surrounded by tokens:
         next  if     $current =~ m/$tokenRegExp$/
                  and $next =~ m/^$tokenRegExp/;

         # This is a linebreak we want to remove:
         chomp($lines[$i]);
      }

      join '', @lines  # "returns" the lines into $text
   };

   It's not quite as elegant or simple as your solution, but it did
appear to work well.


Charles DeRykus replied:
> you could use 5.10's more efficient counterparts for $`, $', $&:
>
> while ( $input =~ m/ \n /gpx )
> {  # note /p switch: perldoc perlre
>       my( $pre, $post ) = ( ${^PREMATCH}, ${^POSTMATCH} );
>       unless ( $pre  =~ / $tokenRegExp $/x   and
>                $post =~ / ^ $tokenRegExp /x  )
>       {
>          substr($input, pos($input)-1, 1, "" );
>       }
> }

   Oh, wow!  I never knew about the /p switch!  (That's especially odd
since I look up "perldoc perlre" fairly frequently.)  Thanks for
telling me about it.  Since the script I'm working on must run on
machines that aren't guaranteed to have Perl 5.10, I won't use it
right now, but I'll keep it in mind for scripts of my own use.

   Since $input is being modified inside its own while($input =~ m//g)
loop, I might suggest considering saving off pos($input) before the
substr() and then restoring it right after.  Otherwise, the while-
match will start back at the beginning of $input.

   (That might not be a problem in this case, but saving and restoring
the pos() at least ensures that the while-match loop won't revisit
parts of $input that have already been processed.)
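Concretely, that save/restore might look like this (a runnable sketch
with invented sample data; the 5.10-only ${^PREMATCH}/${^POSTMATCH}
are replaced with plain substr() so it runs on older perls):

```perl
use strict;
use warnings;

# Remove every newline not flanked on both sides by valid tokens,
# saving pos() before the in-place edit and restoring it after, so
# the while-match resumes where it left off instead of restarting.
my $tokenRegExp = qr/\b(?:[NS]\d+\s*[EW]\d+|\d+[NS]\s*\d+[EW])\b/;
my $input = "N50E40\nN5000 E4000\nbroken li\nne here\n";

while ( $input =~ m/\n/g ) {
    my $pos  = pos($input);
    my $pre  = substr($input, 0, $pos - 1);   # text before this newline
    my $post = substr($input, $pos);          # text after this newline

    unless ( $pre =~ /$tokenRegExp$/ and $post =~ /^$tokenRegExp/ ) {
        substr($input, $pos - 1, 1, "");      # drop this newline...
        pos($input) = $pos - 1;               # ...and resume in place
    }
}
print $input;
```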

   Anyway, thanks for your help and "$input", Ben and Charles.  I
appreciate it!

   -- Jean-Luc


------------------------------

Date: Wed, 10 Feb 2010 15:16:40 -0800 (PST)
From: Peng Yu <pengyu.ut@gmail.com>
Subject: Is a merge interval function available?
Message-Id: <877eb871-eefe-4125-a95b-03c46c84420e@q4g2000yqm.googlegroups.com>

I'm wondering whether there is already a function in the Perl library
that can merge intervals. For example, if I have the following intervals
('[' and ']' mean a closed interval as in
http://en.wikipedia.org/wiki/Interval_(mathematics)#Excluding_the_endpoints)

[1, 3]
[2, 9]
[10,13]
[11,12]

I want to get the following merged intervals.

[1,9]
[10,13]

Could somebody let me know if there is a function in the Perl library?


------------------------------

Date: Thu, 11 Feb 2010 00:03:28 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Is a merge interval function available?
Message-Id: <g3ub47-7o02.ln1@osiris.mauzo.dyndns.org>


Quoth Peng Yu <pengyu.ut@gmail.com>:
> I'm wondering whether there is already a function in the Perl library
> that can merge intervals. For example, if I have the following intervals
> ('[' and ']' mean a closed interval as in
> http://en.wikipedia.org/wiki/Interval_(mathematics)#Excluding_the_endpoints)
> 
> [1, 3]
> [2, 9]
> [10,13]
> [11,12]
> 
> I want to get the following merged intervals.
> 
> [1,9]
> [10,13]
> 
> Could somebody let me know if there is a function in the perl library?

A quick search of CPAN doesn't turn up anything obvious, but something
like this should work:

    use 5.010;
    use strict;
    use warnings;
    
    use List::Util qw/max/;

    sub closed_cover {
        my @in = sort {
            $a->[0] <=> $b->[0] ||
            $a->[1] <=> $b->[1]
        } @_;

        my $last = shift @in;
        my @out;

        while (my $next = shift @in) {
            if ($next->[0] <= $last->[1]) {
                $last->[1] = max $last->[1], $next->[1];
            }
            else {
                push @out, $last;
                $last = $next;
            }
        }

        return @out, $last;
    }

    say "[${$_}[0], ${$_}[1]]" for closed_cover
        [1, 3], [2, 9], [10, 13], [11, 12];

Ben



------------------------------

Date: Wed, 10 Feb 2010 12:05:01 -0800
From: sln@netherlands.com
Subject: Re: look up very large table
Message-Id: <5146n5d2dalgdcc4k2fonhnpjqts8i1gnf@4ax.com>

On Wed, 10 Feb 2010 18:57:02 +0800, "ela" <ela@yantai.org> wrote:

>I have some large data in pieces, e.g.
>
>asia.gz.tar 300M
>
>or
>
>roads1.gz.tar 100M
>roads2.gz.tar 100M
>roads3.gz.tar 100M
>roads4.gz.tar 100M
>
>I wonder whether I should concatenate them all into a single ultra large 
>file and then perform parsing them into a large table (I don't know whether 
>perl can handle that...).
>
>The final table should look like this:
>
[snip examples that don't convey info]

>Any advice? or should i resort to some other languages?
>

Yes, go back to the database that produced these files
and run a different query to get the info you need.

If you're not still with the company who owns this information,
I suggest you contact them for permission to use this information.

-sln


------------------------------

Date: Wed, 10 Feb 2010 19:22:23 -0800
From: Xho Jingleheimerschmidt <xhoster@gmail.com>
Subject: Re: look up very large table
Message-Id: <4b739c69$0$25660$ed362ca5@nr5-q3a.newsreader.com>

Jürgen Exner wrote:
> 
> However the real solution would be to load the whole enchilada into a
> database and then do whatever join you want to do. There is a reason why
> database system have been created and optimized for exactly such tasks.

Database systems are generally created for atomicity, concurrency, 
isolation, and durability, which is quite a bit more than this task 
seems to consist of.  It is my general experience that in this type of 
task, a Perl script could be written and have completed its job while 
the database system is still tying its shoes.


Xho


------------------------------

Date: Wed, 10 Feb 2010 19:14:40 -0800
From: Xho Jingleheimerschmidt <xhoster@gmail.com>
Subject: Re: look up very large table
Message-Id: <4b739c68$0$25665$ed362ca5@nr5-q3a.newsreader.com>

ela wrote:
> I have some large data in pieces, e.g.
> 
> asia.gz.tar 300M
> 
> or
> 
> roads1.gz.tar 100M
> roads2.gz.tar 100M
> roads3.gz.tar 100M
> roads4.gz.tar 100M

The data is first gzipped and then tarred?  That is an odd way of doing 
things.

> I wonder whether I should concatenate them all into a single ultra large 
> file 

I see no reason to do that.  Especially as I don't think tar format 
supports that cleanly, does it?

> and then perform parsing them into a large table (I don't know whether 
> perl can handle that...).

I bet it can.

> 
> The final table should look like this:
> 
> ID1  ID2  INFO
> X1   Y9     san diego; california; West Coast; America; North America; Earth
> X2.3   H9     Beijing; China; Asia
> .....
> 
> each row may come from a big file of >100M (as aforementioned):
> 
> CITY    Beijing
> NOTE    Capital
> RACE    Chinese
> ....

What is the "...." hiding?  100M is an awful lot of "...."

Each file is turned into only one row?  And each file is 100M?  So how 
many rows do you anticipate having?


> And then I have another much smaller table which contains all the ID's 
> (either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make 
> this 20M file annotated with the INFO. Hashing seems not to be a solution 
> for my 32G, 8-core machine...

Why not?

> Any advice? or should i resort to some other languages?

Your description is too vague to give any reasonable advice.


Xho


------------------------------

Date: Wed, 10 Feb 2010 23:04:09 -0800
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: look up very large table
Message-Id: <dka7n5p2conjvk7igqhsfbjarv7br1454n@4ax.com>

Xho Jingleheimerschmidt <xhoster@gmail.com> wrote:
>Jürgen Exner wrote:
>> 
>> However the real solution would be to load the whole enchilada into a
>> database and then do whatever join you want to do. There is a reason why
>> database system have been created and optimized for exactly such tasks.
>
>Database systems are generally created for atomicity, concurrency, 
>isolation, and durability, which is quite a bit more than this task 
>seems to consist of.  It is my general experience that in this type of 
>task, a Perl script could be written and have completed its job while 
>the database system is still tying its shoes.

Certainly true. But they are also designed to handle vast amounts of
data efficiently. And if the OP indeed runs into space issues then a DB
system may (just may!) provide an easier to use and even faster
alternative to looping through files over and over again. 

Whether it actually is an advantage is hard to tell. I agree, the OP's
task seems to be easy enough to be solved in a single pass. But the
description was rather cryptic, too, so there might be more
cross-references going on than either of us is expecting at this time.

jue


------------------------------

Date: Thu, 11 Feb 2010 11:43:03 -0700
From: Phred Phungus <Phred@example.invalid>
Subject: Re: shebang and ubuntu
Message-Id: <7tj1dpF71lU1@mid.individual.net>

Keith Thompson wrote:
> Phred Phungus <Phred@example.invalid> writes:
> [...]
>> dan@dan-desktop:~/source42$ chmod u + x t1.pl
>> chmod: invalid mode: `u'
>> Try `chmod --help' for more information.
>> dan@dan-desktop:~/source42$ chmod u+x t1.pl
>> dan@dan-desktop:~/source42$ ls -l
>> total 32
>> -rw-r--r-- 1 dan dan  2556 2010-02-07 18:46 b1.c
>> -rw-r--r-- 1 dan dan  2555 2010-02-07 18:46 b1.c~
>> -rwxr-xr-x 1 dan dan 13344 2010-02-07 18:47 out
>> -rwxr--r-- 1 dan dan   138 2010-02-08 01:34 t1.pl
>> -rw-r--r-- 1 dan dan    31 2010-02-08 01:30 t1.pl~
> [...]
> 
> Personally, I find the repeated occurrences of your rather long
> shell prompt distracting.  They make it more difficult to read the
> actual information.  I'm sure I'm not the only one who thinks so.
> 
> I suggest shortening your prompt to something like "$ ".  You can
> do this by (carefully!) editing the output after you copy-and-paste
> it, but that risks losing information.  It's probably safer to
> change your prompt, run the commands, copy-and-paste the output,
> and then change it back.  (I'm not suggesting you should change the
> prompt for your own use, only for what you post here.)
> 
> For example, the above would be:
> 
> $ chmod u + x t1.pl
> chmod: invalid mode: `u'
> Try `chmod --help' for more information.
> $ chmod u+x t1.pl
> $ ls -l
> total 32
> -rw-r--r-- 1 dan dan  2556 2010-02-07 18:46 b1.c
> -rw-r--r-- 1 dan dan  2555 2010-02-07 18:46 b1.c~
> -rwxr-xr-x 1 dan dan 13344 2010-02-07 18:47 out
> -rwxr--r-- 1 dan dan   138 2010-02-08 01:34 t1.pl
> -rw-r--r-- 1 dan dan    31 2010-02-08 01:30 t1.pl~
> 
> which I find much easier to read.
> 
> The only useful part of the prompt is the current directory, but
> you could achieve that by using an explicit "cd" command.  (It's not
> really necessary here, since knowing that you're in ~/source42
> doesn't help us.)
> 

I think Keith's point goes to readability on usenet, which can stand a 
couple posts without disrupting anything.  I had also wanted to shorten 
that prompt, which one does by editing the .bashrc that appears in your 
home folder.  One replaces occurrences of PS1=... with export PS1="\$ " 
and gets

$ echo "looking better?"
looking better?
$

-- 
fred




------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 2814
***************************************

