[31554] in Perl-Users-Digest
Perl-Users Digest, Issue: 2813 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Wed Feb 10 14:09:25 2010
Date: Wed, 10 Feb 2010 11:09:08 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Wed, 10 Feb 2010 Volume: 11 Number: 2813
Today's topics:
Re: can I get a new httpd.conf file <ben@morrow.me.uk>
Re: can I get a new httpd.conf file <john@castleamber.com>
Re: can I get a new httpd.conf file <john@castleamber.com>
Re: can I get a new httpd.conf file <smallpond@juno.com>
comparing lists <cartercc@gmail.com>
Re: comparing lists <jurgenex@hotmail.com>
Re: help with big numbers and DBI (hymie!)
Re: How to do variable-width look-behind? <ben@morrow.me.uk>
Re: look up very large table <jimsgibson@gmail.com>
Re: look up very large table <cartercc@gmail.com>
Re: look up very large table <john@castleamber.com>
Re: look up very large table <jurgenex@hotmail.com>
Re: Math not working <ben@morrow.me.uk>
Re: perl and sendmail speed problem <jcharth@gmail.com>
Re: perl and sendmail speed problem <anfi@onet.eu>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Wed, 10 Feb 2010 12:00:09 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: can I get a new httpd.conf file
Message-Id: <9nja47-m5u1.ln1@osiris.mauzo.dyndns.org>
Quoth John Bokma <john@castleamber.com>:
>
> Funny you should ask. You consider hiding a URL behind tinyurl.com not
> an insult?
Um, what? URL-shortening services are very useful on Usenet, given the
limited line length and the fact that line-broken URLs can't be
identified by some clients (or most if they've been quoted). In what way
is this insulting?
> I get the feeling that you're trolling more than anything else.
Now this, OTOH...
Ben
------------------------------
Date: Wed, 10 Feb 2010 10:10:44 -0600
From: John Bokma <john@castleamber.com>
Subject: Re: can I get a new httpd.conf file
Message-Id: <87zl3hqdff.fsf@castleamber.com>
Ben Morrow <ben@morrow.me.uk> writes:
> Quoth John Bokma <john@castleamber.com>:
>>
>> Funny you should ask. You consider hiding a URL behind tinyurl.com not
>> an insult?
>
> Um, what? URL-shortening services are very useful on Usenet, given the
> limited line length and the fact that line-broken URLs can't be
> identified by some clients (or most if they've been quoted). In what way
> is this insulting?
Maybe you should reread the thread?
Anyway, it doesn't matter; Tad agrees with me (for which my thanks).
--
John Bokma j3b
Hacking & Hiking in Mexico - http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development
------------------------------
Date: Wed, 10 Feb 2010 10:19:57 -0600
From: John Bokma <john@castleamber.com>
Subject: Re: can I get a new httpd.conf file
Message-Id: <87vde5qd02.fsf@castleamber.com>
RedGrittyBrick <RedGrittyBrick@spamweary.invalid> writes:
> On 09/02/2010 16:41, John Bokma wrote:
>>
>> *But* 5+ regulars posting all the same knee jerking message is also
>> annoying.
>
> John, I do think you're over-reacting in this instance. I only count
> four responses to the OP, maybe the fifth didn't reach my news
> server. I'm sure you are aware of the propagation delays that can
> cause several people to respond before other responses become visible
> to them.
Hence my request to take a breath if you see an easy target. As for the
wording of those replies, IMO they could be a bit friendlier. Or maybe
that's just me.
Anyway, thanks for (still) reading me. Like I wrote earlier (and several
times in the past), this is one of the groups I consider hostile
compared to other groups I am subscribed to. And it's not because this
is a "techie" group, since I am subscribed to several of them. I've
subscribed to this group many times, and after some time I
unsubscribed. One of the major reasons was the hostile behavior (to which I
certainly have contributed in the past, hopefully way less now). Maybe
hostile, knee jerk, etc. are too strong words, but English is my second
language, and maybe that's showing here.
I understand that nobody is paid here for running a help desk (heh,
didn't that discussion run here ages ago ;-) ), and that people who come
here should at least be somewhat prepared. But I've made my newbie posts
as well, back in the day, and I can't recall unfriendly replies back
then; maybe I've just forgotten.
Maybe this thread didn't justify my reaction, but I am sure there are
threads here that do.
Thanks,
--
John Bokma j3b
Hacking & Hiking in Mexico - http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development
------------------------------
Date: Wed, 10 Feb 2010 12:53:58 -0500
From: Steve C <smallpond@juno.com>
Subject: Re: can I get a new httpd.conf file
Message-Id: <hkurs3$hpm$1@news.eternal-september.org>
Myron wrote:
>>>> Somehow those posts attract 5+ regulars all stating nearly the same
>>>> message, and causing IMO noise that could've been prevented.
> this reminds me of a story:
> There was this guy who went into a grocery store and he asks for a
> mixer. The guy there says "That building across the street has them,
> you can look there." So the guy heads out of the store. But before he
> gets to the door, two guys, with oversize pants hanging off their butt
> and ball caps on backwards, walk up on each side of him and ask "What
> are you doing here? Don't you know this is a grocery store? You know
> that this is a grocery store, why are you bothering us about a mixer?
> Mixers aren't groceries. DUH."
>
> I am sure you wouldn't do this in a grocery store, you shouldn't do it
> here either.
>
> I would like to thank RedGrittyBrick again for directing me to the
> correct forum. Also, I would like an apology from Tad McClellan for
> asserting that I was rude by asking a question in ignorance, and
> replying to his question.
>
> I am writing this in an effort to keep this group healthy so it doesn't
> go down in flames like many others. I haven't participated in
> newsgroups for about 10 years. I am using a Perl script and will edit
> it in the future, and it would be nice to have someplace I could go
> for help.
The story I was thinking of is the one about the cop who sees a drunk on
his hands and knees under a streetlight.
"What are you doing?"
"I lost my car keys over there."
"Well, why are you looking here?"
"The light's better"
------------------------------
Date: Wed, 10 Feb 2010 06:36:22 -0800 (PST)
From: ccc31807 <cartercc@gmail.com>
Subject: comparing lists
Message-Id: <99b9b51f-81f6-4fc1-9a28-cb2449ebc529@v25g2000yqk.googlegroups.com>
A normal task: sorting a large data file by some criterion, breaking
it into sub-files, and sending each sub-file to a particular client
based on the criterion.
Over the next several weeks, I've been tasked with taking three data
files, comparing the keys of each file, and if the keys are identical,
processing the file but if not, printing out a list of differences,
which in effect means printing out the different keys. The keys are
all seven digit integers. (Each file is to be generated by a different
query of the same database.)
Okay, I could use diff for this, but I'd like to do it
programmatically. Using brute force, I could generate three files with
just the keys and compare them line by line, but I'd prefer not to do
this for several reasons, mostly because the data files are pretty
much guaranteed to be identical and we don't expect there to be any
differences.
I'm thinking about hashing the keys in the three files and comparing
the key digests, with the assumption that identical digests mean
identical files.
Ideas?
Thanks, CC.
------------------------------
Date: Wed, 10 Feb 2010 06:59:01 -0800
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: comparing lists
Message-Id: <q5i5n5pkch9vjti9j7hhmfuj5ujp4vjp9m@4ax.com>
ccc31807 <cartercc@gmail.com> wrote:
>During the next several weeks, I've been tasked with taking three data
>files, comparing the keys of each file, and if the keys are identical,
>processing the file but if not, printing out a list of differences,
>which in effect means printing out the different keys. The keys are
>all seven digit integers. (Each file is to be generated by a different
>query of the same database.)
[...]
>I'm thinking about hashing the keys in the three files and comparing
>the key digests, with the assumption that identical hashes means
>identical files.
Seems rather simple and straightforward. Read the keys from each
file into a hash (be careful to treat them as strings, such that you
don't run into potential int overflow problems), then compare the hashes
as described in "perldoc -q intersection".
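A rough, untested sketch of the idea (the file names and the
one-key-per-line format are assumptions, and it uses a presence
bitmask per file rather than the FAQ's two-hash loop):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @files = qw(query1.txt query2.txt query3.txt);
    my %seen;    # key => bitmask of the files it appears in

    for my $i (0 .. $#files) {
        open my $fh, '<', $files[$i] or die "open $files[$i]: $!";
        while (my $key = <$fh>) {
            chomp $key;
            $seen{$key} |= 1 << $i;    # keys stay strings throughout
        }
        close $fh;
    }

    my $all  = (1 << @files) - 1;      # 0b111: present in all three
    my @diff = grep { $seen{$_} != $all } keys %seen;

    if (@diff) {
        print "$_ missing from some file\n" for sort @diff;
    } else {
        print "all key sets identical\n";
    }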
jue
------------------------------
Date: Wed, 10 Feb 2010 14:39:26 GMT
From: hymie@lactose.homelinux.net (hymie!)
Subject: Re: help with big numbers and DBI
Message-Id: <yyzcn.35440$3W2.28131@newsfe14.iad>
In our last episode, the evil Dr. Lacto had captured our hero,
RedGrittyBrick <RedGrittyBrick@spamweary.invalid>, who said:
>On 09/02/2010 13:47, hymie! wrote:
>> I have an MSSQL database. I connect to it with SQL Server, run a
>> query "select content_id from log" with a "where" clause I'm not
>> allowed to broadcast, and I get these results:
>>
>> 1037479785592177191
>> 1037222160396202204
>> 1036993281177442875
>> 1037489555080390716
>> 1037253823299245752
>Transforming your Perl question into an MS-SQL question, can you do
>something like
> select cast(content_id as varchar(19)) as content_id from ...
Not being an MS-SQL guru in the slightest, I really appreciate this idea.
It appears to have solved my problem.
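For the archives, the working query ends up looking roughly like this
from DBI (the DSN and credentials below are placeholders):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:ODBC:MyDSN', 'user', 'pass',
                           { RaiseError => 1 });

    # The cast happens server-side, so the ids arrive as strings and
    # never touch the client's native integers.
    my $sth = $dbh->prepare(
        'select cast(content_id as varchar(19)) as content_id from log'
    );
    $sth->execute;
    while (my ($id) = $sth->fetchrow_array) {
        print "$id\n";
    }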
--hymie! http://lactose.homelinux.net/~hymie hymie@lactose.homelinux.net
-------------------------------------------------------------------------------
------------------------------
Date: Wed, 10 Feb 2010 12:17:04 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: How to do variable-width look-behind?
Message-Id: <0nka47-m5u1.ln1@osiris.mauzo.dyndns.org>
Quoth "jl_post@hotmail.com" <jl_post@hotmail.com>:
>
> I have a Perl script that processes multi-line input. The problem
> is, sometimes this input has newlines stuck in arbitrary places (such
> as right in the middle of a valid token). This makes the input out-of-
> spec, but I have no control over this, so I want to correct it if I
can. What's more, sometimes this newline breaks a token in two,
> where the first half still looks like a valid token while the other
> does not, and vice-versa.
>
> I'm trying to modify my Perl script so that it reviews every
> newline and sees if it should be discarded. The logic I want to use is
> to throw out every newline UNLESS it is flanked (on both sides) by
> valid tokens. I would like to be able to do something like this:
>
> # Create a regular expression that matches tokens
> # like "N50E40", "N50 E40", "N5000 E4000",
> # "50N40E", "50N 40E", and "5000N4000E":
> my $tokenRegExp = qr/\b(?:[NS]\d+\s*[EW]\d+|\d+[NS]\s*\d+[EW])\b/;
>
> # Remove newlines that are not surrounded by valid tokens:
> $input =~ s/(?<!$tokenRegExp)\n(?=$tokenRegExp)//g; # no token before
> $input =~ s/(?<=$tokenRegExp)\n(?!$tokenRegExp)//g; # no token after
> $input =~ s/(?<!$tokenRegExp)\n(?!$tokenRegExp)//g; # no tokens
>
> The problem is that the look-behind assertions (both positive
> and negative) only work for fixed-width expressions, according to
> "perldoc perlre". Unfortunately, it would be so useful for me to be
> able to match a string with a variable look-behind, that I'm hoping
> there's a logical work-around to this limitation.
>
> Is there any way for me to work around this limitation?
In this case I would recommend adding "\n" as a token, and when you find
you've just parsed TOKEN "\n" TOKEN go back and join those two tokens
together before processing them further. You will need to turn the "\n"
token back into whitespace (or whatever it should usually be treated
as) afterwards, of course.
Alternatively, if there is a character that is guaranteed not to appear
in your input ("\0" might be a good choice, or if your data is already
Unicode or you can afford the performance hit of upgrading everything
you could use one of the Unicode private-use characters) you could use
the usual solution for positive look-behind:
s/($tokenRegExp)\n(?=$tokenRegExp)/$1\0/g;
s/\n//g;
s/\0/\n/g;
If you have 5.10 or Regexp::Keep this can be simplified to
s/$tokenRegExp\K\n(?=$tokenRegExp)/\0/g;
s/\n//g;
s/\0/\n/g;
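A quick untested illustration of the second approach, with made-up
input:

    use strict;
    use warnings;

    my $tokenRegExp = qr/\b(?:[NS]\d+\s*[EW]\d+|\d+[NS]\s*\d+[EW])\b/;
    my $input = "N50E40\nN5000 E4000\n50N\n40E\n";

    # Protect newlines flanked by tokens, delete the rest, restore.
    $input =~ s/($tokenRegExp)\n(?=$tokenRegExp)/$1\0/g;
    $input =~ s/\n//g;
    $input =~ s/\0/\n/g;

    # Prints N50E40, N5000 E4000 and 50N40E on three lines: the
    # newline splitting "50N"/"40E" is gone, the others survive.
    print "$input\n";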
Ben
------------------------------
Date: Wed, 10 Feb 2010 09:36:25 -0800
From: Jim Gibson <jimsgibson@gmail.com>
Subject: Re: look up very large table
Message-Id: <100220100936252117%jimsgibson@gmail.com>
In article <hku3e0$3fs$1@ijustice.itsc.cuhk.edu.hk>, ela
<ela@yantai.org> wrote:
> I have some large data in pieces, e.g.
>
> asia.gz.tar 300M
>
> or
>
> roads1.gz.tar 100M
> roads2.gz.tar 100M
> roads3.gz.tar 100M
> roads4.gz.tar 100M
>
> I wonder whether I should concatenate them all into a single ultra large
> file and then perform parsing them into a large table (I don't know whether
> perl can handle that...).
There is no benefit that I can see to concatenating the files. Use the
File::Find module to find all files with a certain naming convention,
read each one, and process the information in each file. As for the
amount of information that Perl can handle, that is mostly determined
by the available memory and how smart you are at condensing the data,
keeping only what you need and throwing away what you don't.
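A skeleton of that, untested, with a made-up directory and naming
pattern:

    use strict;
    use warnings;
    use File::Find;

    find(sub {
        return unless /^roads\d+\.gz\.tar$/;
        print "processing $File::Find::name\n";
        # open and parse the file here, keeping only what you need
    }, '/data/maps');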
>
> The final table should look like this:
>
> ID1 ID2 INFO
> X1 Y9 san diego; california; West Coast; America; North America; Earth
> X2.3 H9 Beijing; China; Asia
Perl does not have tables. It has arrays and hashes. You can nest
arrays and hashes to store complex datasets in memory by using
references.
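For example, the sample table above might be stored like this (the
field names are my own invention):

    my %table = (
        'X1' => {
            id2  => 'Y9',
            info => 'san diego; california; West Coast; America; '
                  . 'North America; Earth',
        },
        'X2.3' => {
            id2  => 'H9',
            info => 'Beijing; China; Asia',
        },
    );

    print "$table{'X2.3'}{id2}: $table{'X2.3'}{info}\n";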
> ....
>
> each row may come from a big file of >100M (as aforementioned):
>
> CITY Beijing
> NOTE Capital
> RACE Chinese
> ...
>
> And then I have another much smaller table which contains all the ID's
> (either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
> this 20M file annotated with the INFO. Hashing seems not to be a solution
> for my 32G, 8-core machine...
>
> Any advice? or should i resort to some other languages?
Try reading all the files and saving the data you want. If you run out
of memory, then think about a different approach. 32GB of memory is
quite a lot.
If you can't fit all of your data into memory at one time, you might
consider using a database that will store your data in files. Perl has
support for many databases. But I would first determine whether or not
you can fit everything in memory.
--
Jim Gibson
------------------------------
Date: Wed, 10 Feb 2010 10:06:34 -0800 (PST)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: look up very large table
Message-Id: <29354f4f-f881-4f68-8fd0-410e8c745e8a@q16g2000yqq.googlegroups.com>
On Feb 10, 5:57 am, "ela" <e...@yantai.org> wrote:
> Any advice? or should i resort to some other languages?
Perl is probably your best bet for this task.
> I have some large data in pieces, e.g.
> asia.gz.tar 300M
> or
> roads1.gz.tar 100M
It might be helpful for you to give a sample of your data format. You
don't mention untarring and unzipping your file, so I assume that you
are dealing with ASCII text. If not, then some of the following might
not work well.
> I wonder whether I should concatenate them all into a single ultra large
> file and then perform parsing them into a large table (I don't know whether
> perl can handle that...).
Irrelevant question. Ordinarily you process files one line at a time,
so it doesn't make any difference how large a particular file is, as
long as each line can be manipulated. In cases where I have to deal
with a number of files, I find it easier to glob the files, or open
and read a directory, to automate the process of opening, reading, and
closing a number of files. You might gain something in particular
cases by combining files, but I don't see any general advantage in
doing so.
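For instance (untested, and the glob pattern is made up):

    use strict;
    use warnings;

    for my $file (glob '/data/roads*.txt') {
        open my $fh, '<', $file or die "open $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            # munge $line here
        }
        close $fh;
    }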
> each row may come from a big file of >100M (as aforementioned):
> CITY    Beijing
> NOTE    Capital
> RACE    Chinese
> ...
Typical data munging. Depending on whether you have duplicates, I
would probably build a hash and write the hash to your output file.
You then have the ability to sort on different fields, e.g., cities,
notes, races, etc.
> And then I have another much smaller table which contains all the ID's
> (either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
> this 20M file annotated with the INFO. Hashing seems not to be a solution
> for my 32G, 8-core machine...
Hashing is ideal, provided you can link the two files by a common
record. The general technique is to open the ID file first, build a
hash of record IDs, then open your data file and populate the hash
records with data according to the common record. Then, open your
output file and print to it.
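An untested sketch of that technique; the file names and the
tab-separated layout are assumptions:

    use strict;
    use warnings;

    # Pass 1: build a hash keyed on the IDs.
    my %info;
    open my $ids, '<', 'ids.txt' or die "open ids.txt: $!";
    while (my $id = <$ids>) {
        chomp $id;
        $info{$id} = '';            # placeholder until annotated
    }
    close $ids;

    # Pass 2: stream the big file, annotating only IDs we know.
    open my $data, '<', 'data.txt' or die "open data.txt: $!";
    while (my $line = <$data>) {
        chomp $line;
        my ($id, $rest) = split /\t/, $line, 2;
        $info{$id} .= "$rest; " if exists $info{$id};
    }
    close $data;

    # Pass 3: print the annotated records.
    open my $out, '>', 'annotated.txt'
        or die "open annotated.txt: $!";
    print {$out} "$_\t$info{$_}\n" for sort keys %info;
    close $out;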
If you will use the data frequently, you might want to stuff the data
into a database so you can query it conveniently.
If you want help, please be sure to furnish both sample data from each
file and your attempts at writing the script.
CC.
------------------------------
Date: Wed, 10 Feb 2010 12:58:45 -0600
From: John Bokma <john@castleamber.com>
Subject: Re: look up very large table
Message-Id: <87wryk29zu.fsf@castleamber.com>
"ela" <ela@yantai.org> writes:
> I have some large data in pieces, e.g.
>
> asia.gz.tar 300M
>
> or
>
> roads1.gz.tar 100M
> roads2.gz.tar 100M
> roads3.gz.tar 100M
> roads4.gz.tar 100M
>
> I wonder whether I should concatenate them all into a single ultra large
> file and then perform parsing them into a large table (I don't know whether
> perl can handle that...).
>
> The final table should look like this:
>
> ID1 ID2 INFO
> X1 Y9 san diego; california; West Coast; America; North America; Earth
> X2.3 H9 Beijing; China; Asia
> ....
>
> each row may come from a big file of >100M (as aforementioned):
>
> CITY Beijing
> NOTE Capital
> RACE Chinese
> ...
>
> And then I have another much smaller table which contains all the ID's
> (either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
> this 20M file annotated with the INFO. Hashing seems not to be a solution
> for my 32G, 8-core machine...
>
> Any advice? or should i resort to some other languages?
How about importing all your data into a database, and using SQL to
extract what you want? Depending on the format of your input files some
parsing might be required which can be done with a small Perl
program.
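For example, with SQLite via DBD::SQLite (untested; the schema and
file names are only an illustration):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=geo.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE IF NOT EXISTS info
              (id1 TEXT, id2 TEXT, info TEXT)');

    my $ins = $dbh->prepare('INSERT INTO info VALUES (?, ?, ?)');
    open my $fh, '<', 'parsed.tsv' or die "open parsed.tsv: $!";
    while (my $line = <$fh>) {
        chomp $line;
        $ins->execute(split /\t/, $line, 3);
    }
    $dbh->commit;

After that, the join is one SELECT instead of a memory problem.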
--
John Bokma j3b
Hacking & Hiking in Mexico - http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development
------------------------------
Date: Wed, 10 Feb 2010 11:07:35 -0800
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: look up very large table
Message-Id: <9906n5h6ire6mtq1h73j2rctim23e85f8t@4ax.com>
"ela" <ela@yantai.org> wrote:
>I have some large data in pieces, e.g.
>
>asia.gz.tar 300M
>
>or
>
>roads1.gz.tar 100M
>roads2.gz.tar 100M
>roads3.gz.tar 100M
>roads4.gz.tar 100M
>
>I wonder whether I should concatenate them all into a single ultra large
>file
I may be mistaken, but isn't that a prerequisite to actually extracting
any data from a compressed (.gz) file?
>and then perform parsing them into a large table (I don't know whether
>perl can handle that...).
The hardware is the limit.
>The final table should look like this:
>
>ID1 ID2 INFO
>X1 Y9 san diego; california; West Coast; America; North America; Earth
>X2.3 H9 Beijing; China; Asia
>....
>
>each row may come from a big file of >100M (as aforementioned):
>
>CITY Beijing
>NOTE Capital
>RACE Chinese
>...
>
>And then I have another much smaller table which contains all the ID's
>(either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
>this 20M file annotated with the INFO. Hashing seems not to be a solution
>for my 32G, 8-core machine...
Depends. It's easy enough to do, so you can just try it and see if it works.
>Any advice? or should i resort to some other languages?
If anything, you are hardware-limited. Eventually the system will begin
swapping, and that will happen in any language if you try to keep too
much data in RAM.
If that happens you will have to revert to time-proven techniques from
the dark ages: trade HD space and time for RAM by keeping only one set
of data in RAM and annotate that set while processing the second set of
data from the HD line by line.
However the real solution would be to load the whole enchilada into a
database and then do whatever join you want. There is a reason why
database systems have been created and optimized for exactly such tasks.
jue
------------------------------
Date: Wed, 10 Feb 2010 12:18:38 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Math not working
Message-Id: <upka47-m5u1.ln1@osiris.mauzo.dyndns.org>
Quoth Ilya Zakharevich <nospam-abuse@ilyaz.org>:
> On 2010-02-09, Ben Morrow <ben@morrow.me.uk> wrote:
> >> > Would you agree with me that new overload types
> >> > *must* default to falling back
>
> >> How would "new" types be different from the "old" ones? The problem
> >> existed back then; what changed?
>
> > What changed is that there are now published classes that use some
> > overloading, don't specify fallback, and don't overload the new type.
> > Take for example the new "qr" overload. Under 5.10 and earlier, treating
> > an object as a regex would invoke the stringify overload, so 5.12 must
> > continue to do so for objects that don't have a qr overload *even* if
> > fallback was not requested.
>
> Hmm, I deduce that under "overload types" you meant "overloaded
> operation"? If, yes, of course...
Yes, sorry, it was a poor choice of word. 'Type of overload' rather than
'type' as in 'class'.
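For the record, the kind of class at issue looks something like this
made-up example:

    package Pattern;
    use overload '""' => sub { $_[0]{pat} };
    # stringify only: no fallback requested, no qr overload
    sub new { my ($class, $pat) = @_; bless { pat => $pat }, $class }

    package main;
    my $p = Pattern->new('fo+');
    # On 5.10 this interpolation invokes the stringify overload, so
    # 5.12's qr overload has to keep doing that for classes which
    # never specify fallback or a qr handler:
    print "matched\n" if 'food' =~ /$p/;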
Ben
------------------------------
Date: Wed, 10 Feb 2010 07:00:26 -0800 (PST)
From: joe <jcharth@gmail.com>
Subject: Re: perl and sendmail speed problem
Message-Id: <af273880-ccfb-4937-ade7-b43d7193a011@g27g2000yqh.googlegroups.com>
I think it has to be done individually. It is an email blast script for
our subscribers. I switched to Net::SMTP and I don't know if it will be
fast enough. If it is too slow, I will try using a thread for every other
message.
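(For reference, a bare-bones Net::SMTP send looks roughly like this;
the host and addresses are placeholders. One connection can be reused
for many messages, which is usually where the speed goes.)

    use strict;
    use warnings;
    use Net::SMTP;

    my $smtp = Net::SMTP->new('mail.example.com', Timeout => 30)
        or die "cannot connect to SMTP server";

    $smtp->mail('list@example.com');
    $smtp->to('subscriber@example.com');
    $smtp->data();
    $smtp->datasend("To: subscriber\@example.com\n");
    $smtp->datasend("Subject: newsletter\n\n");
    $smtp->datasend("Hello!\n");
    $smtp->dataend();
    $smtp->quit;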
------------------------------
Date: Wed, 10 Feb 2010 16:17:50 +0100
From: Andrzej Adam Filip <anfi@onet.eu>
Subject: Re: perl and sendmail speed problem
Message-Id: <h97ajsc4bj-A2A@john.huge.strangled.net>
joe <jcharth@gmail.com> wrote:
> I think it has to be done individually. It is an email blast script for
> our subscribers. I switched to Net::SMTP and I don't know if it will be
> fast enough. If it is too slow, I will try using a thread for every other
> message.
If your perl supports threads, then redirect your question to
news:comp.mail.sendmail to get "hints" on how to configure sendmail
for "sky is the limit" performance :-)
In short: ask how to reconfigure sendmail to allow your script to control
the number of sendmail processes attempting deliveries "at once"
[ one sendmail process per "perl thread"/"smtp session" ].
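Without threads, plain fork can drive several sendmail processes at
once; a rough, untested sketch (addresses and the limit are made up):

    use strict;
    use warnings;

    my @recipients = ('a@example.com', 'b@example.com');
    my $max_kids   = 8;     # match sendmail's allowed concurrency
    my $kids       = 0;

    for my $rcpt (@recipients) {
        if ($kids >= $max_kids) { wait; $kids-- }
        defined(my $pid = fork) or die "fork: $!";
        if ($pid == 0) {    # child: one sendmail per delivery
            open my $mail, '|-', '/usr/sbin/sendmail', '-oi', $rcpt
                or die "sendmail: $!";
            print {$mail} "To: $rcpt\nSubject: test\n\nhello\n";
            close $mail or die "sendmail exited with $?";
            exit 0;
        }
        $kids++;
    }
    wait, $kids-- while $kids > 0;    # reap the rest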
--
[pl>en Andrew] Andrzej Adam Filip : anfi@onet.eu : Andrzej.Filip@gmail.com
"Why are we importing all these highbrow plays like `Amadeus'? I could
have told you Mozart was a jerk for nothing."
-- Ian Shoales
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 2813
***************************************