[29104] in Perl-Users-Digest
Perl-Users Digest, Issue: 348 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Mon Apr 16 11:14:16 2007
Date: Mon, 16 Apr 2007 08:14:10 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Mon, 16 Apr 2007 Volume: 11 Number: 348
Today's topics:
Re: looking for some size optimization (Marc Espie)
Re: looking for some size optimization (Marc Espie)
Re: looking for some size optimization <uri@stemsystems.com>
Re: perl.h seems to interfere with fopen or stdio.h <nospam-abuse@ilyaz.org>
Re: perl.h seems to interfere with fopen or stdio.h <wahab-mail@gmx.de>
Re: Top Turds of comp.lang.perl.misc (2007) <tadmc@augustmail.com>
Re: Top Turds of comp.lang.perl.misc (2007) <tadmc@augustmail.com>
Re: Top Turds of comp.lang.perl.misc (2007) <cwilbur@chromatico.net>
Re: Top Turds of comp.lang.perl.misc (2007) <cwilbur@chromatico.net>
Re: Top Turds of comp.lang.perl.misc (2007) <cwilbur@chromatico.net>
UTF16 input file to ISO-8859-1 output <rwood@therandymon.com>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Mon, 16 Apr 2007 12:28:08 +0000 (UTC)
From: espie@lain.home (Marc Espie)
Subject: Re: looking for some size optimization
Message-Id: <evvq4o$5ua$2@biggoron.nerim.net>
In article <58fhdeF2h00s8U1@mid.dfncis.de>,
<anno4000@radom.zrz.tu-berlin.de> wrote:
>Marc Espie <espie@nerim.net> wrote in comp.lang.perl.misc:
>> I'm looking at a script that handles a huge amount of data... basically,
>> the filenames from +4000 packages in order to recognize conflicts.
>>
>> Currently, it builds a big hash through a loop that constructs
>> $all_conflict like this:
>>
>> my $file= File::Spec->canonpath($self->fullname());
>> push ${$all_conflict->{$file}}, $pkgname;
>>
>>
>> I end up with a hash of 250M.
>
>No, you end up with a syntax error, unless the second code line is
>really
>
> push @{$all_conflict->{$file}}, $pkgname;
>
>I'll assume that, but please don't re-type code, copy/paste it to
>make sure such errors don't happen.
Oops, sorry about that. I usually copy&paste code indeed, and I could
have sworn I did.
>> I expect the $pkgname strings to be shared. In fact, I tried replacing
>
>What makes you expect that? You're pushing unrelated strings on
>arrays.
They're not necessarily unrelated.
I guess I'll have to share more code ;-) . I grab the full list
for each package, and populate the hash with that string like so:
$plist->visit('register', $filehash, $dephash, $plist->pkgname());
(not quoting my plist and visitor code, I assume you can figure out that
it will end up calling the above snippet for each item in the packing-list)
so I am certain that I call the registration routine with an identical
string for each item in a given package. I haven't looked too closely at
perl's internals, but I would have assumed it to share the string in such
a case ?
>The best storage strategy often depends on the nature of the data.
>In your case, I'll assume that actual conflicts are rare. That means
>the majority of your arrays of package names contain only one element.
>That is wasteful. You could use two hashes, one to detect if a file
>name has been seen before and another to keep information about actual
>conflicts. Here is some code. I assume that pairs of package names
>and file names can be read from DATA:
>
>    my ( %seen, %conflicts);
>    while ( <DATA> ) {
>        my ( $package, $file) = split;
>        if ( exists $seen{ $file} ) {
>            $conflicts{ $file} ||= [ $seen{ $file}]; # transfer first package
>            push @{ $conflicts{ $file} }, $package;  # add another package
>        } else {
>            $seen{ $file} = $package;                # mark as seen
>        }
>    }
This looks like a good idea indeed.
I'm wondering if maybe I should build a cache of pkgname lists,
and try to share as many as I can...
anyways, I'll try your idea (and the corresponding one) and tell you whether
I see any improvement.
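
A self-contained version of the two-hash idea quoted above, as a minimal sketch. The package and file names are made up for illustration, and the input is an in-memory list rather than DATA:

```perl
use strict;
use warnings;

# Hypothetical input: "package file" pairs, one per line.
my @lines = (
    "foo-1.0 /usr/bin/a",
    "foo-1.0 /usr/bin/b",
    "bar-2.0 /usr/bin/b",
    "baz-3.0 /usr/bin/b",
    "bar-2.0 /usr/bin/c",
);

my ( %seen, %conflicts );
for (@lines) {
    my ( $package, $file ) = split;
    if ( exists $seen{$file} ) {
        $conflicts{$file} ||= [ $seen{$file} ];   # transfer first package
        push @{ $conflicts{$file} }, $package;    # add another package
    }
    else {
        $seen{$file} = $package;                  # plain string, no array
    }
}

# Only files claimed by more than one package end up in %conflicts.
for my $file ( sort keys %conflicts ) {
    print "$file: @{ $conflicts{$file} }\n";
}
```

The saving comes from the common case: a file owned by a single package costs one hash entry holding a string, and an array is allocated only once a second package claims the same file.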
While I'm there, responding to the other persons: sorry, my first message
wasn't clear. I do not want to go off to disk to a dbm or a database.
As it stands, the program works. It's just that 250M is over the default
ulimit of the considered system, which means that people have to remember
to bump the limits before running it. There's also the fact that the current
set of packages is bound to grow, so if I can find a good idea to reduce
memory usage, that would be cool. But using external storage is not the
solution I'm looking for.
------------------------------
Date: Mon, 16 Apr 2007 13:43:33 +0000 (UTC)
From: espie@lain.home (Marc Espie)
Subject: Re: looking for some size optimization
Message-Id: <evvui5$83d$1@biggoron.nerim.net>
In article <evvq4o$5ua$2@biggoron.nerim.net>,
Marc Espie <espie@nerim.net> wrote:
>In article <58fhdeF2h00s8U1@mid.dfncis.de>,
> <anno4000@radom.zrz.tu-berlin.de> wrote:
>>    my ( %seen, %conflicts);
>>    while ( <DATA> ) {
>>        my ( $package, $file) = split;
>>        if ( exists $seen{ $file} ) {
>>            $conflicts{ $file} ||= [ $seen{ $file}]; # transfer first package
>>            push @{ $conflicts{ $file} }, $package;  # add another package
>>        } else {
>>            $seen{ $file} = $package;                # mark as seen
>>        }
>>    }
>
>This looks like a good idea indeed.
>I'm wondering if maybe I should build a cache of pkgname lists,
>and try to share as many as I can...
Thanks for the insight. With a few lines of additional code, this does shrink
the process footprint from >250M to 190M. Very nice indeed.
Doesn't appear to be slower, the computation is IO-bound anyways (reading
packing-lists for gzip'ed archives).
The code now looks like:
    my $file= File::Spec->canonpath($self->fullname());
    if (exists $all_conflict->{$file}) {
        $list->{$all_conflict->{$file}}->{$pkgname} ||=
            [@{$all_conflict->{$file}}, $pkgname ];
        $all_conflict->{$file} = $list->{$all_conflict->{$file}}->{$pkgname};
    } else {
        $list->{$pkgname} ||= [$pkgname];
        $all_conflict->{$file} = $list->{$pkgname};
    }
I may try to tweak it a bit further, but it's already a very nice improvement.
(if somebody has other ideas, I'm still game ;-) )
Anyways, thanks a lot ;-)
(that's the ports/infrastructure/packages/find-all-conflicts script used in
OpenBSD, btw)
I think I'm going to use an extra seen array as well... I need to do some
measurements, but I suspect most of my files don't appear in more than
one package, so scanning the conflicts hash later probably can be sped up
by a large amount.
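
The shared-list caching described in this post can be tried in isolation with the sketch below. It is only an illustration under assumptions: the sample pairs are invented, plain hashes stand in for the $all_conflict and $list references, and nothing here is taken from the actual find-all-conflicts script.

```perl
use strict;
use warnings;

# Hypothetical sample data: [package, file] pairs.
my @pairs = (
    [ 'foo-1.0', '/usr/bin/a' ],
    [ 'foo-1.0', '/usr/bin/b' ],
    [ 'bar-2.0', '/usr/bin/b' ],
    [ 'bar-2.0', '/usr/bin/c' ],
    [ 'foo-1.0', '/usr/bin/d' ],
);

my %all_conflict;   # file => array ref of package names (shared where possible)
my %list;           # cache: pkgname => [pkgname]
                    #        "ARRAY(0x...)" => { pkgname => extended list }

for my $pair (@pairs) {
    my ( $pkgname, $file ) = @$pair;
    if ( exists $all_conflict{$file} ) {
        my $old = $all_conflict{$file};
        # The old list's stringified reference serves as a cache key, so
        # every file going from the same list to "list + pkgname" gets the
        # same new array instead of a private copy.
        $all_conflict{$file} = $list{$old}{$pkgname} ||= [ @$old, $pkgname ];
    }
    else {
        # All files seen only in this package share one single-element list.
        $all_conflict{$file} = $list{$pkgname} ||= [$pkgname];
    }
}

print "$_: @{ $all_conflict{$_} }\n" for sort keys %all_conflict;
```

Using a stringified reference as a key is safe here only because %list keeps every cached array alive, so an address is never reused for a different list.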
------------------------------
Date: Mon, 16 Apr 2007 10:51:37 -0400
From: Uri Guttman <uri@stemsystems.com>
Subject: Re: looking for some size optimization
Message-Id: <x7wt0c5q4m.fsf@mail.sysarch.com>
>>>>> "ME" == Marc Espie <espie@lain.home> writes:
ME> my $file= File::Spec->canonpath($self->fullname());
ME> if (exists $all_conflict->{$file}) {
ME> $list->{$all_conflict->{$file}}->{$pkgname} ||=
ME> [@{$all_conflict->{$file}}, $pkgname ];
that is slow as you copy the existing array back into another anon
array. and you have a lot of code redundancy all over this. you always
are assigning an anon array but autovivification will handle that for
you. put this before the if (and i am not even sure you need a
conditional there at all but i haven't followed the logic flow)
    push( @{$all_conflict->{$file}}, $pkgname );
that will replace the first line in both clauses and be much faster as
well (no wasteful extra copying). it might even save space as perl won't
be allocating and freeing as many buffers so less storage could be used.
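
A minimal sketch of the autovivification being described, with made-up file and package names. The hash starts empty and each array springs into existence on the first push:

```perl
use strict;
use warnings;

my $all_conflict = {};   # starts out empty; no arrays are pre-created

# Hypothetical [file, package] pairs.
my @pairs = (
    [ '/usr/bin/a', 'foo-1.0' ],
    [ '/usr/bin/a', 'bar-2.0' ],
    [ '/usr/bin/b', 'bar-2.0' ],
);

for my $pair (@pairs) {
    my ( $file, $pkgname ) = @$pair;
    # Dereferencing an undefined hash slot as an array in lvalue context
    # makes perl create the anonymous array, even under "use strict".
    push @{ $all_conflict->{$file} }, $pkgname;
}

print "$_: @{ $all_conflict->{$_} }\n" for sort keys %$all_conflict;
```

Note that this produces one private array per file, which is the original layout from the start of the thread and not the shared-list layout of the code being quoted.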
ME> $all_conflict->{$file} = $list->{$all_conflict->{$file}}->{$pkgname};
ME> } else {
ME> $list->{$pkgname} ||= [$pkgname];
ME> $all_conflict->{$file} = $list->{$pkgname};
ME> }
as for the conflict hash, i am sure it can be reduced but i don't know
the logic. you change $all_conflict->{$file} after each push (your ||=
code) which makes no sense to me. maybe you should clearly explain the
data structure you want to get out of this. i have yet to see such an
explanation in this thread (or i am not awake yet). i can't see how 4000
entries of maybe a few hundred bytes each will use up 250MB (or even 190).
ME> (that's the ports/infrastructure/packages/find-all-conflicts
ME> script used in OpenBSD, btw)
any url to get that directly? if that is published code then my autoviv
fix will save tons of time for many users. that copy anon arrays to
themselves thing is massively bad code. for more on autovivification see
my article at http://sysarch.com/Perl/autoviv.txt.
knowing that the build code is poorly designed, now i am confident that
the data structure is also poorly designed and can be majorly optimized.
uri
--
Uri Guttman ------ uri@stemsystems.com -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
------------------------------
Date: Mon, 16 Apr 2007 11:30:45 +0000 (UTC)
From: Ilya Zakharevich <nospam-abuse@ilyaz.org>
Subject: Re: perl.h seems to interfere with fopen or stdio.h
Message-Id: <evvmp5$p75$1@agate.berkeley.edu>
[A complimentary Cc of this posting was sent to
Mirco Wahab
<wahab-mail@gmx.de>], who wrote in article <evu9pj$bnd$1@mlucom4.urz.uni-halle.de>:
> > One is expected to not work with StdIO under influence of perl.h -
> > these calls are redirected to PerlIO instead. Use separate
> > compilation units for the stuff which uses StdIO and perl.h.
>
> After reading this, I waded through embed.h and perl.h
> but couldn't find anything related (one spot mentioned
> a broken fflush on Solaris).
>
> I have some programs which read/write through stdio
> under perl.h (which is included *after* stdio) in the
> same compilation unit. (Win32, console program, Visual
> Studio/98 and /2005).
Why do you think they go through stdio? Did you check the names in
the object file? I would expect them to work through PerlIO (unless some
magic #define is done); but this might have been changed in recent
Perls - AFAIU, this was just a protective measure during transition to
PerlIO (to break things coded to a wrong standard as early as
possible).
> Can you help me out here in what exactly is
> the weak point?
I do not know anything about any "weak point". With gcc, one can see
redefinitions via -E -dD.
Hope this helps,
Ilya
------------------------------
Date: Mon, 16 Apr 2007 15:31:44 +0200
From: Mirco Wahab <wahab-mail@gmx.de>
Subject: Re: perl.h seems to interfere with fopen or stdio.h
Message-Id: <evvu8f$qdq$1@mlucom4.urz.uni-halle.de>
Ilya Zakharevich wrote:
> Mirco Wahab wrote in article <evu9pj$bnd$1@mlucom4.urz.uni-halle.de>:
>> I have some programs which read/write through stdio
>> under perl.h (which is included *after* stdio) in the
>> same compilation unit. (Win32, console program, Visual
>> Studio/98 and /2005).
>
> Why do you think they go through stdio? Did you check the names in
> the object file? I would expect them to work through PerlIO (unless some
> magic #define is done); but this might have been changed in recent
> Perls - AFAIU, this was just a protective measure during transition to
> PerlIO (to break things coded to a wrong standard as early as
> possible).
You are right, I missed some #defines at one point,
and checked the application now.
Plain stdio calls like fopen *do go* through the perl
layer under the conditions mentioned above; it is win32_fopen()
from /win32/win32.c that provides the replacement.
But that layer seems rather thin ( #define PERLIO_NOT_STDIO 0 ),
it does some checks and goes straight into something denoted
by [__imp__fopen(123)] (Win32?) then (at least here,
stock 32bit Activeperl/820).
Thanks & Regards
Mirco
------------------------------
Date: Mon, 16 Apr 2007 07:06:11 -0500
From: Tad McClellan <tadmc@augustmail.com>
Subject: Re: Top Turds of comp.lang.perl.misc (2007)
Message-Id: <slrnf26plj.50e.tadmc@tadmc30.august.net>
cartercc@gmail.com <cartercc@gmail.com> wrote:
> On Apr 15, 6:39 pm, Tad McClellan <t...@augustmail.com> wrote:
>> Unless you want to be heard.
Unless what?
The statement is senseless without its context.
> Lot less chance of being heard if you post your pithy two lines after
> quoting 500 lines of someone else's drivel.
That's true, but why do you bring that up?
I did not advocate bottom posting.
> If your post is clear without reference to a prior post, then you
> SHOULD top post.
If it is unrelated to other posts in the thread, then it should
be in a new thread.
> If your post needs a reference to other material for
> clarity, then quote only what you need to.
Exactly so.
> Frankly, I find
> your insistence that everything be bottom posted
I have never insisted that! Can you cite even once where I have?
I have insisted that you post in the accepted manner, quote, trim
and interleave.
> If the consequence of my top posting means that you
> won't be reading my posts, I can live with that.
I expect so, but it isn't just me. It is the way preferred by
everybody here (and nearly everybody elsewhere).
If you can live with everybody not reading your posts, what is
the point of writing them?
> I don't need you
> telling me how to post on usenet.
Your insistence on ignoring the social mores of your audience
has become tedious.
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
------------------------------
Date: Mon, 16 Apr 2007 07:43:08 -0500
From: Tad McClellan <tadmc@augustmail.com>
Subject: Re: Top Turds of comp.lang.perl.misc (2007)
Message-Id: <slrnf26rqs.53d.tadmc@tadmc30.august.net>
cartercc@gmail.com <cartercc@gmail.com> wrote:
> I don't need you
> telling me how to post on usenet.
http://en.wikipedia.org/wiki/When_In_Rome
This is Rome.
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
------------------------------
Date: 16 Apr 2007 10:19:50 -0400
From: Charlton Wilbur <cwilbur@chromatico.net>
Subject: Re: Top Turds of comp.lang.perl.misc (2007)
Message-Id: <87lkgs5rll.fsf@mithril.chromatico.net>
>>>>> "EJ" == Ed Jay <edMbj@aes-intl.com> writes:
EJ> How profoundly rude, indeed! ...
>> I recommend, first, that you pay attention to the Posting
>> Guidelines, and pay attention to how many people who follow the
>> Posting Guidelines get flamed and treated rudely compared to
>> how many people who do not follow them. There's a reason they
>> exist.
EJ> It appears to me that you are using an excuse to substitute
EJ> for a reason why it's OK to be rude. Your point is well made,
EJ> but I still see no good reason for anyone to be rude or
EJ> offensive. Difference of opinion.
Exactly. I see no good reason for anyone to be rude or offensive; the
Posting Guidelines define, at least for the knowledgeable people in
this group, what constitutes rude and offensive behavior.
If you come in here and ask for the documentation to be read to you,
you're being incredibly rude. If you come in here babbling with 'u'
and 'ur' and 'y' and 'ne1' in place of English words, you're being
incredibly rude.
The regulars are just responding in kind, in a form the original
poster is likely to actually recognize as rudeness.
Charlton
--
Charlton Wilbur
cwilbur@chromatico.net
------------------------------
Date: 16 Apr 2007 10:27:08 -0400
From: Charlton Wilbur <cwilbur@chromatico.net>
Subject: Re: Top Turds of comp.lang.perl.misc (2007)
Message-Id: <87hcrg5r9f.fsf@mithril.chromatico.net>
>>>>> "PJH" == Peter J Holzer <hjp-usenet2@hjp.at> writes:
PJH> When I was at the university, there wasn't a single course in
PJH> the curriculum which was touted as a "language course". Of
PJH> course you learned Modula II in the "introduction to
PJH> programming" course and you learned C in the "systems
PJH> programming" course for the simple reason that these were the
PJH> languages which you had to use for the exercises, but that
PJH> wasn't the actual goal of the course, and the choice of
PJH> language certainly wasn't that the language should be in
PJH> wide-spread use in "the industry".
Likewise. In my case, one learned Pascal as the introductory
language, and that was the end of formal language instruction. And
that was principally a matter of convenience, because Pascal was
sufficiently powerful to teach the basic concepts and sufficiently
expressive that a Pascalish pseudocode could be used to express
algorithms and data structures.
There was also a course in programming languages, where the semester
consisted of learning the rudiments of 8 programming languages besides
Pascal and then talking about how the odd features of each programming
language affected things like parsing the language, compiling,
tradeoffs between things that could be determined at compile-time
versus things that could be determined at run-time, early binding
versus late binding, and so on.
And in a few of the courses, there was an admonition that sample code
and library code would be provided in a particular language (for
Artificial Intelligence, Lisp; for Software Engineering, C++; for
Parallel Computation, ML) and that although no particular language was
required in the coursework, students wishing to perform acceptably
well in those courses ought to familiarize themselves with the
language in the first week or so of the semester even if they intended
to work in other languages and environments.
This did not do good things for my resume; I cannot point and say
"Look, I took a college class in C++." On the other hand, within a
week of starting my current employment, I was debugging ColdFusion
errors that people who had had Official Macromedia Training Courses
could not figure out, so it must have done some good.
Charlton
--
Charlton Wilbur
cwilbur@chromatico.net
------------------------------
Date: 16 Apr 2007 10:38:06 -0400
From: Charlton Wilbur <cwilbur@chromatico.net>
Subject: Re: Top Turds of comp.lang.perl.misc (2007)
Message-Id: <87d5245qr5.fsf@mithril.chromatico.net>
>>>>> "MD" == Michele Dondi <bik.mido@tiscalinet.it> writes:
MD> On 15 Apr 2007 07:00:35 -0400, Charlton Wilbur
MD> <cwilbur@chromatico.net> wrote:
>> To call yourself a computer programmer, you don't even need to
>> be able to operate the computer. You can get hired and paid as
>> a computer programmer with no qualification whatsoever, and you
>> can continue in the job regardless of skill.
MD> Well, isn't this the nice part of this world? Seriously, I
MD> think it is.
When I think about recreational and research programming, and the fact
that someone with no credentials but a lot of skill and knowledge can
do well, I agree.
When I think about the past few places I've worked, where the hiring
managers could not discern between a competent programmer and (as near
as I could tell) a potted plant, and the potted plant frequently got
hired instead of the competent programmer, I'm not so sure.
When I was an undergraduate, it was rapidly apparent to me that 2/3 of
the people in the world were incompetent and either unaware of that
fact or too lazy to do anything about it; the other 1/3 of people were
doing three times as much work as they needed to to make up for the
other 2/3, plus a healthy dose of undoing poorly done things. When I
got into the IT world, it seemed like I underestimated by an order of
magnitude. Requiring credentials and certification -- at least
*meaningful* credentials and certification, and there's a whole other
can of worms -- or making software engineers legally and
professionally liable for things they approve of, in the way that
engineers in the physical world are legally and professionally liable,
would go a long way.
>> If you do the wrong thing as a computer scientist when you
>> should have known otherwise, you get promoted to management and
>> given a bigger budget.
MD> $Quote->To('.sig');
Alas, it seems as though my principal mark on the world will be as a
creator of .sig-worthy aphorisms.
Charlton
--
Charlton Wilbur
cwilbur@chromatico.net
------------------------------
Date: Mon, 16 Apr 2007 13:56:37 -0000
From: R Wood <rwood@therandymon.com>
Subject: UTF16 input file to ISO-8859-1 output
Message-Id: <132704lh8mu8aea@corp.supernews.com>
All -
I came up with what I thought would be a fun project - a little Perl script
to take a Vcard input file and parse it into a mutt alias file. For
non-mutt users, a mutt alias file looks like the following:
alias ALIASNAME Firstname Lastname <email@example.com>
and I've standardized on ALIASNAMEs that look like Firstname_Lastname.
So it's just a couple of searches and print statements. It took me a while,
since the last time I programmed anything was Pascal back in college ('93),
but I was finally able to put together the following program. It works
great with one exception: some of my vcards have diacritical marks in them
for Spanish/French names, so Apple's Addressbook spits out the vcard file in
UTF16, which my little Perl script promptly ignores.
I Googled Perl UTF8 and read the relevant chapter in the Llama, and read
perlrun and perlunicode as well, but at this point I am frankly over my
head. How can I embed something into this script that will make perl
understand it will receive UTF16 characters?
I want a file that I can use in iso-8859-1 (i.e. my Linux box). I know
there are modules that deal with Vcards and have spotted a couple of them on
CPAN, but the UTF16 really messes me up. Binmode doesn't seem to do it for
me. I'm using Perl 5.8.6, which I see has better UTF support, but I've still
got a long way to go. Pointers will be gratefully accepted. Once I get this
straightened out the next challenge is what to do for folks that have more
than one email address.
Randy
PS for what it's worth, this was fun, but I am clearly over my head at this
point.
#!/usr/bin/perl -Tw
# V2M - Vcard to Mutt utility
# This little perl script reads in a vcard file generated by Apple's
# Addressbook (and hopefully other Vcard files as well, though
# frankly I can't be bothered to check)
# and converts them into a Mutt alias file.
#
#
# a mutt alias file looks like the following:
# alias First_LastName Whatever the Name Is <address@example.foobar>
use utf8;
my $mail;
my $dashname;
my $realname;

while ( <> )
{
    if (/^FN:(.*)/i)    # find a name - everything after FN:
    {
        s/FN://g;
        s/\s$//g;
        $realname = $_;
        s/\s/_/g;
        $dashname = $_;
        print "alias $dashname $realname ";
    }
    elsif (/:{1}([^\s()\[\]{}@,]*
                 @
                 [^\s()\[\]{}@,.]{1}
                 [^\s()\[\]{}@,]*
                 \.
                 [a-z0-9]{2,4})/xi)    # find an email address
    {
        $mail = $1;
        print "<$mail>\n";
    }
}
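
One possible approach to the UTF-16 question, as a hedged sketch rather than a drop-in fix: since 5.8, perl can decode input and encode output through PerlIO encoding layers. "use utf8" only declares that the script's own source is UTF-8 and does not affect the files it reads. The sample name, the temporary file and the assumption that the vCard starts with a byte-order mark are all made up for illustration; a file without a BOM would need an explicit UTF-16LE or UTF-16BE layer instead.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Build a small UTF-16 vCard fragment so the sketch is self-contained.
# encode('UTF-16', ...) writes a byte-order mark followed by the text.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
binmode $tmp;
print {$tmp} encode( 'UTF-16',
    "FN:Jos\x{e9} Nu\x{f1}ez\r\nEMAIL:jose\@example.com\r\n" );
close $tmp or die "close: $!";

# The UTF-16 layer reads the byte-order mark and decodes to characters,
# so regexes such as /^FN:(.*)/ see ordinary text.
open my $in, '<:encoding(UTF-16)', $path or die "open $path: $!";
my @lines;
while ( my $line = <$in> ) {
    $line =~ s/\r?\n\z//;    # vCards normally use CRLF line endings
    push @lines, $line;
}
close $in;

# Characters are converted to ISO-8859-1 bytes on the way out.
binmode STDOUT, ':encoding(iso-8859-1)';
print "$_\n" for @lines;
```

In the script above the same effect would come from opening the vCard explicitly with the input layer instead of relying on the bare <> loop, and setting the output layer on STDOUT once before printing.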
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc. For subscription or unsubscription requests, send
#the single line:
#
# subscribe perl-users
#or:
# unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.
NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 348
**************************************