[29107] in Perl-Users-Digest


Perl-Users Digest, Issue: 351 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Mon Apr 16 21:14:18 2007

Date: Mon, 16 Apr 2007 18:14:12 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Mon, 16 Apr 2007     Volume: 11 Number: 351

Today's topics:
    Re: looking for some size optimization <nospam-abuse@ilyaz.org>
    Re: looking for some size optimization <uri@stemsystems.com>
    Re: looking for some size optimization <uri@stemsystems.com>
    Re: looking for some size optimization (Marc Espie)
    Re: looking for some size optimization (Marc Espie)
    Re: looking for some size optimization <uri@stemsystems.com>
        Socket creation failing with "operation now in progress" error <google@macrotex.net>
    Re: UTF16 input file to ISO-8859-1 output <n.pontikos@laconic.com>
    Re: UTF16 input file to ISO-8859-1 output <jurgenex@hotmail.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Mon, 16 Apr 2007 22:06:49 +0000 (UTC)
From:  Ilya Zakharevich <nospam-abuse@ilyaz.org>
Subject: Re: looking for some size optimization
Message-Id: <f00s1p$1inh$1@agate.berkeley.edu>

[A complimentary Cc of this posting was sent to
Marc Espie
<espie@nerim.net>], who wrote in article <f00b1i$dmi$1@biggoron.nerim.net>:
> >  ME>         my $file= File::Spec->canonpath($self->fullname());
> >  ME>         if (exists $all_conflict->{$file}) {
> >  ME>                 $list->{$all_conflict->{$file}}->{$pkgname} ||=
> >  ME>                         [@{$all_conflict->{$file}}, $pkgname ];

> We are talking 4000 packages. Which contain about 700000 files, total.

So you want a hash with 700000 entries?  With 100b/entry overhead, it
will take at least 70M + size of entries.

You did not explain how the lengths of the arrays stored as hash
values are distributed.  E.g., if you have a lot of singletons,
it would save space to store them directly in the hash (the price
being an extra check when adding a new entry).
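
A minimal sketch of the singleton idea, assuming the hash maps a file
name to the packages that contain it. The helper names add_owner and
owners_of and the sample paths are made up for illustration; they are
not taken from the script under discussion.

```perl
use strict;
use warnings;

# Hypothetical helper: record that $pkgname contains $file.
# A file seen in one package is stored as a plain string; the value is
# promoted to an array ref only when a second package shows up.
sub add_owner {
    my ($owners, $file, $pkgname) = @_;
    my $cur = $owners->{$file};
    if (!defined $cur) {
        $owners->{$file} = $pkgname;            # singleton: no array at all
    }
    elsif (ref $cur) {
        push @$cur, $pkgname;                   # already a list
    }
    else {
        $owners->{$file} = [ $cur, $pkgname ];  # promote on second owner
    }
    return;
}

# Hypothetical accessor that hides the two representations.
sub owners_of {
    my ($owners, $file) = @_;
    my $cur = $owners->{$file};
    return () unless defined $cur;
    return ref $cur ? @$cur : ($cur);
}

my %owners;
add_owner(\%owners, '/usr/local/bin/foo', 'foo-1.0');
add_owner(\%owners, '/usr/local/bin/bar', 'bar-2.1');
add_owner(\%owners, '/usr/local/bin/foo', 'foo-1.0-no_x11');
```

The extra check mentioned above is the ref() test: every reader of the
hash has to go through something like owners_of, which is the price for
not allocating an array per singleton.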

You may also win by enumerating the packages, and storing the index:

  my $idx = $seen{$pkgname};
  $idx = $seen{$pkgname} = ++$max_pk_idx, $pk[$idx] = $pkgname
      unless defined $idx;
  push @{$all_conflict->{$file}}, 0 + $idx;	# Cvt to "pure-number"

Also, take care not to stringify $all_conflict->{$file}[$n]; copy it
to a scratch variable if you ever need the string value of it (a
number used as a string keeps a cached string copy, which costs memory)...

Hope this helps,
Ilya


------------------------------

Date: Mon, 16 Apr 2007 18:25:13 -0400
From: Uri Guttman <uri@stemsystems.com>
Subject: Re: looking for some size optimization
Message-Id: <x74png554m.fsf@mail.sysarch.com>

>>>>> "ME" == Marc Espie <espie@lain.home> writes:

  >> that is slow as you copy the existing array back into another anon
  >> array. and you have a lot of code redundancy all over this. you always
  >> are assigning an anon array but autovivification will handle that for
  >> you. put this before the if (and i am not even sure you need a
  >> conditional there at all but i haven't followed the logic flow)
  >> 
  >> push( @{$all_conflict->{$file}}, $pkgname ) ;

  ME> But still, the new structure takes less space than the old one.
  ME> The trick is that there is one single ref for each pkgname set now.

that may be the case but the intermediate storage will be dramatically
reduced too with my code.

  ME> The simple line overhead will build one separate entry for each file
  ME> and this looks like it is the stuff gobbling memory.

i will take a look at that when i can.

  >> as for the conflict hash, i am sure it can be reduced but i don't know
  >> the logic. you change $all_conflict->{$file} after each push (your ||=
  >> code) which makes no sense to me. maybe you should clearly explain the
  >> data structure you want to get out of this. i have yet to see such an
  >> explanation in this thread (or i am not awake yet). i can't see how 4000
  >> entries of maybe a few hundred bytes each will use up 250MB (or even 190).

  ME> We are talking 4000 packages. Which contain about 700000 files, total.

i didn't see the 700k files part anywhere before. the dataset wasn't
defined for me.

  >> knowing that the build code is poorly designed, now i am confident that
  >> the data structure is also poorly designed and can be majorly optimized.

  ME> Go ahead, knock yourself out...

  ME> You can get this through OpenBSD's cvsweb, from openbsd.org.
  ME> http://www.openbsd.org/cgi-bin/cvsweb/ports/infrastructure/package/find-all-conflicts

is this your program for the bsd distro? or are you trying to improve it
for them?

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs  ----------------------------  http://jobs.perl.org


------------------------------

Date: Mon, 16 Apr 2007 19:21:40 -0400
From: Uri Guttman <uri@stemsystems.com>
Subject: Re: looking for some size optimization
Message-Id: <x7vefv52ij.fsf@mail.sysarch.com>

>>>>> "ME" == Marc Espie <espie@lain.home> writes:

  ME> In article <x7wt0c5q4m.fsf@mail.sysarch.com>,
  ME> Uri Guttman  <uri@stemsystems.com> wrote:
  >> 
  >> any url to get that directly? if that is published code then my autoviv
  >> fix will save tons of time for many users. that copy anon arrays to
  >> themselves thing is massively bad code. for more on autovivification see
  >> my article at http://sysarch.com/Perl/autoviv.txt.

  >> knowing that the build code is poorly designed, now i am confident that
  >> the data structure is also poorly designed and can be majorly optimized.

  ME> Go ahead, knock yourself out...

  ME> You can get this through OpenBSD's cvsweb, from openbsd.org.
  ME> http://www.openbsd.org/cgi-bin/cvsweb/ports/infrastructure/package/find-all-conflicts

a quick scan shows many places to improve that script, noticeably in
speed. one of your change logs comments on 5% speedup and i can easily
get more as you are still doing that anon array copy stuff.

it would help to have access to the package data and some functional
specs. i will still do a basic code review here in a day or so if the
time works out. i don't understand the logic or goals enough to do a
deeper review as in shrinking the data structures which is the big
issue.

do you have any descriptions of the package info and what defines a
package collision? this is something which should be in the
docs/pod/comments of the script.

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs  ----------------------------  http://jobs.perl.org


------------------------------

Date: Mon, 16 Apr 2007 23:41:23 +0000 (UTC)
From: espie@lain.home (Marc Espie)
Subject: Re: looking for some size optimization
Message-Id: <f011j3$nlc$1@biggoron.nerim.net>

In article <x7wt0c5q4m.fsf@mail.sysarch.com>,
Uri Guttman  <uri@stemsystems.com> wrote:
>any url to get that directly? if that is published code then my autoviv
>fix will save tons of time for many users. that copy anon arrays to
>themselves thing is massively bad code. for more on autovivification see
>my article at http://sysarch.com/Perl/autoviv.txt.
>
>knowing that the build code is poorly designed, now i am confident that
>the data structure is also poorly designed and can be majorly optimized.

Please explain to me how this is massively bad code. I know about
auto-vivification. I don't see any unintended auto-vivification in my
data structure.

Also, when you talk about copying anon arrays to themselves, I assume you're
referring to code like:
                $pkg_list->{$all_conflict->{$file}}->{$pkgname} ||=
                    [@{$all_conflict->{$file}}, $pkgname ];
                $all_conflict->{$file} =
                    $pkg_list->{$all_conflict->{$file}}->{$pkgname};

but it doesn't intend to copy anon arrays to themselves. The idea
is that for each set of pkgname1, pkgname2, pkgname3, I want to
have one single ref to an array with [pkgname1, pkgname2, pkgname3].

Assuming I have initially $all_conflict->{$file} corresponding to
$r = [$pkgname1, $pkgname2], then I will try to build
$pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
unique new reference.

I can use references as keys in hashes, right ?
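
A small sketch of the sharing scheme described above, as I read it.
The wrapper note_file, the %single hash for first sightings, and the
sample file names are assumptions made for the example; only the
chained ||= statement mirrors the quoted code.

```perl
use strict;
use warnings;

my (%all_conflict, %pkg_list, %single);

# Hypothetical wrapper around the two statements quoted above.
sub note_file {
    my ($file, $pkgname) = @_;
    if (my $r = $all_conflict{$file}) {
        # "$r" stringifies to something like ARRAY(0x...), which is unique
        # for as long as the array it names is alive.
        $all_conflict{$file} = $pkg_list{$r}{$pkgname} ||= [ @$r, $pkgname ];
    }
    else {
        # Assumption: first sightings are shared per package as well.
        $all_conflict{$file} = $single{$pkgname} ||= [$pkgname];
    }
    return;
}

note_file($_, 'pkgname1') for qw(/bin/a /bin/b /bin/c);
note_file($_, 'pkgname2') for qw(/bin/a /bin/b);

# /bin/a and /bin/b now point at the very same [pkgname1, pkgname2] array.
print "shared\n" if $all_conflict{'/bin/a'} == $all_conflict{'/bin/b'};
```

Two caveats with this kind of key: the hash key is only the string form
of the reference, so it cannot be turned back into the array, and the
sharing depends on packages being added in the same order for every
file.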


------------------------------

Date: Mon, 16 Apr 2007 23:51:59 +0000 (UTC)
From: espie@lain.home (Marc Espie)
Subject: Re: looking for some size optimization
Message-Id: <f0126v$o3s$1@biggoron.nerim.net>

In article <x7vefv52ij.fsf@mail.sysarch.com>,
Uri Guttman  <uri@stemsystems.com> wrote:
>it would help to have access to the package data and some functional
>specs. i will still do a basic code review here in a day or so if the
>time works out. i don't understand the logic or goals enough to do a
>deeper review as in shrinking the data structures which is the big
>issue.

It's probably impractical to give you access to the package data. A full
package build is slightly over 5G in size these days....

>do you have any descriptions of the package info and what defines a
>package collision? this is something which should be in the
>docs/pod/comments of the script.

The documentation is scattered in other parts of the tools. Basic conflict
information is documented in OpenBSD::PkgCfl(3p)
The packing-list format is partly documented in pkg_create(1), partly
in OpenBSD::PackingElement(3p). The format of package specifications is
explained in package_specs(7) (you can get all this through the cvsman
interface in OpenBSD).

These are somewhat big tools, and still evolving. The part that's more or
less stable is documented...

As far as find-all-conflicts goes, there are two passes: one that
builds a correspondence between each file-like object in all packages,
and the corresponding pkgnames that contain this object, and a second
pass that checks for each resulting list of pkgnames whether there is
an actual collision. Some packages cannot be installed at the same time,
as they conflict (they usually are variant versions of the same software),
so it's okay for the same file to be present in both packages. The situation
is slightly more complex, because each package actually pulls a dependency
tree, so we have to look at the closure of packages through dependencies:
even if a file belongs to 2 packages that do not conflict directly, there
might be a conflict deeper in the dependency chain, which results in both
packages not being installable simultaneously anyways.

In the end, the script reports the (very small) number of files which happen
to collide, but that no-one is yet aware of. As a result, we decide whether
we want to change the file names, or whether the conflict should be
explicitly registered...
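
To make the two passes concrete, here is a toy sketch. The package
names, file names and the %registered table are invented sample data,
and the dependency closure is left out entirely, so this only
illustrates the shape of the algorithm, not find-all-conflicts itself.

```perl
use strict;
use warnings;

# Made-up sample data: package name => files it installs.
my %plist = (
    'vim-7.0'        => [ '/usr/local/bin/vim' ],
    'vim-7.0-no_x11' => [ '/usr/local/bin/vim' ],
    'foo-1.0'        => [ '/usr/local/bin/tool' ],
    'bar-2.0'        => [ '/usr/local/bin/tool' ],
);

# Made-up registered conflicts, stored symmetrically.
my %registered = (
    'vim-7.0'        => { 'vim-7.0-no_x11' => 1 },
    'vim-7.0-no_x11' => { 'vim-7.0'        => 1 },
);

# Pass 1: map each file to the packages that contain it.
my %owners;
for my $pkg (sort keys %plist) {
    push @{ $owners{$_} }, $pkg for @{ $plist{$pkg} };
}

# Pass 2: report files shared by packages with no registered conflict.
my @unknown;
for my $file (sort keys %owners) {
    my @pkgs = @{ $owners{$file} };
    next if @pkgs < 2;
    for my $i (0 .. $#pkgs - 1) {
        for my $j ($i + 1 .. $#pkgs) {
            next if $registered{ $pkgs[$i] }{ $pkgs[$j] };
            push @unknown, "$file: $pkgs[$i] vs $pkgs[$j]";
        }
    }
}
print "$_\n" for @unknown;
```

Here the two vim variants share a file but are registered as
conflicting, so only the foo/bar collision is reported.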


------------------------------

Date: Mon, 16 Apr 2007 20:30:05 -0400
From: Uri Guttman <uri@stemsystems.com>
Subject: Re: looking for some size optimization
Message-Id: <x7odln4zci.fsf@mail.sysarch.com>

>>>>> "ME" == Marc Espie <espie@lain.home> writes:

  ME> In article <x7wt0c5q4m.fsf@mail.sysarch.com>,
  ME> Uri Guttman  <uri@stemsystems.com> wrote:
  >> any url to get that directly? if that is published code then my autoviv
  >> fix will save tons of time for many users. that copy anon arrays to
  >> themselves thing is massively bad code. for more on autovivification see
  >> my article at http://sysarch.com/Perl/autoviv.txt.
  >> 
  >> knowing that the build code is poorly designed, now i am confident that
  >> the data structure is also poorly designed and can be majorly optimized.

  ME> Please explain to me how this is massively bad code. I know about
  ME> auto-vivification. I don't see any unintended auto-vivification in my
  ME> data structure.

it doesn't autovivify at all is my point. 

  ME> Also, when you talk about copying anon arrays to themselves, I assume you're
  ME> referring to code like:
  ME>                 $pkg_list->{$all_conflict->{$file}}->{$pkgname} ||=
  ME>                     [@{$all_conflict->{$file}}, $pkgname ];
  ME>                 $all_conflict->{$file} =
  ME>                     $pkg_list->{$all_conflict->{$file}}->{$pkgname};

  ME> but it doesn't intend to copy anon arrays to themselves. The idea
  ME> is that for each set of pkgname1, pkgname2, pkgname3, I want to
  ME> have one single ref to an array with [pkgname1, pkgname2, pkgname3].

ok, then you can make a better data structure for that. i still don't
get the conflict logic as it isn't spelled out. 

  ME> Assuming I have initially $all_conflict->{$file} corresponding to
  ME> $r = [$pkgname1, $pkgname2], then I will try to build
  ME> $pkg_list->{$r}->{$pkgname3} = [$pkgname1, $pkgname2, $pkgname3] as a
  ME> unique new reference.

  ME> I can use references as keys in hashes, right ?

yes, you can do that. 

i need to grok the conflict logic. my gut says there is a much better
structure out there and i trust it well.

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs  ----------------------------  http://jobs.perl.org


------------------------------

Date: 16 Apr 2007 17:17:20 -0700
From: google@macrotex.net
Subject: Socket creation failing with "operation now in progress" error
Message-Id: <1176769040.497931.272380@q75g2000hsh.googlegroups.com>

I am attempting to establish a socket connection using the
IO::Socket::INET module and occasionally the socket creation fails
with the message "Operation now in progress". From what I understand,
this error can only happen when attempting to create a socket in non-
blocking mode, never in blocking mode. However, I have explicitly set
the Blocking parameter to 1.

So, my question is: what could be causing this error? Does
IO::Socket::INET ignore the Blocking parameter?

The OS is 64-bit Redhat Enterprise Linux 4; the version of perl is
5.8.5; the IO::Socket::INET is version 1.27.

Here is the portion of code where the problem happens
########################
$socket = IO::Socket::INET->new(
                                 PeerAddr  => 'my.ip.address.com',
                                 PeerPort  => 5000,
                                 Proto     => 'tcp',
                                 Type      => SOCK_STREAM,
                                 Timeout   => 30,
                                 Blocking  => 1,
                                  ) ;
  }

  # Did the socket get created correctly?
  if (!$socket)
  {
    LogError("Socket to $IP could not be created. "
           . "Reason: $!") ;
    exit
  }
##########################
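
One thing worth checking, as a hedged sketch: IO::Socket::INET puts its
own failure reason in $@, while $! may hold an errno left over from an
internal call. As far as I can tell, when Timeout is set the module
does the connect in non-blocking mode internally, which would explain a
stale "Operation now in progress". The helper names failure_reason and
connect_or_reason below are made up for the example.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical helper: prefer the module's own message in $@ and fall
# back to the OS error text only when $@ is empty.
sub failure_reason {
    my ($eval_error, $os_error) = @_;
    return $eval_error if defined $eval_error && length $eval_error;
    return "$os_error";
}

# Hypothetical wrapper; host and port are whatever the caller passes.
# Returns (socket, undef) on success or (undef, reason) on failure.
sub connect_or_reason {
    my ($host, $port) = @_;
    my $socket = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 30,
        Blocking => 1,
    );
    return ($socket, undef) if $socket;
    return (undef, failure_reason($@, $!));
}
```

Logging both $@ and $! in the error branch would show whether the real
cause is a timeout or a refused connection rather than the non-blocking
artifact.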



------------------------------

Date: Mon, 16 Apr 2007 20:32:27 +0100
From: nikolas pontikos <n.pontikos@laconic.com>
Subject: Re: UTF16 input file to ISO-8859-1 output
Message-Id: <f00hd0$jk7m$1@uns-a.ucl.ac.uk>

R Wood wrote:
> All - 
> 
> I came up with what I thought would be a fun project - a little Perl script
> to take a Vcard input file and parse it into a mutt alias file.  For
> non-mutt users, a mutt alias file looks like the following:
> 
> alias ALIASNAME Firstname Lastname <email@example.com>
> 
> and I've standardized on ALIASNAMEs that look like Firstname_Lastname.
> 
> So it's just a couple of searches and print statements.  It took me awhile
> since the last time I programmed anything was Pascal back in college ('93),
> but I was finally able to put together the following program.  It works
> great with one exception: some of my vcards have diacritical marks in them
> for Spanish/French names, so Apple's Addressbook spits out the vcard file in
> UTF16, which my little Perl script promptly ignores.
> 
> I Googled Perl UTF8 and read the relevant chapter in the Llama, and read
> perlrun and perlunicode as well, but at this point I am frankly over my
> head.  How can I embed something into this script that will make perl
> understand it will receive UTF16 characters?  
> 
> I want a file that I can use in iso-8859-1 (i.e. my Linux box).  I know
> there are modules that deal with Vcards and have spotted a couple of them on
> CPAN, but the UTF16 really messes me up.  Binmode doesn't seem to do it for
> me.  I'm using Perl 5.8.6, which I see has better UTF support, but I've
> still got a long way to go.  Pointers will be gratefully accepted.  Once I
> get this straightened out the next challenge is what to do for folks that
> have more than one email address.
> 
> Randy
> PS for what it's worth, this was fun, but I am clearly over my head at this
> point.
> 
> #!/usr/bin/perl -Tw
> 
> #     V2M - Vcard to Mutt utility
> #	This little perl script reads in a vcard file generated by Apple's 
> #	Addressbook (and hopefully other Vcard files as well, though
> #	frankly I can't be bothered to check)
> #	and converts them into a Mutt alias file.
> #
> #
> #  a mutt alias file looks like the following:
> # alias First_LastName Whatever the Name Is <address@example.foobar>
> 
>  use utf8;
> 
>  my $mail;
>  my $dashname;
>  my $realname;
> 
> while ( <> )
> {
> 	if  (/^FN:(.*)/i) 		# find a name - everything after FN:
> 	{
> 	s/FN://g;
> 	s/\s$//g;
> 	$realname = $_;
> 
> 	s/\s/_/g;
> 	$dashname = $_;
> 	print "alias $dashname $realname ";
> 	}
> 
> 	elsif (/:{1}([^\s()\[\]{}@,]*
>                   @
>                   [^\s()\[\]{}@,.]{1}
>                   [^\s()\[\]{}@,]*
>                   \.
>                   [a-z0-9]{2,4})/xi)		# find an email address
> 	{
> 		$mail = $1;
> 		print "<$mail>\n";
> 	}
> 
> }

Sorry I can't really help you with the perl as I am myself a newbie.

About the issue of character encodings:

May I suggest you convert the file to UTF-8 and then feed it to your 
Perl script?

Obviously you need to know what encoding the Vcard file is in before you
attempt any conversion between encodings.  Do the Vcard files have a
byte order mark (BOM, a short sequence of bytes) at the start of the file
indicating that the file is in UTF-16?

If they do then check for that and then do the conversion from UTF-16 to 
UTF-8.
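
For what it's worth, the conversion can also be done inside Perl with
the core Encode module (available since 5.8). This is only a sketch,
assuming the file does start with a BOM; the helper name
utf16_to_latin1 and the sample bytes are made up. Note that "use utf8"
only tells perl that the script source itself is UTF-8, it does not
affect input data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: raw UTF-16 bytes (with BOM) in, ISO-8859-1 bytes out.
sub utf16_to_latin1 {
    my ($bytes) = @_;
    # 'UTF-16' reads the byte order from the BOM and croaks if there is none.
    my $text = decode('UTF-16', $bytes);
    # Characters with no ISO-8859-1 equivalent come out as '?' by default.
    return encode('iso-8859-1', $text);
}

# "Jos" plus e-acute as UTF-16LE with its BOM (FF FE), built by hand.
my $sample = "\xFF\xFE" . "J\0o\0s\0\xE9\0";
my $latin1 = utf16_to_latin1($sample);
```

For a whole file, opening it with the PerlIO layer
'<:encoding(UTF-16)' and calling binmode(STDOUT, ':encoding(iso-8859-1)')
should give the same result line by line.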





------------------------------

Date: Mon, 16 Apr 2007 19:21:51 GMT
From: "Jürgen Exner" <jurgenex@hotmail.com>
Subject: Re: UTF16 input file to ISO-8859-1 output
Message-Id: <j1QUh.17366$807.9219@trndny09>

R Wood wrote:
[long posting snipped]

Without reading through all that text and code: did you have a look at the
Text::Iconv module?
When I had to convert text between a dozen different encodings a while ago 
it proved to be very useful.

jue




------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc.  For subscription or unsubscription requests, send
#the single line:
#
#	subscribe perl-users
#or:
#	unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.  

NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice. 

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 351
**************************************

