[32027] in Perl-Users-Digest


home	help	back	first	fref	pref	prev	next	nref	lref	last	post
Perl-Users Digest, Issue: 3291 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Feb 19 11:09:30 2011

Date: Sat, 19 Feb 2011 08:09:07 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Sat, 19 Feb 2011     Volume: 11 Number: 3291

Today's topics:
    Re: Hashes are good, but not good enough. <hjp-usenet2@hjp.at>
    Re: Hashes are good, but not good enough. <hjp-usenet2@hjp.at>
    Re: Hashes are good, but not good enough. <blgl@stacken.kth.se>
    Re: List Separator $, behaving oddly sharma__r@hotmail.com
    Re: List Separator $, behaving oddly <derykus@gmail.com>
        really need all the %{ ${ ${$_}{x} }{y} } jazz? jidanni@jidanni.org
    Re: Regex question; match <br> after opening tag <cartercc@gmail.com>
    Re: Regex question; match <br> after opening tag <tadmc@seesig.invalid>
        Rep Exp multiple matches <kiloran.public+news@gmail.com>
    Re: Rep Exp multiple matches <kiloran.public+news@gmail.com>
    Re: Rep Exp multiple matches <rk_dn@yahoo.de>
    Re: Rep Exp multiple matches <jurgenex@hotmail.com>
    Re: Rep Exp multiple matches sharma__r@hotmail.com
    Re: Strategies for modifying marked-up text? <cartercc@gmail.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Fri, 18 Feb 2011 20:55:56 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Hashes are good, but not good enough.
Message-Id: <slrniltjmd.fcs.hjp-usenet2@hrunkner.hjp.at>

On 2011-02-13 22:46, Ilya Zakharevich <nospam-abuse@ilyaz.org> wrote:
> On 2011-02-12, Peter J  Holzer <hjp-usenet2@hjp.at> wrote:
>>> I find this claim 
>>> 
>>>   a DAWG is very similar to a trie, but it is much more space efficient
>>> 
>>> very suspicious.  If you want DAWG to be dynamic, then extra "parent"
>>> pointers you need to keep would, in most situations, far outweight the
>>> space savings.  If DAWG is not intended to change, you do not need
>>> parent pointers,
>>
>> I don't think a DAWG is intended to change. Inserts and deletions look
>> expensive to me. You probably build it once from a collection of words
>> and then do only lookups.
>
> There is no problem on deletion.

A deletion may force you to split nodes. For example, use the DAWG from
the wikipedia article which represents the words "tap", "taps", "top",
"tops":


digraph "g1" {
    n0 -> n1 [ label="t" ];
    n1 -> n2 [ label="a" ];
    n1 -> n2 [ label="o" ];
    n2 -> n3 [ label="p" ];
    n3 -> n4 [ label="s" ];
    n3 -> n5 [ label="EOW" ];
    n4 -> n5 [ label="EOW" ];
}

[use the dotty program from the graphviz package to view the graph]

When you delete "tops" from this DAWG, nodes n2 and n3 can no longer be
used for "top" since they would also lead to "tops". So you have to
create new nodes for "top":

digraph "g2" {
    n0 -> n1 [ label="t" ];
    n1 -> n2 [ label="a" ];
    n1 -> n6 [ label="o" ];
    n2 -> n3 [ label="p" ];
    n3 -> n4 [ label="s" ];
    n3 -> n5 [ label="EOW" ];
    n4 -> n5 [ label="EOW" ];
    n6 -> n7 [ label="p" ];
    n7 -> n5 [ label="EOW" ];
}

In this case I think this is the only possible result of the deletion,
but in more complex DAWGs there may be more than one possible result, so
you also have to find the optimal out of several possible results. 

> For insertion, one needs a
> dictionary of suffixes - which is, essentially, given by parent
> pointers, since they form a DAWG as well.

An insertion may add or remove nodes (or both). For example if we start
at ("tap", "top"):

digraph "g3" {
    n0 -> n1 [ label="t" ];
    n1 -> n2 [ label="a" ];
    n1 -> n2 [ label="o" ];
    n2 -> n3 [ label="p" ];
    n3 -> n5 [ label="EOW" ];
}

and add "taps", we need to split n2 and n3 to get graph g2. And if we
then add "tops" we can again merge nodes to get to graph g1. Having a
suffix dictionary probably helps (I haven't tried to implement it), but
in any case it doesn't look quite straightforward to me, and a suffix
dictionary needs extra space and since the whole reason for a DAWG is to
save space you wouldn't want that (I'm not sure whether you can even
call a DAWG with parent pointers a DAWG).


>>> but still, I can't imagine many situations in which
>>> the tail-folding would save much.
>>
>> The data structure seems to be designed to represent a dictionary of
>> (English) words. Many words share a common suffix, so would expect 
>> significant space savings compared to a trie.
>
> "Significant" in which sense?  For me, half an order of magnitude is
> significant.  30% is not - in most situations Perl is used for.

My rule of thumb is that it's worth considering if it's an improvement
by a factor of at least two (i.e., it saves at least 50% of memory). But
of course that depends on the problem. If I have a 16 bit system a
reduction from 70k (too big) to 60k (small enough) may be worthwhile. On
my workstation with 4GB RAM I may not care about the possibility of
reducing memory consumption from 2GB to 0.5 GB.

> I would expect it to be of second type for English words - but it is
> just a guts feeling...

My gut feeling is that it's better. But I don't have any data for it so
I won't make any claims.

> (And I was considered "DAWG vs *folded* tries"
> - those where a node contains a substring, not a char.)

I found several references to "folded tries", but all of them were about
storing unicode strings and they stored only part of a character per
node. Anyway, whatever optimization you use, you can also use it for a
DAWG, so I don't see the difference.


>> If you use it to store 
>> data without common suffixes it will degenerate into a trie, so it should
>> never be worse than a trie, IMHO. 
>
> DAWG may be MUCH worse than a folded trie - even if you fold a DAWG.

You didn't show why my assumption that DAWG will degenerate into a trie
in the worst case is false.

	hp



------------------------------

Date: Fri, 18 Feb 2011 21:18:47 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Hashes are good, but not good enough.
Message-Id: <slrniltl18.fcs.hjp-usenet2@hrunkner.hjp.at>

On 2011-02-14 15:58, Ted Zlatanov <tzz@lifelogs.com> wrote:
> On Fri, 11 Feb 2011 22:20:35 +0000 (UTC) Ilya Zakharevich <nospam-abuse@ilyaz.org> wrote: 
>
> IZ> On 2011-02-11, Ted Zlatanov <tzz@lifelogs.com> wrote:
> GM> We all find hashes pretty convenient, and helpful to make things fast; but
> GM> the problem with hashes is that they consume too much memory on big trees,
> GM> and they are too slow for hundred millions of keys.
>>> 
>>> This is true and any experienced programmer will tell you not to put
>>> hundreds of millions of keys in a hash in any language.
>
> IZ> ???!!!  How comes?  If what I need is a hash-like semantic, and I need
> IZ> "hundreds of millions of keys", why would not I use hash?  I never
> IZ> programmed in this "any language", so you can reduce your arguments to
> IZ> Perl.  Again, to be specific, assume that this particular process has
> IZ> ulimit of 100GB heap, and other processes do not use much memory, so
> IZ> there won't be any swapping...
>
> On Sat, 12 Feb 2011 07:46:56 +0100 "Peter J. Holzer" <hjp-usenet2@hjp.at> wrote: 
>
> PJH> Actually, only part of this is true. Hashes do consume a lot of memory,
> PJH> but they are not slow for a large number of keys. As Willem correctly
> PJH> pointed out, they are O(n) on the length of the key but O(1) on the
> PJH> number of keys. As long as your hash fits into RAM, lookups and
> PJH> insertions are the same speed whether your hash contains 100 keys or
> PJH> 100,000,000.
>
> Hundreds of millions of hash keys will take up a lot of memory, probably
> unnecessarily.

If it's not necessary then don't use it. We are of course talking about
the case where a hash is the appropriate data structure.

> Iterating through them will be painful even unsorted (as
> with any large amount of data, it's the management that's hard, not
> accumulating it).

Hashes are generally used for fast random lookup, not for iterating. 

> I would *at least* partition the keys with a
> multi-level hash.

That takes even more memory, possibly a lot more (depends where you
split the key), and it doesn't improve locality. It may help if you only
need to access data from one partition, but then why are you putting the
other data into memory at all? It also helps for sorting, but you may
not have to sort your data.


> So I stand by "don't put hundreds of millions of keys
> in a hash" whether you're using Perl or Java or C or C++.  It may be
> possible, but it's not practical.

It is practical as long as it fits into memory. One case where I could
have used a hash of 50+ million elements (but I could't at the time,
since that was on a 32 bit system so it wouldn't fit into memory) was a
comparison between to datasets: Read the first one into a hash, then
step through the second one record by record and look up the data from
the first. Total time: roughly the time to read the two files
sequentially. What I did instead was sort both datasets and them compare
the sorted files. Works, but was a lot slower (but still faster then
letting Oracle do the comparison).


> If a *string set* is required, then much more compact structures exist.

Yes. But you weren't talking about string sets. You claimed that having
huge hashes is slow (it isn't) and always a mistake (it isn't).

	hp



------------------------------

Date: Sat, 19 Feb 2011 12:41:36 +0100
From: Bo Lindbergh <blgl@stacken.kth.se>
Subject: Re: Hashes are good, but not good enough.
Message-Id: <ijoa9e$fjl$1@speranza.aioe.org>

In article <slrnilgnpg.b2b.nospam-abuse@powdermilk.math.berkeley.edu>,
 Ilya Zakharevich <nospam-abuse@ilyaz.org> wrote:
> For insertion, one needs a
> dictionary of suffixes - which is, essentially, given by parent
> pointers, since they form a DAWG as well.

No, they don't.  Consider the DAWG matching the two words "abc" and "bc".
The vertex with an outgoing edge labelled "c" has two incoming edges,
both labelled "b".

In article <844537681319214326.522087hjp-usenet2-hjp.at@zeno.hjp.at>,
 Peter J. Holzer <hjp-usenet2@hjp.at> wrote:
> The data structure seems to be designed to represent a dictionary of
> (English) words. Many words share a common suffix, so would expect 
> significant space savings compared to a trie.

The /usr/share/dict/words that comes with Mac OS X 10.6 has 234936 words
totalling 2251877 bytes (newlines excluded).  A trie matching these
words (with a one-bit flag per vertex instead of end-of-word edges) has
791089 vertices.  An optimal DAWG based on this trie has 130892 vertices.
I don't know about you people, but I call a factor of six "significant
savings".


/Bo Lindbergh


------------------------------

Date: Fri, 18 Feb 2011 11:27:08 -0800 (PST)
From: sharma__r@hotmail.com
Subject: Re: List Separator $, behaving oddly
Message-Id: <98d5809a-050a-4a7c-9a9b-5c640bada5c2@o30g2000pra.googlegroups.com>

On Feb 18, 11:37=A0pm, "Peter J. Holzer" <hjp-usen...@hjp.at> wrote:
> On 2011-02-18 09:42, sharma...@hotmail.com <sharma...@hotmail.com> wrote:
>
>
>
> > when we set the $, punctuation variable two different ways we get
> > different behaviors.
>
> > perl -e '
> > # this does not work
> > local *{"::,"} =3D \do{q[:]};
> [...]
> > ';
>
> > perl -e '
> > # this works
> > local ${*{"::,"}} =3D q[:];
> [...]
> > ';
>
> > Why does this happen?
>
> May I ask why you =A0set $, in such a wierd way? Why don't you just use
>
> =A0 =A0 local $, =3D q[:];
>
> ?
>
> =A0 =A0 =A0 =A0 hp


I am trying to alter the punctuation variables inside a sub, like as
show below.

The idea is to make the sub immune from any settings of punctuation
variables that may be in effect
AND
also if we want to supersede the local punctuation inside a sub, we
may supply it from outside via a
hash whose keys are the punctuation variable symbols.

#!/usr/local/bin/perl
use strict;
use warnings;

local $, =3D q{/}; # <--- top-level setting of list separator
print "Before \$,=3D[$,] ";
print qw(A B C);
printf "\n";
my %h =3D (
	',' =3D> '+',
);

sub klm {
	my %h =3D %{$_[0]};
	local $, =3D q{=3D}; # <--- setting inside the sub(use this if not forced
from outside)
	no strict 'refs';
	${*{"::$_"}} =3D  $h{$_} for keys %h; # <--- this works, meaning proper
$, takes effect as also print shows proper separtors
#	  *{"::$_"}  =3D \$h{$_} for keys %h; # <--doesn't work, meaning
proper $, takes effect but print does not show proper separtion
	print "Inside \$,=3D[$,] ";
	print qw(A B C);
	printf "\n";
}

klm \%h;

print "After \$,=3D[$,] ";
print qw(A B C);
printf "\n";
__END__


------------------------------

Date: Fri, 18 Feb 2011 21:42:42 -0800 (PST)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: List Separator $, behaving oddly
Message-Id: <b410f44c-90de-43b4-b6bc-1223e06a315d@o30g2000pra.googlegroups.com>

On Feb 18, 1:42=A0am, sharma...@hotmail.com wrote:
> hello all,
>
> when we set the $, punctuation variable two different ways we get
> different behaviors.
>
> perl -e '
> # this does not work
> local *{"::,"} =3D \do{q[:]};
> print "\$,=3D[$,]\n";
> print "A", "B", "C"; #<--- ABC are not separated by colon, even though
> $, is set to a colon.
> printf "\n";
> ';
>
> perl -e '
> # this works
> local ${*{"::,"}} =3D q[:];
> print "\$,=3D[$,]\n";
> print "A", "B", "C"; #<--- A:B:C are colon separated & ok.
> printf "\n";
> ';
>
> Why does this happen? Is this a bug.
>
> Using: perl, v5.8.8 built for i686-linux.
>

Hm, a later version works...

perl -v
This is perl 5, version 12, subversion 2 (v5.12.2) built
for MSWin32-x86-multi-thread

perl -wle "local *{'::,'} =3D \do{q[:]};print qq{\$,=3D[$,]};print
'A','B','C'"
$,=3D[:]
A:B:C

--
Charles DeRykus


------------------------------

Date: Sat, 19 Feb 2011 10:37:43 +0800
From: jidanni@jidanni.org
Subject: really need all the %{ ${ ${$_}{x} }{y} } jazz?
Message-Id: <ijnaf4$v21$1@news.datemas.de>

Gentlemen, my program runs great, but do I really have to use all that wacky
 %{ ${ ${$_}{geometry} }{location} } )
stuff that I worked out using Data::Dumper, or is there a more familiar
way to write it?

use strict; use warnings FATAL => 'all'; local $/;
use JSON -support_by_pp;
my $v = from_json( <>, { utf8 => 0 } );
use feature "switch";
given ( $$v{status} ) {
    when ('OK') { }
    default     { die $_; }
}
for ( @{ $$v{results} } ) {
    printf "%.7f %.7f %s %s\n",
      reverse( values %{ ${ ${$_}{geometry} }{location} } ),
      ${$_}{formatted_address}, ${$_}{geometry}{location_type};

}


------------------------------

Date: Fri, 18 Feb 2011 11:35:03 -0800 (PST)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: Regex question; match <br> after opening tag
Message-Id: <1fd48ef1-b978-471a-9fc4-975a1e50980a@t19g2000prd.googlegroups.com>

On Feb 18, 1:42=A0pm, "Peter J. Holzer" <hjp-usen...@hjp.at> wrote:
> <br title=3D"a <br> element">

not valid HTML.

<br title=3D"a &lt;br&gt; element" />

CC.


------------------------------

Date: Fri, 18 Feb 2011 15:03:10 -0600
From: Tad McClellan <tadmc@seesig.invalid>
Subject: Re: Regex question; match <br> after opening tag
Message-Id: <slrniltnaf.7cd.tadmc@tadbox.sbcglobal.net>

ccc31807 <cartercc@gmail.com> wrote:
> On Feb 18, 1:42 pm, "Peter J. Holzer" <hjp-usen...@hjp.at> wrote:
>> <br title="a <br> element">
>
> not valid HTML.


Yes it is.


><br title="a &lt;br&gt; element" />


That is XHTML.

XHTML is not the same as HTML.


-- 
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.


------------------------------

Date: Sat, 19 Feb 2011 13:10:15 +0000
From: kiloran <kiloran.public+news@gmail.com>
Subject: Rep Exp multiple matches
Message-Id: <s9SdnXicPIujXMLQnZ2dnUVZ8gudnZ2d@bt.com>

Suppose I have a string such as:

$string = 'abc3abc2xyz4abc5def6abc1';

Note that the string could be of any length.

I want to print all the digits which follow 'abc'. Like:

3
2
5
1

I don't see how to handle the multiple occurrences using a regular 
expression. I imagine I need some form of loop, such as WHILE, but 
cannot see how to construct this.

Can anyone give me a few pointers?

--Alan


------------------------------

Date: Sat, 19 Feb 2011 13:24:16 +0000
From: kiloran <kiloran.public+news@gmail.com>
Subject: Re: Rep Exp multiple matches
Message-Id: <0KmdnSZT7tgaWcLQnZ2dnUVZ8kmdnZ2d@bt.com>

On 19/02/2011 13:10, kiloran wrote:
> Suppose I have a string such as:
>
> $string = 'abc3abc2xyz4abc5def6abc1';
>
> Note that the string could be of any length.
>
> I want to print all the digits which follow 'abc'. Like:
>
> 3
> 2
> 5
> 1
>
> I don't see how to handle the multiple occurrences using a regular
> expression. I imagine I need some form of loop, such as WHILE, but
> cannot see how to construct this.
>
> Can anyone give me a few pointers?
>
> --Alan

OK, I think I've found it. Takes time to read through all of the 
documentation!

while ($string =~ /abc(\d)/g) {
         print "Word is $1, ends at position ", pos $string, "\n";
     }

--Alan


------------------------------

Date: Sat, 19 Feb 2011 14:26:09 +0100
From: Rafael Koeppen <rk_dn@yahoo.de>
Subject: Re: Rep Exp multiple matches
Message-Id: <ijogd9$poa$01$1@news.t-online.com>

Am 19.02.2011 14:10, schrieb kiloran:
> Suppose I have a string such as:
>
> $string = 'abc3abc2xyz4abc5def6abc1';
>
> Note that the string could be of any length.
>
> I want to print all the digits which follow 'abc'. Like:
>
> 3
> 2
> 5
> 1
>
> I don't see how to handle the multiple occurrences using a regular
> expression. I imagine I need some form of loop, such as WHILE, but
> cannot see how to construct this.
>
> Can anyone give me a few pointers?
>
> --Alan

for example:

while ($string =~ /abc(\d)/g) {
   print $1 . "\n";
}


--
Rafael


------------------------------

Date: Sat, 19 Feb 2011 05:43:19 -0800
From: J�rgen Exner <jurgenex@hotmail.com>
Subject: Re: Rep Exp multiple matches
Message-Id: <lfhvl6p7cfmdid8ogi2dmp3dk4v2vgkghp@4ax.com>

kiloran <kiloran.public+news@gmail.com> wrote:
>Suppose I have a string such as:
>
>$string = 'abc3abc2xyz4abc5def6abc1';
>
>Note that the string could be of any length.
>
>I want to print all the digits which follow 'abc'. Like:
>
>3
>2
>5
>1
>
>I don't see how to handle the multiple occurrences using a regular 
>expression. I imagine I need some form of loop, such as WHILE, but 
>cannot see how to construct this.

No, you just need the 'g' modifier. One way:

	my @found = $string =~ m/abc\d/g;
	print join("\n", map substr($_,3,1), @found);

jue


------------------------------

Date: Sat, 19 Feb 2011 06:14:20 -0800 (PST)
From: sharma__r@hotmail.com
Subject: Re: Rep Exp multiple matches
Message-Id: <80b39923-cb73-49a9-bc28-14912b4c8400@w7g2000pre.googlegroups.com>

On Feb 19, 6:10=A0pm, kiloran <kiloran.public+n...@gmail.com> wrote:
> Suppose I have a string such as:
>
> $string =3D 'abc3abc2xyz4abc5def6abc1';
>
> Note that the string could be of any length.
>
> I want to print all the digits which follow 'abc'. Like:
>
> 3
> 2
> 5
> 1
>
> I don't see how to handle the multiple occurrences using a regular
> expression. I imagine I need some form of loop, such as WHILE, but
> cannot see how to construct this.
>
> Can anyone give me a few pointers?
>
> --Alan


use the m// operator in list context with the "g" switch turned on,
like as:

print for $string =3D~ m/(?<=3Dabc)\d/g;

--Rakesh


------------------------------

Date: Fri, 18 Feb 2011 11:31:36 -0800 (PST)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: Strategies for modifying marked-up text?
Message-Id: <dcb7e95c-123a-41ab-9334-81af0e8712bb@a28g2000prb.googlegroups.com>

On Feb 18, 1:45=A0pm, "Peter J. Holzer" <hjp-usen...@hjp.at> wrote:
> One of us completely misunderstood what Thomas is trying to achieve.

Could be me. I'm real good at that. ;-)

> As I understood it, he wants the substitution to succeed even if the
> text in the file is
>
> =A0 =A0 ... George W. <span class=3D"lastname">Bush</span> ...

Yeah, but the OP wrote:

>> I'm looking for input on how to run search/replace
>> operations on paragraphs of HTML text without having
>> to worry about the surrounding markup.

If all you are doing is searching and replacing for specific patterns,
the surrounding text doesn't matter, whether or not it's HTML markup.

CC.


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3291
***************************************

home	help	back	first	fref	pref	prev	next	nref	lref	last	post
[32027] in Perl-Users-Digest

Perl-Users Digest, Issue: 3291 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)Sat Feb 19 11:09:30 2011

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Feb 19 11:09:30 2011