[32044] in Perl-Users-Digest


home	help	back	first	fref	pref	prev	next	nref	lref	last	post
Perl-Users Digest, Issue: 3308 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Mar 5 14:09:45 2011

Date: Sat, 5 Mar 2011 11:09:10 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Sat, 5 Mar 2011     Volume: 11 Number: 3308

Today's topics:
        contexts <k4monk@gmail.com>
    Re: contexts <uri@StemSystems.com>
    Re: contexts <tadmc@seesig.invalid>
    Re: contexts <jurgenex@hotmail.com>
    Re: contexts <k4monk@gmail.com>
    Re: contexts <uri@StemSystems.com>
    Re: contexts <smallpond@juno.com>
    Re: contexts <sherm.pendley@gmail.com>
    Re: going from CPAN to RPM <hjp-usenet2@hjp.at>
    Re: Hashes are good, but not good enough. <hjp-usenet2@hjp.at>
    Re: Hashes are good, but not good enough. <hjp-usenet2@hjp.at>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Fri, 4 Mar 2011 15:00:52 -0800 (PST)
From: K4 Monk <k4monk@gmail.com>
Subject: contexts
Message-Id: <b4e0c07a-6278-4239-bcb1-31490d2bcb54@t19g2000prd.googlegroups.com>

I was reading a funny (and quite informative) article on computer
languages ( http://sites.google.com/site/steveyegge2/tour-de-babel )
when I came across the following about Perl:

"Take its "contexts", for instance, which are a horrid outgrowth of
Larry's comical decision to have N variable namespaces, dereferenced
by sigils, which he sort of copied from shell-script. In Perl, every
operator, every function, every operation in the language behaves
randomly in one of six different ways, depending on the current
"context". "

what is he trying to say here?


------------------------------

Date: Fri, 04 Mar 2011 19:39:52 -0500
From: "Uri Guttman" <uri@StemSystems.com>
Subject: Re: contexts
Message-Id: <87ipvy5rjb.fsf@quad.sysarch.com>

>>>>> "KM" == K4 Monk <k4monk@gmail.com> writes:

  KM> I was reading a funny (and quite informative) article on computer
  KM> languages ( http://sites.google.com/site/steveyegge2/tour-de-babel )
  KM> when I came across the following about Perl:

  KM> "Take its "contexts", for instance, which are a horrid outgrowth of
  KM> Larry's comical decision to have N variable namespaces, dereferenced
  KM> by sigils, which he sort of copied from shell-script. In Perl, every
  KM> operator, every function, every operation in the language behaves
  KM> randomly in one of six different ways, depending on the current
  KM> "context". "

  KM> what is he trying to say here?

that the author of the article is an idiot. he does a very good job of
making that point. he obviously doesn't know perl and given his blather
in that one short paragraph, doesn't know much about computer language
design.

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------


------------------------------

Date: Fri, 04 Mar 2011 19:56:25 -0600
From: Tad McClellan <tadmc@seesig.invalid>
Subject: Re: contexts
Message-Id: <slrnin35oj.9uc.tadmc@tadbox.sbcglobal.net>

K4 Monk <k4monk@gmail.com> wrote:

> In Perl, every
> operator, every function, every operation in the language behaves
> randomly in one of six different ways, 


There are two, not six, different ways (list context and scalar context).

They are not random, they are quite well defined.

When you want "one thing" you get scalar context, when you
want "many things" you get list context.


> depending on the current
> "context". "
>
> what is he trying to say here?


That he cannot count? 

Even when he didn't need to remove his shoes first...


-- 
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.


------------------------------

Date: Fri, 04 Mar 2011 18:01:19 -0800
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: contexts
Message-Id: <o563n6lvq70rald1vbqbgqf6brdfen3dkn@4ax.com>

K4 Monk <k4monk@gmail.com> wrote:
>I was reading a funny (and quite informative) article on computer
>languages ( http://sites.google.com/site/steveyegge2/tour-de-babel )
>when I came across the following about Perl:
>
>"Take its "contexts", for instance, which are a horrid outgrowth of
>Larry's comical decision to have N variable namespaces, dereferenced
>by sigils, which he sort of copied from shell-script. In Perl, every
>operator, every function, every operation in the language behaves
>randomly in one of six different ways, depending on the current
>"context". "
>
>what is he trying to say here?

That he doesn't know much about Perl (or fundamentals of programming
languages), but that he is very happy to proclaim his ignorance to the
world.

jue


------------------------------

Date: Fri, 4 Mar 2011 19:31:53 -0800 (PST)
From: K4 Monk <k4monk@gmail.com>
Subject: Re: contexts
Message-Id: <54f44b9e-0de3-441b-b1f3-d588895b28f5@u24g2000prn.googlegroups.com>

On Mar 5, 7:01=A0am, J rgen Exner <jurge...@hotmail.com> wrote:
> K4 Monk <k4m...@gmail.com> wrote:
> >I was reading a funny (and quite informative) article on computer
> >languages (http://sites.google.com/site/steveyegge2/tour-de-babel)
> >when I came across the following about Perl:
>
> >"Take its "contexts", for instance, which are a horrid outgrowth of
> >Larry's comical decision to have N variable namespaces, dereferenced
> >by sigils, which he sort of copied from shell-script. In Perl, every
> >operator, every function, every operation in the language behaves
> >randomly in one of six different ways, depending on the current
> >"context". "
>
> >what is he trying to say here?
>
> That he doesn't know much about Perl (or fundamentals of programming
> languages), but that he is very happy to proclaim his ignorance to the
> world.
>
> jue

Thanks, just a disclaimer:  I'm not trying to start an argument here.
He seems to have strong opinions but it was a fun read for the same
reason. Do you agree about what he says about other languages
(specifically Ruby?)


------------------------------

Date: Fri, 04 Mar 2011 23:48:54 -0500
From: "Uri Guttman" <uri@StemSystems.com>
Subject: Re: contexts
Message-Id: <871v2m5g09.fsf@quad.sysarch.com>

>>>>> "KM" == K4 Monk <k4monk@gmail.com> writes:

  KM> Thanks, just a disclaimer: I'm not trying to start an argument
  KM> here.  He seems to have strong opinions but it was a fun read for
  KM> the same reason. Do you agree about what he says about other
  KM> languages (specifically Ruby?)

as i said, i won't read it just because of his stupid comments on
perl. you can knock perl (or ANY lang) for many legit reasons but to
flame about it and show so little understanding of contexts says he is a
dope and not worth reading.

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------


------------------------------

Date: Sat, 5 Mar 2011 07:41:42 -0800 (PST)
From: smallpond <smallpond@juno.com>
Subject: Re: contexts
Message-Id: <7a6085bf-8fd7-4186-ae24-3a821ee84587@j13g2000pro.googlegroups.com>

On Mar 4, 6:00=A0pm, K4 Monk <k4m...@gmail.com> wrote:
> I was reading a funny (and quite informative) article on computer
> languages (http://sites.google.com/site/steveyegge2/tour-de-babel)
> when I came across the following about Perl:
>
> "Take its "contexts", for instance, which are a horrid outgrowth of
> Larry's comical decision to have N variable namespaces, dereferenced
> by sigils, which he sort of copied from shell-script. In Perl, every
> operator, every function, every operation in the language behaves
> randomly in one of six different ways, depending on the current
> "context". "
>
> what is he trying to say here?

That he doesn't know what the word "randomly" means.


------------------------------

Date: Sat, 05 Mar 2011 10:54:03 -0500
From: Sherm Pendley <sherm.pendley@gmail.com>
Subject: Re: contexts
Message-Id: <m2lj0tef6s.fsf@sherm.shermpendley.com>

K4 Monk <k4monk@gmail.com> writes:

> I was reading a funny (and quite informative) article on computer
> languages

I wouldn't call it "informative." Given the name of the blog (Stevey's
drunken blog rants) and the ridiculous nature of much of the criticism,
I suspect he might be trying to poke fun at language fanbois. I hope
that's his goal, because if he really meant this seriously, he needs
to sober up and hit the books.

sherm--

-- 
Sherm Pendley
                                   <http://camelbones.sourceforge.net>
Cocoa Developer


------------------------------

Date: Sat, 5 Mar 2011 18:34:02 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: going from CPAN to RPM
Message-Id: <slrnin4t0a.jb9.hjp-usenet2@hrunkner.hjp.at>

On 2011-03-04 20:28, Martijn Lievaart <m@rtij.nl.invlalid> wrote:
> On Fri, 04 Mar 2011 12:21:48 -0500, Art Werschulz wrote:
>> Justin C <justin.1102@purestblue.com> writes:
>>> On 2011-02-28, Art Werschulz <agw@dsm.fordham.edu> wrote:
>>>> Although we have used the CPAN shell to install Perl modules on our
>>>> Linux systems, we would like to use the RPM versions of same instead.
>>>> This means that we need to find out which Perl modules were installed
>>>> via the CPAN shell, nuke same, and then install the RPM versions (say,
>>>> via yum).
>>>>
>>>> How can we find out which Perl modules were installed via the CPAN
>>>> shell?
>>>
>>> perldoc -q "which modules are installed"
>> 
>> Not quite (I don't think).  I want to know which modules have been
>> installed via the CPAN shell, as opposed to which ones have been
>> installed via RPMs.
>
> Maybe something like:
>
> sed 's/::/-/g's/^/perl-/ installed-modules.txt | xargs rpm -q
>
> At least on RedHat and derivatives. Dunno about other RPM distros.

On Redhat, packages installed via RPM generally reside in vendor_perl,
while packages installed via CPAN reside in site_perl. This may also be
helpful in distinguishing them.

	hp



------------------------------

Date: Sat, 5 Mar 2011 17:29:08 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Hashes are good, but not good enough.
Message-Id: <slrnin4p6k.jb9.hjp-usenet2@hrunkner.hjp.at>

On 2011-02-21 01:07, Ilya Zakharevich <nospam-abuse@ilyaz.org> wrote:
> On 2011-02-18, Peter J. Holzer <hjp-usenet2@hjp.at> wrote:
>>> There is no problem on deletion.
>>
>> A deletion may force you...
>
> Now I see how much I goofed here...  Thanks!
>
>>> (And I was considered "DAWG vs *folded* tries"
>>> - those where a node contains a substring, not a char.)
>
>> I found several references to "folded tries", but all of them were about
>> storing unicode strings and they stored only part of a character per
>> node. Anyway, whatever optimization you use, you can also use it for a
>> DAWG, so I don't see the difference.
>
> Won't work for DAWGs: E.g., in a folded trie for /usr/bin/words, after

ITYM /usr(/share)/dict/words.

> I reach prefix "a" "b" "d" "i", all I need to store is "cate".  I do
> not expect you get any significant number of such chains of letters in
> the corresponding DAWG.

I don't see why you couldn't do the same thing in a DAWG. In addition,
you can reuse the suffix /cate(|d|s)$/ for the prefixes "adjudi",
"advo", "allo", "authenti", etc.


>>>> If you use it to store 
>>>> data without common suffixes it will degenerate into a trie, so it should
>>>> never be worse than a trie, IMHO. 
>>>
>>> DAWG may be MUCH worse than a folded trie - even if you fold a DAWG.
>>
>> You didn't show why my assumption that DAWG will degenerate into a trie
>> in the worst case is false.
>
> I just do not see why "optimizing" a folded trie into a (folded) DAWG
> would ALWAYS save space.

I didn't say it always saves space, I said it cannot take more space. In
the worst case (no common suffixes) a DAWG just degenerates into a trie.

> At LEAST, this should depend on the size of overhead per node.

This should be the same as the information per node is the same in both
structures (a list of possible continuations).

There are a lot of ways a node in a trie or DAWG can be implemented. 

It can be a fixed size array of pointers (I think this is the "classic"
trie - obviously this is only useful if you have small alphabet, e.g.
the 26 characters of the latin alphabet; having 1114112 pointers per
node for each possible unicode character would be clearly impractical).

It can be an array of pointers assoiciated with a range of characters.

It can be an array of (character, pointer) pairs. This array can be
sorted or unsorted.

It can be a linked list of (character, pointer) pairs (that adds another
pointer, of course).

Where the character is explicitely stored, it can also be a substring. 

Instead of a character you can also use part of a character (e.g. a
single byte of the UTF-8 encoding).

Instead of a pointer you can use an offset into an array.  The offset
can also be relative.


As should be obvious the size per node can vary enormously depending on
the chosen representation. It also depends on the dictionary (because
some representations have nodes of variable size).  So which
representation has the smallest memory footprint depends on the
dictionary and the machine architecture. And of course, a small memory
footprint may not be your only criterion. Maybe you are more interested
in fast lookups, or in an easy way to add or delete members.

All the variations I listed above apply to tries and DAWGs alike. It is
possible that a specific space optimization works better for tries than
for DAWGs, but that isn't obvious at all and I'd need to see a specific
example to believe it.


> Storing M strings of length L would take about LM
> bytes of memory (imaging these strings as tails in a folded trie).
> Now convert them into a DAWG - you get N nodes of size S.  Obviously,
> for large enough S, NS is much more than LM.

You were comparing tries to DAWGs, not strings to DAWGs. So you must
compare N(t) * S(t) (number of nodes in the trie times average size of
each node) with N(d) * S(d) (number of nodes in the DAWG times average
size of the node). S(t) and S(d) should be roughly the same (they
contain the same information), and N(d) should not be greater than N(t)
(if it is you can always reorganize the DAWG into a trie).


> My guts feeling is that even with reasonable implemementations, QUITE
> OFTEN one would have NS > ML.

Possible, but that applies equally to tries as to DAWGs.

	hp


------------------------------

Date: Sat, 5 Mar 2011 18:25:10 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Hashes are good, but not good enough.
Message-Id: <slrnin4sfn.jb9.hjp-usenet2@hrunkner.hjp.at>

On 2011-02-22 15:02, Ted Zlatanov <tzz@lifelogs.com> wrote:
> On Fri, 18 Feb 2011 21:18:47 +0100 "Peter J. Holzer" <hjp-usenet2@hjp.at> wrote: 
> PJH> On 2011-02-14 15:58, Ted Zlatanov <tzz@lifelogs.com> wrote:
>>> Hundreds of millions of hash keys will take up a lot of memory, probably
>>> unnecessarily.
>
> PJH> If it's not necessary then don't use it. We are of course talking about
> PJH> the case where a hash is the appropriate data structure.
>
> This is a circular argument: of course if it's the appropriate data
> structure, it's the appropriate data structure.

Not quite. I say that when a hash is the appropriate data structure in
general, it very often is the appropriate data structure for a large
number of elements, because a hash scales well (lookup, insertion and
deletion are O(1)).

It becomes impractical only when the size exceeds the size of available
RAM (but note that many databases also use hash-based indexes, so hashes
are also usable for on-disk data, you just have to organize your data a
bit differently).


> My point was that
> taking this approach is impractical generally.

And my point is that hashes are *generally* well suited for large data
structures because of their O(1) behaviour for all operations. Most
other data structures scale less well.

>
>>> Iterating through them will be painful even unsorted (as
>>> with any large amount of data, it's the management that's hard, not
>>> accumulating it).
>
> PJH> Hashes are generally used for fast random lookup, not for iterating. 
>
> Sure, once it's loaded, it's fast.  But the startup time is atrocious

The startup time is O(n). For most other data structures it's worse (for
trees it's O(n * log n), for example). The insertion time for each
element may be a bit long (on my test system, inserting a hash member
with a short key takes about 2µs - 10 times as long as inserting an
array member), but that's the same for small hashes as for large hashes,
so it's not an argument against large hashes specifically.

Of course loading a hash with 100 million members into memory and then
doing a single lookup would be silly. But again that's true for any data
structure. The time to build the data structure must amortize itself
over the lifetime of the structure. So a huge hash makes only sense if
the time to build it is small compared to the time saved by doing fast
lookups.  Adding 100 million records into a database takes even longer
(a *lot* longer), but if it lives long enough it may eventually amortize
itself.


> and you can only do lookups.

You can also walk the whole hash in hash order. If you have different
requirements, don't use a hash. But again that's mostly independent of
the hash size.

> Also, do you need deletions, and if so, are you happy with Perl's
> memory management?

I've griped about Perl's memory management often enough that you
probably know that I'm not happy with it :-). But it doesn't actually
treat large hashes badly. Deleting a single member is still O(1) - the
reference counting GC is good in this case, I think a mark-sweep based
one would be worse. What's bad is freeing a large hash at once - it's
equivalent to deleting each member in turn. But again, that's the same
for any data structure which consists of a large number of objects in
perl.


>>> I would *at least* partition the keys with a multi-level hash.
>
> PJH> That takes even more memory, possibly a lot more (depends where you
> PJH> split the key), and it doesn't improve locality. It may help if you only
> PJH> need to access data from one partition, but then why are you putting the
> PJH> other data into memory at all? It also helps for sorting, but you may
> PJH> not have to sort your data.
>
> In other words, it may help and it may not, and we don't know if we
> don't know the specific use case.  Right.

You forgot the that it may also hurt a lot. So the advice to always
partition large hashes without looking at the specific use case may
backfire badly.


>>> So I stand by "don't put hundreds of millions of keys
>>> in a hash" whether you're using Perl or Java or C or C++.  It may be
>>> possible, but it's not practical.

So you don't know the use case but you give general advice? Right.


> PJH> It is practical as long as it fits into memory. One case where I could
> PJH> have used a hash of 50+ million elements (but I could't at the time,
> PJH> since that was on a 32 bit system so it wouldn't fit into memory) was a
> PJH> comparison between to datasets: Read the first one into a hash, then
> PJH> step through the second one record by record and look up the data from
> PJH> the first. Total time: roughly the time to read the two files
> PJH> sequentially. What I did instead was sort both datasets and them compare
> PJH> the sorted files. Works, but was a lot slower (but still faster then
> PJH> letting Oracle do the comparison).
>
> This is similar to the case where I ended up with Sybase because the
> Perl-based approach didn't scale.

That doesn't have much to do with the scalability of Perl hashes (within
RAM limits). You delegated a task to a program written for that task in
C instead of writing it in Perl and discovered that it was faster at it
than your Perl program. Big surprise. (BTW, did you look at the
execution plan? I wouldn't be surprised if it said "hash join" at some
place or other).


> it would have been a fast and sufficient approach to just use a hash,
> but against Sybase I was able to parallelize the queries and do specific
> comparisons on the interesting fields instead of the whole record.  It
> cut the run time significantly to do it against an RDBMS.  As I said, it
> was faster to just use a hash at first, but it quickly became an
> unmanageable hack.  That was my experience.

The possibility of parallelization may be an advantage, but that's a
weakness of Perl generally, not specific to large hashes. 

Limiting the comparisons to interesting fields should be possible with
hashes also (in fact I would strongly advise to load only the
interesting data into memory to reduce memory consumption).

Other than that I suspect that you just ran out of RAM. Perl is a memory
hog and adds a large overhead. But again, that's perl's weakness in
general and not specific to hashes (much less large hashes).


> PJH> You claimed that having huge hashes is slow (it isn't) and always a
> PJH> mistake (it isn't).
>
> OK, let's say it's fast for lookups in some cases and is a passable
> solution in some very rare cases.  Is that a fair summary?

No. There are use cases where hashes are appropriate and some where they
are not. This depends on a lot of factors but the the hash size is of
relatively small importance. In fact, hashes scale better than almost
any other data structure.

I strongly disagree with "small hashes good, large hashes bad". 

	hp



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3308
***************************************

home	help	back	first	fref	pref	prev	next	nref	lref	last	post
[32044] in Perl-Users-Digest

Perl-Users Digest, Issue: 3308 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)Sat Mar 5 14:09:45 2011

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Mar 5 14:09:45 2011