[27914] in Perl-Users-Digest
Perl-Users Digest, Issue: 9278 Volume: 10
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Jun 10 11:05:53 2006
Date: Sat, 10 Jun 2006 08:05:04 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Sat, 10 Jun 2006 Volume: 10 Number: 9278
Today's topics:
best method to perform operations on word lists <massion@gmx.de>
Re: best method to perform operations on word lists <David.Squire@no.spam.from.here.au>
Re: best method to perform operations on word lists <rvtol+news@isolution.nl>
Re: best method to perform operations on word lists <massion@gmx.de>
Re: best method to perform operations on word lists <bart@nijlen.com>
Re: best method to perform operations on word lists <massion@gmx.de>
Re: best method to perform operations on word lists <David.Squire@no.spam.from.here.au>
Re: best method to perform operations on word lists <massion@gmx.de>
Re: best method to perform operations on word lists <David.Squire@no.spam.from.here.au>
Re: best method to perform operations on word lists <bart@nijlen.com>
Re: best method to perform operations on word lists <rvtol+news@isolution.nl>
Re: CGI for collecting phone numbers <mritty@gmail.com>
Re: CGI for collecting phone numbers <noreply@gunnar.cc>
Re: GIFS not working properly in JavaScript PopUps <David.Squire@no.spam.from.here.au>
Re: Merging potentially undefined hashes <benmorrow@tiscali.co.uk>
Re: Merging potentially undefined hashes <mritty@gmail.com>
Re: Still Error when connecting to database using Win32 <madan.narra@gmail.com>
Re: threads on XP-- system() works, backtic & popen dos <zentara@highstream.net>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: 10 Jun 2006 00:10:28 -0700
From: "Francois Massion" <massion@gmx.de>
Subject: best method to perform operations on word lists
Message-Id: <1149923428.067468.167940@f6g2000cwb.googlegroups.com>
Hi folks,
I am rather bad at perl and would like some advice on the best
methodology to do the following:
I have a list of approx 20,000 terms extracted from a database. The
list is sorted alphabetically. The entries look like this:
überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise
I want to eliminate the variants of a basic word. In the example above
I want to end up with:
-überzeugt
-überzogen
-üblich
-üblicherweise
I have thought of the following:
(i) I read the list in a hash made of an index and the term
1 ==> überzeugt
2 ==> überzeugt,
etc.
(ii) I compare each term with its followers
(iii) if the following condition is not met, I delete the entry
(key+value) with "delete"
$term is a substring of the next term AND
the length difference is, say, below 3 (to avoid deleting
"üblicherweise", which is a different term)
I am not sure it is the right methodology. I don't much like the
idea of artificially creating the index list (1 ==> Term1).
I wonder if I should work with references but it is sort of a blackbox
to me.
Any comments are appreciated.
Francois
------------------------------
Date: Sat, 10 Jun 2006 08:36:41 +0100
From: David Squire <David.Squire@no.spam.from.here.au>
Subject: Re: best method to perform operations on word lists
Message-Id: <e6dsqa$eo2$1@news.ox.ac.uk>
Francois Massion wrote:
> Hi folks,
>
> I am rather bad at perl and would like some advice on the best
> methodology to do the following:
>
> I have a list of approx 20,000 terms extracted from a database. The
> list is sorted alphabetically. The entries look like this:
>
> überzeugt
> überzeugt,
> überzogen
> überzogen,
> überzogen.
> üblich
> übliche
> üblichen
> üblicherweise
>
> I want to eliminate the variants of a basic word. In the example above
> I want to end up with:
> -überzeugt
> -überzogen
> -üblich
> -üblicherweise
[snip]
At first I thought you just wanted to remove trailing punctuation, which
is easy, but then I saw that you want to treat üblich, übliche, üblichen
as equivalent. This is hard to do properly.
In information retrieval, this is known as stemming. In linguistics it
is sometimes called lemmatization. To do this well requires a model of
the language in question. There are two basic approaches: algorithmic
stemming, which is approximate (e.g. the famous Porter stemmer for
English), and look-up tables, that map all word variants to the
appropriate stem.
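
The two approaches can be contrasted in a few lines of Perl. This is only a rough sketch: the table entries, the suffix list and the names stem_by_lookup and stem_by_rule are hypothetical, and the rule-based part is a naive suffix stripper, not the Porter algorithm.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Look-up approach: a table maps each known variant to its stem.
# The entries here are hypothetical sample data.
my %stem_of = (
    'übliche'  => 'üblich',
    'üblichen' => 'üblich',
    'Häuser'   => 'Haus',
);

sub stem_by_lookup {
    my ($word, $table) = @_;
    return exists $table->{$word} ? $table->{$word} : $word;
}

# Algorithmic approach: strip the longest listed suffix, as long as at
# least three characters remain. Approximate by design.
sub stem_by_rule {
    my ($word, @suffixes) = @_;
    for my $suffix (sort { length($b) <=> length($a) } @suffixes) {
        my $keep = length($word) - length($suffix);
        next if $keep < 3 || length($suffix) == 0;
        return substr($word, 0, $keep) if substr($word, $keep) eq $suffix;
    }
    return $word;
}

for my $word ('üblichen', 'Häuser') {
    printf "%-10s lookup: %-8s rule: %s\n",
        $word,
        stem_by_lookup($word, \%stem_of),
        stem_by_rule($word, qw(e en er));
}
```

Both functions map "üblichen" to "üblich", but for "Häuser" the rule produces "Häus" while the table returns "Haus". A plain suffix rule cannot undo a vowel change inside the word.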
I know where to point you for some English stemmers, but not for German.
> I have thought of the following:
>
> (i) I read the list in a hash made of an index and the term
> 1 ==> überzeugt
> 2 ==> überzeugt,
> etc.
>
> (ii) I compare each term with its followers
>
> (iii) if the following condition is not met, I delete the entry
> (key+value) with "delete"
>
> $term ist a substring of next term AND
> the length difference is, say, below 3 (to avoid deleting
> "üblicherweise" which is a different term)
Well, this is the beginning of a stemming algorithm, but I foresee
problems. What about plurals created by changing vowel sounds in the
middle of a word with an umlaut?
If you go to cpan and search for stemmer, you will find quite a few
modules for stemming, including some that do German, but I do not know
about their internals/quality.
Regards,
DS
------------------------------
Date: Sat, 10 Jun 2006 09:43:09 +0200
From: "Dr.Ruud" <rvtol+news@isolution.nl>
Subject: Re: best method to perform operations on word lists
Message-Id: <e6e497.1es.1@news.isolution.nl>
Francois Massion schreef:
> I have a list of approx 20,000 terms extracted from a database. The
> list is sorted alphabetically. The entries look like this:
>
> überzeugt
> überzeugt,
> überzogen
> überzogen,
> überzogen.
> üblich
> übliche
> üblichen
> üblicherweise
You can first clean it up by removing the punctuation at the end of the
line, and then pipe it through uniq:
perl -ple 's/[.,]$//' infile | uniq > infile-1
> I want to eliminate the variants of a basic word. In the example above
> I want to end up with:
> -überzeugt
> -überzogen
> -üblich
> -üblicherweise
You are bound to lose more than you win.
If you are not in a hurry and have plenty of memory, you can slurp the
whole file in, and then do
1 while s/ \n (.+) \n \1 (?:e|en|t) \n /\n$1\n/x ;
but a while-loop that remembers the previous line is far more efficient.
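
A while-loop of that kind might look roughly like the sketch below. The sub name drop_variants is hypothetical, the suffix list (e, en, t) is taken from the regex above, and the sample is treated as raw bytes (no "use utf8"), which is enough for exact prefix comparison.

```perl
use strict;
use warnings;

# Read terms line by line from a filehandle and keep a term only if it is
# not the previously kept term, optionally followed by e, en or t.
sub drop_variants {
    my ($fh) = @_;
    my ($prev, @kept);
    while (my $term = <$fh>) {
        chomp $term;
        $term =~ s/[.,]$//;            # strip trailing punctuation
        next if $term eq '';
        next if defined($prev)
             && $term =~ /^\Q$prev\E(?:e|en|t)?$/;   # duplicate or variant
        push @kept, $term;
        $prev = $term;                 # remember the last kept term
    }
    return @kept;
}

# Demo: the sample list, read through an in-memory filehandle.
my $sample = <<'END';
überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise
END

open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
print "$_\n" for drop_variants($fh);
close $fh;
```

On the sample this keeps überzeugt, überzogen, üblich and üblicherweise. Only one previous term is held in memory, so the whole file never has to be slurped.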
--
Affijn, Ruud
"Gewoon is een tijger."
------------------------------
Date: 10 Jun 2006 01:05:33 -0700
From: "Francois Massion" <massion@gmx.de>
Subject: Re: best method to perform operations on word lists
Message-Id: <1149926733.336892.247090@i40g2000cwc.googlegroups.com>
Hi David,
Dag Mijnheer/Mevrouw Ruud,
That was a quick reply! Stemming and lemmatization are a known concept,
but I want to make this exercise language independent (the project I am
working on involves French, German and Polish as a matter of fact).
Therefore I cannot start to cut known characters from the end of a
word, it would be too time-consuming and unsafe.
I'll proceed first with the method described above which is probably
slow and not the most elegant but with some luck it'll work. I'll
report on the result...
Francois
------------------------------
Date: 10 Jun 2006 01:35:31 -0700
From: "Bart Van der Donck" <bart@nijlen.com>
Subject: Re: best method to perform operations on word lists
Message-Id: <1149928531.371750.159100@c74g2000cwc.googlegroups.com>
Francois Massion wrote:
> That was a quick reply! Stemming and lemmatization are a known concept,
> but I want to make this exercise language independant (the project I am
> working on involves French, German and Polish as a matter of fact).
> Therefore I cannot start to cut known characters from the end of a
> word, it would be too time-consuming and unsafe.
>
> I'll proceed first with the method described above which is probably
> slow and not the most elegant but with some luck it'll work. I'll
> report on the result...
I'm afraid there is no definable algorithm that you can use. What about
the following sample entries (Dutch words):
bi<->bis
de<->den
re<->ree
do<->doen
keer<->keren
etc...
I think you would need access to some dictionary files anyhow.
--
Bart
------------------------------
Date: 10 Jun 2006 01:57:38 -0700
From: "Francois Massion" <massion@gmx.de>
Subject: Re: best method to perform operations on word lists
Message-Id: <1149929858.330738.241780@h76g2000cwa.googlegroups.com>
Thanks Bart,
Well, the issue is really a matter of pragmatism. If I do the work
manually or with some VBA macros it will take ages. The situation I am
trying to address is not so uncommon to people working on glossary
issues. Therefore I am trying to find a language-independent solution
which works, say, for 90% of the words. It won't work in situations
with irregular plurals like the one you mention (or e.g. French for
"Work/works": "Travail / travaux", German for "House/Houses":
"Haus/Häuser") or with character swaps in suffixes but at least it
would reduce substantially the number of cases to deal with.
I can define the length of a suffix with something like this:
for ($suffix=0 ; $suffix <= 1 ; $suffix++) {
or as a length difference between 2 words
I can also find out what is the root and the suffix of terms with
something like this:
$wordend = substr ($term,-$suffix);
$startposition = rindex ($term,$wordend); # position of suffix from the end
$root = substr ($term,0,$startposition);
But for the moment I am struggling getting the value of one term and
the next one in order to compare them...Hope Dies Last !
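
One possible way to get at a term and its neighbour is to keep the sorted list in a plain array and walk it by index, so no artificial hash keys are needed. The sketch below is only an illustration under assumptions: the names common_prefix_length and collapse_sorted are hypothetical, the list must already be sorted, and a term is dropped when the last kept term is its prefix and at most $max_suffix characters follow.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Number of leading characters two strings have in common.
sub common_prefix_length {
    my ($left, $right) = @_;
    my $max = length($left) < length($right) ? length($left) : length($right);
    my $i = 0;
    $i++ while $i < $max && substr($left, $i, 1) eq substr($right, $i, 1);
    return $i;
}

# Keep a term unless the last kept term is its prefix and the remainder
# (the "suffix") is at most $max_suffix characters long.
sub collapse_sorted {
    my ($max_suffix, @terms) = @_;
    my @kept;
    for my $i (0 .. $#terms) {
        my $term = $terms[$i];
        if (@kept) {
            my $root  = $kept[-1];
            my $extra = length($term) - length($root);
            next if $extra >= 0
                 && $extra <= $max_suffix
                 && common_prefix_length($root, $term) == length($root);
        }
        push @kept, $term;
    }
    return @kept;
}

my @terms = ('überzeugt', 'überzeugt,', 'überzogen', 'überzogen,', 'überzogen.',
             'üblich', 'übliche', 'üblichen', 'üblicherweise');
print "$_\n" for collapse_sorted(2, @terms);
```

With a maximum suffix length of 2 the sample list shrinks to überzeugt, überzogen, üblich and üblicherweise. Trailing punctuation simply counts as a one-character suffix here, and unrelated short words that share a prefix will be merged too.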
Francois
------------------------------
Date: Sat, 10 Jun 2006 10:05:44 +0100
From: David Squire <David.Squire@no.spam.from.here.au>
Subject: Re: best method to perform operations on word lists
Message-Id: <e6e218$gcd$1@news.ox.ac.uk>
Francois Massion wrote:
> Thanks Bart,
>
> Well, the issue is really a matter of pragmatism. If I do the work
> manually or with some VBA macros it will take ages. The situation I am
> trying to address is not so uncommon to people working on glossary
> issues. Therefore I am trying to find a language-independant solution
> which works, say, for 90% of the words. It won't work in situations
> with irregular plurals like the one you mention (or e.g. French for
> "Work/works": "Travail / travaux", German for "House/Houses":
> "Haus/Häuser") or with character swaps in suffixes but at least it
> would reduce substantially the number of cases to deal with.
>
> I can define the length of a suffix with something like this:
> for ($suffix=0 ; $suffix <= 1 ; $suffix++) {
> or as a length difference between 2 words
>
> I can also find out what is the root and the suffix of terms with
> something like this:
> $wordend = substr ($term,-$suffix);
> $startposition = rindex ($term,$wordend); # position of suffix from the end
> $root = substr ($term,0,$startposition);
> But for the moment I am struggling getting the value of one term and
> the next one in order to compare them...Hope Dies Last !
Are your three languages mixed together, or do they occur in separate
files? If separate, even if the files are not tagged with the language
it should be possible to estimate it quickly with spell-check dictionary
look-ups, and then to use the appropriate language-specific stemmer.
Dictionaries, spell-checkers and stemmers for many languages are
available from CPAN.
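
As a rough sketch of the look-up idea, the language of a file could be guessed by counting how many of its terms occur in each dictionary. Everything here is hypothetical: the sub name guess_language, the tiny word lists, and the choice to break ties by language code. A real run would load full word lists instead.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Return the language whose dictionary contains the most of the given terms.
# $terms is an array ref, $dictionaries maps a language code to a hash ref
# whose keys are lower-case words. Ties are broken alphabetically.
sub guess_language {
    my ($terms, $dictionaries) = @_;
    my %hits;
    for my $lang (keys %$dictionaries) {
        my $words = $dictionaries->{$lang};
        $hits{$lang} = grep { exists $words->{ lc($_) } } @$terms;
    }
    my ($best) = sort { $hits{$b} <=> $hits{$a} or $a cmp $b } keys %hits;
    return $best;
}

# Hypothetical miniature dictionaries, for illustration only.
my %dictionaries = (
    de => { map { ($_ => 1) } qw(üblich haus und) },
    fr => { map { ($_ => 1) } qw(travail maison et) },
    pl => { map { ($_ => 1) } qw(dom praca i) },
);

my @sample = ('üblich', 'Haus', 'travail');
print guess_language(\@sample, \%dictionaries), "\n";
```

Once the language is known, the matching language-specific stemmer can be chosen for that file.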
DS
------------------------------
Date: 10 Jun 2006 02:33:57 -0700
From: "Francois Massion" <massion@gmx.de>
Subject: Re: best method to perform operations on word lists
Message-Id: <1149932037.439764.110340@h76g2000cwa.googlegroups.com>
Hi David,
The languages are not mixed. There are good stemmers for the most
"important" languages. A while ago I found some of them on the
following site:
http://snowball.tartarus.org/
But it is probably too complicated for my purpose.
Francois
------------------------------
Date: Sat, 10 Jun 2006 10:44:55 +0100
From: David Squire <David.Squire@no.spam.from.here.au>
Subject: Re: best method to perform operations on word lists
Message-Id: <e6e4ao$h5m$1@news.ox.ac.uk>
Francois Massion wrote:
> Hi David,
Thanks for the salutation, but it is not appropriate here. You are not
replying to me, you are posting to the whole comp.lang.perl.misc community.
Also, please quote context and retain attribution(s) (thus obviating the
need for a salutation) when replying, as has everyone who has replied to
your posts. See the posting guidelines for this group, which are posted
here twice a week.
>
> The languages are not mixed. There are good stemmers for the most
> "important" languages. A while ago I found some of them on the
> following site:
>
> http://snowball.tartarus.org/
>
> But it is probably too complicated for my purpose.
Well, there is a Perl module providing an interface to the snowball
stemmers available from CPAN, so it should not be at all complicated to
use them, and a much better long-term solution.
DS
------------------------------
Date: 10 Jun 2006 03:18:13 -0700
From: "Bart Van der Donck" <bart@nijlen.com>
Subject: Re: best method to perform operations on word lists
Message-Id: <1149934693.864208.20330@h76g2000cwa.googlegroups.com>
Francois Massion wrote:
> Well, the issue is really a matter of pragmatism. If I do the work
> manually or with some VBA macros it will take ages. The situation I am
> trying to address is not so uncommon to people working on glossary
> issues. Therefore I am trying to find a language-independant solution
> which works, say, for 90% of the words.
If you can afford such an error margin, here is a brute approach:
#!perl
use strict; use warnings;
my $list =
"überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise";
my @terms = split /\n/, $list;
my $prev = 'nonesuch584685542256RANOM58544';
s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;
@terms = grep($_ ne $prev && ($prev = $_), sort @terms);
print $_."\n" for @terms;
FWIW,
--
Bart
------------------------------
Date: Sat, 10 Jun 2006 13:55:47 +0200
From: "Dr.Ruud" <rvtol+news@isolution.nl>
Subject: Re: best method to perform operations on word lists
Message-Id: <e6ej7a.1do.1@news.isolution.nl>
Bart Van der Donck schreef:
> s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;
Weaker alternative:
(1) s/ (?:e|en)? [.,]? $//x for @terms ;
or even
(2) s/ (?:en?)? [.,]? $//x for @terms ;
It's weaker, because the regex matches the empty string as well.
Proper alternative:
(3) s/(?: en? [.,]? | [.,] )$//x for @terms ;
because it matches at least /e$/ or /[.,]$/.
(untested)
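
A small harness along these lines could be used to try the alternatives. It is a sketch only: the helper name strip_with and the sample terms are assumptions. It compares alternative (1) with alternative (3) by counting how often s/// reports a match.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Apply $regex as a substitution to copies of the terms. Returns the number
# of terms on which s/// reported a match, followed by the resulting terms.
sub strip_with {
    my ($regex, @terms) = @_;
    my $hits = 0;
    for my $term (@terms) {
        $hits++ if $term =~ s/$regex//;
    }
    return ($hits, @terms);
}

my $weak   = qr/ (?:e|en)? [.,]? $/x;          # alternative (1)
my $proper = qr/ (?: en? [.,]? | [.,] ) $/x;   # alternative (3)

my @sample = ('überzeugt', 'überzeugt,', 'überzogen.',
              'üblich', 'übliche', 'üblichen');

for my $case ([weak => $weak], [proper => $proper]) {
    my ($name, $regex) = @$case;
    my ($hits, @out) = strip_with($regex, @sample);
    print "$name: $hits matches -> @out\n";
}
```

Both patterns leave the same strings behind, but (1) reports a match on all six sample terms because it also matches the empty string at the end, while (3) reports a match only on the four terms it actually shortens. That matters as soon as the return value of s/// is used to decide whether a term changed.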
--
Affijn, Ruud
"Gewoon is een tijger."
------------------------------
Date: 10 Jun 2006 06:18:13 -0700
From: "Paul Lalli" <mritty@gmail.com>
Subject: Re: CGI for collecting phone numbers
Message-Id: <1149945493.344947.219690@f6g2000cwb.googlegroups.com>
ferreira@unm.edu wrote:
> Please don't think that I am asking some one to do this for me!
Excuse me? Since you neglected to do so, allow me to quote your
original message.
>>>I would like to have my visitors sign up for text message updates. What
>>>I need is to have the form submit to a .txt file on the server. All I
>>>need is a list of phone numbers. Does anyone have the cgi or pl code
>>>that would accomplish this task?
That's not asking people to do this for you?
> I have
> spent hours trying to figure out why this simple script won't work.
And yet you chose not to post the script that you spent hours on, and
asked people to write one for you instead. That's asking people to do
it for you.
> This is what I have:
>
> #!/usr/bin/perl
You're not using strict. You're not using warnings. You're not using
the CGI module. WHY?
> use CGI::Carp qw(fatalsToBrowser);
> my $file = '/users/web/king/web/sms.txt';
> open (FILE, ">>" . $file) or die "cannot open file for appending: $!";
> flock (FILE, 2) or die "cannot lock file exclusively: $!";
> print "Content-type: text/html\n\n";
So if either of those system calls fails, you're going to die before
printing out the HTML header? Does that really make sense to you?
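
For illustration, the append step could be wrapped in a small sub that runs under strict and warnings, with a lexical filehandle, three-argument open and the named lock constant from Fcntl. This is a minimal sketch, not the original poster's script: the name append_line is hypothetical, and the CGI usage is shown only in comments.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Append one line to a file while holding an exclusive lock.
sub append_line {
    my ($file, $line) = @_;
    open my $fh, '>>', $file or die "cannot open $file for appending: $!";
    flock $fh, LOCK_EX       or die "cannot lock $file exclusively: $!";
    print {$fh} "$line\n"    or die "cannot write to $file: $!";
    close $fh                or die "cannot close $file: $!";
    return 1;
}

# In a CGI script the header could be sent before any call that may die:
#   print "Content-type: text/html\n\n";
#   append_line('/users/web/king/web/sms.txt', $phone);
```

Running such a sub from the command line against a scratch file is a quick way to separate file-permission problems from web-server configuration problems.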
> print FILE $phone . "\n";
> close (FILE) or die "cannot close file: $!";
>
> And a sms.txt file
> All of the files are in the same place on the server
>
> The is the error log message:
> access:
> c-68-35-76-89.hsd1.nm.comcast.net - - [09/Jun/2006:17:28:30 +0000]
> "POST /sms.pl HTTP/1.1" 500 -
>
> error:
> [Fri Jun 9 17:28:30 2006] [error] [client 68.35.76.89] Premature end
> of script headers: /users/web/king/web/sms.pl
Now why do you think you might be getting that? Could it be that your
script is die()ing before printing out the headers?
What happened when you ran this script on the command line, as I asked
you to do in the previous message? Or did you ignore that part of my
post too?
> [Fri Jun 9 17:28:30 2006] [error] [client 68.35.76.89] File does not
> exist: /users/web/king/web/error500.html
This indicates that your webserver is misconfigured. It knows it
should be printing out the HTTP 500 message to the browser, but the
file containing that message does not exist. This, of course, has
nothing to do with Perl.
> I have also tried changing the .pl to .cgi but the result is the same.
What made you think that would solve the problem? It's possible your
webserver is configured to only execute Perl scripts ending in .cgi,
but if it is, that has nothing to do with Perl, and there was nothing
in the error log suggesting that to be the case.
Paul Lalli
------------------------------
Date: Sat, 10 Jun 2006 16:42:30 +0200
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: CGI for collecting phone numbers
Message-Id: <4f042qF1gp5veU1@individual.net>
Paul Lalli wrote:
> ferreira@unm.edu wrote:
>>
>>use CGI::Carp qw(fatalsToBrowser);
>>my $file = '/users/web/king/web/sms.txt';
>>open (FILE, ">>" . $file) or die "cannot open file for appending: $!";
>>flock (FILE, 2) or die "cannot lock file exclusively: $!";
>>print "Content-type: text/html\n\n";
>
> So if either of those system calls fails, you're going to die before
> printing out the HTML header? Does that really make sense to you?
To me it makes sense, since "fatalsToBrowser" is enabled.
>>error:
>>[Fri Jun 9 17:28:30 2006] [error] [client 68.35.76.89] Premature end
>>of script headers: /users/web/king/web/sms.pl
>
> Now why do you think you might be getting that? Could it be that your
> script is die()ing before printing out the headers?
No.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
------------------------------
Date: Sat, 10 Jun 2006 08:14:02 +0100
From: David Squire <David.Squire@no.spam.from.here.au>
Subject: Re: GIFS not working properly in JavaScript PopUps
Message-Id: <e6drfq$e81$1@news.ox.ac.uk>
Charleees wrote:
> Hi all,
>
> I have a button and when i click tha button it redirects to another
> page.....
>
> I have also added a java script for the button that makes a popup..
[much javascript snipped]
> how could i solve this problem... any alternate way..
>
> its urgent..please help...
You could start by asking your question in the right place, such as
comp.lang.javascript.
DS
------------------------------
Date: Fri, 9 Jun 2006 20:54:34 +0100
From: Ben Morrow <benmorrow@tiscali.co.uk>
Subject: Re: Merging potentially undefined hashes
Message-Id: <q80pl3-h82.ln1@osiris.mauzo.dyndns.org>
Quoth Ben Morrow <benmorrow@tiscali.co.uk>:
>
> Do you mean a 'Can't use and undefined value as a HASH reference' error?
> Please be precise.
Hoist. Petard.
I meant s/and/an/, of course :)
Ben
--
'Deserve [death]? I daresay he did. Many live that deserve death. And some die
that deserve life. Can you give it to them? Then do not be too eager to deal
out death in judgement. For even the very wise cannot see all ends.'
benmorrow@tiscali.co.uk
------------------------------
Date: 10 Jun 2006 06:27:07 -0700
From: "Paul Lalli" <mritty@gmail.com>
Subject: Re: Merging potentially undefined hashes
Message-Id: <1149946027.527801.323080@m38g2000cwc.googlegroups.com>
Derek Basch wrote:
> Ummmm... I was asking people for their ideas on what the proper idiom
> to handle this problem was. The code I wrote was simply to convey a
> concept because I knew that I was going about it the wrong way. Why
> test code that you know is wrong? Especially, when it is only
> pseudocode and you clearly state that it is such.
Uh. You have a different definition of either "clearly" or
"pseudo-code" than I do. Allow me to re-quote you:
> > > Here is my attempt at an idiom to handle this. Haven't tested it:
That, to you, is the equivalent of "this is pseudo-code"? Sounds to me
like it was an attempt to write real code, but you couldn't be bothered
to type "perl -c" and wanted us to do it for you.
> I don't understand why some people on this list are so angry?
Because your post amounted to "I'm lazy and don't feel like checking
this, so please do it for me."
> I wrote
> questions just like this on the python list for years and never once
> got yelled at.
Explain to me where you "got yelled at". Hell, there weren't even any
exclamation marks in my post. Or are you suggesting that anyone who
tells you you're doing something wrong is "yelling" at you?
> I searched long and hard for previous answers to this
> problem and found none. Only then did I post the question.
Uh-huh, and? I don't recall my post deriding you for not searching
long enough. I asked you why you would write what you believed to be a
way to solve the problem, but didn't even bother testing it. Even if
you knew it was "wrong", you don't think the compiler and run-time
results of such code might lead you to the right answer? Sounds like
you have remarkably little faith in your coding and debugging skills.
> Anyways, thanks to everyone for the answers.
You're welcome.
Paul Lalli
------------------------------
Date: 10 Jun 2006 03:29:45 -0700
From: "madan" <madan.narra@gmail.com>
Subject: Re: Still Error when connecting to database using Win32::ODBC
Message-Id: <1149935385.354599.171450@u72g2000cwu.googlegroups.com>
hi sir..
thanks for ur reply...my problem is solved..i changed the path to my
perl interpreter from
"#! /usr/bin/prel " to "#! c:\perl\bin\perl";
that's the actual problem...
now it is working fine...
------------------------------
Date: Sat, 10 Jun 2006 13:11:36 GMT
From: zentara <zentara@highstream.net>
Subject: Re: threads on XP-- system() works, backtic & popen dosen't...
Message-Id: <jqgl82hn9ds6h1i30gdf7c8cdv1mg2umib@4ax.com>
On 9 Jun 2006 11:48:32 -0700, "kdd21@hotmail.com" <kdd21@hotmail.com>
wrote:
>On windows however, only $EXEMODE = 0 works. The others both hang on
>the external.
>Any idea what's happening here? Windows pseudo-fork anomalies perhaps?
> Any other alternatives?
In case you don't find an answer here:
You might want to ask this on http://perlmonks.org
An active monk named BrowserUk is especially good at threads on Win32.
--
I'm not really a human, but I play one on earth.
http://zentara.net/japh.html
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc. For subscription or unsubscription requests, send
#the single line:
#
# subscribe perl-users
#or:
# unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.
NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V10 Issue 9278
***************************************