[23815] in Perl-Users-Digest


Perl-Users Digest, Issue: 6018 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Jan 29 20:31:29 2004

Date: Thu, 29 Jan 2004 17:25:50 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 29 Jan 2004     Volume: 10 Number: 6018

Today's topics:
        Language detection module.. <ar@e-mail.si>
    Re: Language detection module.. <usenet@morrow.me.uk>
    Re: Language detection module.. (J.B. Moreno)
    Re: Language detection module.. <tadmc@augustmail.com>
    Re: Language detection module.. <flavell@ph.gla.ac.uk>
    Re: Language detection module.. <martn.quensel@forumsyd.se>
    Re: Language detection module.. (J.B. Moreno)
    Re: Language detection module.. (Anno Siegel)
    Re: Language detection module.. <ewilhelm@somethinglike.sbcglobalDOTnet>
    Re: Language detection module.. <riechert@pobox.com>
    Re: Language detection module.. (Malcolm Dew-Jones)
    Re: Language detection module.. (Malcolm Dew-Jones)
    Re: Language detection module.. (Malcolm Dew-Jones)
    Re: Language detection module.. <Joe.Smith@inwap.com>
    Re: Language detection module.. <krahnj@acm.org>
    Re: Language detection module.. (Anno Siegel)
    Re: Language detection module.. (Anno Siegel)
    Re: Language detection module.. (Anno Siegel)
    Re: Language detection module.. <jurgenex@hotmail.com>
    Re: Language detection module.. <tadmc@augustmail.com>
    Re: Language detection module.. <ar@e-mail.si>
    Re: Language detection module.. <pkent77tea@yahoo.com.tea>
    Re: Language detection module.. <neil.shadrach@corryn.com>
    Re: Language detection module.. <pkent77tea@yahoo.com.tea>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Wed, 21 Jan 2004 17:21:24 +0100
From: AR <ar@e-mail.si>
Subject: Language detection module..
Message-Id: <pan.2004.01.21.16.21.24.395867@e-mail.si>

Is there any module/script that can detect the language of a text with
100% accuracy, for example English, German, French, ... (European
languages, or at least English)?


------------------------------

Date: Wed, 21 Jan 2004 19:05:08 +0000 (UTC)
From: Ben Morrow <usenet@morrow.me.uk>
Subject: Re: Language detection module..
Message-Id: <bumih4$9ei$3@wisteria.csv.warwick.ac.uk>


AR <ar@e-mail.si> wrote:
> Does exist any module/script that can 100% detect text language..
> for example English, German, French, ... (European languages, at least
> English...)

100%? No. What language is this string: "hotel"?

Ben

-- 
  Joy and Woe are woven fine,
  A Clothing for the Soul divine       William Blake
  Under every grief and pine          'Auguries of Innocence'
  Runs a joy with silken twine.                                ben@morrow.me.uk


------------------------------

Date: Wed, 21 Jan 2004 15:04:44 -0500
From: planB@newsreaders.com (J.B. Moreno)
Subject: Re: Language detection module..
Message-Id: <1g7x2v2.m0x8omuxqhmcN%planB@newsreaders.com>

Ben Morrow <usenet@morrow.me.uk> wrote:

> AR <ar@e-mail.si> wrote:
> > Does exist any module/script that can 100% detect text language..
> > for example English, German, French, ... (European languages, at least
> > English...)
> 
> 100%? No. What language is this string: "hotel"?

Swahili?

-- 
JBM
"Everything is futile." -- Marvin of Borg


------------------------------

Date: Wed, 21 Jan 2004 14:57:42 -0600
From: Tad McClellan <tadmc@augustmail.com>
Subject: Re: Language detection module..
Message-Id: <slrnc0tpu6.7m0.tadmc@magna.augustmail.com>

Ben Morrow <usenet@morrow.me.uk> wrote:
> 
> AR <ar@e-mail.si> wrote:
>> Does exist any module/script that can 100% detect text language..
>> for example English, German, French, ... (European languages, at least
>> English...)
> 
> 100%? No. What language is this string: "hotel"?


Military?  (the letter "H") ?


-- 
    Tad McClellan                          SGML consulting
    tadmc@augustmail.com                   Perl programming
    Fort Worth, Texas


------------------------------

Date: Wed, 21 Jan 2004 20:18:46 +0000
From: "Alan J. Flavell" <flavell@ph.gla.ac.uk>
Subject: Re: Language detection module..
Message-Id: <Pine.LNX.4.53.0401212016010.15815@ppepc56.ph.gla.ac.uk>

On Wed, 21 Jan 2004, Ben Morrow wrote:

> 100%? No. What language is this string: "hotel"?

Yeah, ask a German speaker what language this is: "Gift".


------------------------------

Date: Wed, 21 Jan 2004 23:30:31 +0100
From: Martin Quensel <martn.quensel@forumsyd.se>
Subject: Re: Language detection module..
Message-Id: <J3DPb.891$O41.14441@amstwist00>

J.B. Moreno wrote:
> Ben Morrow <usenet@morrow.me.uk> wrote:
> 
> 
>>AR <ar@e-mail.si> wrote:
>>
>>>Does exist any module/script that can 100% detect text language..
>>>for example English, German, French, ... (European languages, at least
>>>English...)
>>
>>100%? No. What language is this string: "hotel"?
> 
> 
> Swahili?
Start by adding all words from all the dictionaries in the world in a file.
Then using statistics you get the most likely one.

or why not just?

#!/usr/bin/perl
use strict;
use warnings;

# One answer that is always right, and never useful.
print "String is in any known language or some constructed language ",
      "such as Esperanto, Volapuk, Glosa, Loglan, or even Klingon.\n";


Now that would almost certainly cover 95% of all the languages (I missed
adding the Tolkien languages, but I leave that as a programming exercise).
But I'm not sure if it's 100% future-proof. The "any known language"
could be interpreted as "known" to the person running the program.

Best Regards
Martin Quensel



------------------------------

Date: Thu, 22 Jan 2004 01:35:12 -0500
From: planB@newsreaders.com (J.B. Moreno)
Subject: Re: Language detection module..
Message-Id: <1g7xrak.1fx475bohw08zN%planB@newsreaders.com>

Martin Quensel <martn.quensel@forumsyd.se> wrote:

> J.B. Moreno wrote:
> > Ben Morrow <usenet@morrow.me.uk> wrote:
> > 
> >>AR <ar@e-mail.si> wrote:
> >>
> >>>Does exist any module/script that can 100% detect text language..
> >>>for example English, German, French, ... (European languages, at least
> >>>English...)
> >>
> >>100%? No. What language is this string: "hotel"?
> > 
> > Swahili?
>
> Start by adding all words from all the dictionaries in the world in a
> file. Then using statistics you get the most likely one.

The phrases "100%" and "most likely one" aren't equivalent.

And look up the James Nicoll quote on the purity of the English
language.

-- 
JBM
"Everything is futile." -- Marvin of Borg


------------------------------

Date: 22 Jan 2004 11:49:54 GMT
From: anno4000@lublin.zrz.tu-berlin.de (Anno Siegel)
Subject: Re: Language detection module..
Message-Id: <buodd2$6t8$6@mamenchi.zrz.TU-Berlin.DE>

Ben Morrow  <usenet@morrow.me.uk> wrote in comp.lang.perl.misc:
> 
> AR <ar@e-mail.si> wrote:
> > Does exist any module/script that can 100% detect text language..
> > for example English, German, French, ... (European languages, at least
> > English...)
> 
> 100%? No. What language is this string: "hotel"?

Well, one-word-samples are hard, and 100% is unattainable.

Entirely off topic, I have recently heard of an approach to text
classification (with an eye to language recognition) that I found
interesting.

Use a Ziv-Lempel-like method to compress your sample.  Then concatenate
it with texts of similar lengths taken from known languages and compress
again.  If the compression rate is similar or better than that of the
original text, the appended text is similar to the original one.  If
the compression deteriorates, the texts are dissimilar.

The source (some idle chat on IRC, sorry) said that this works for
rather small samples of fewer than a hundred words.  I have always been
meaning to play with it, but haven't got around.
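
A minimal sketch of the co-compression idea described above, offered as an
illustration rather than as the method from the IRC discussion. It assumes
the core Compress::Zlib module (deflate, an LZ77 relative) stands in for
"a Ziv-Lempel-like method", and it uses a common variant: instead of
comparing compression rates, it measures how many extra compressed bytes
the sample costs when appended to each reference text. The %reference
texts and the names extra_bytes and guess_language are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;    # core module; compress() produces a zlib/deflate stream

# Hypothetical reference texts; real ones should be far longer.
my %reference = (
    english => 'the quick brown fox jumps over the lazy dog and the cat sat on the mat ' x 3,
    german  => 'der schnelle braune fuchs springt ueber den faulen hund und die katze ' x 3,
);

# Extra compressed bytes needed for $sample once $ref has already been seen.
sub extra_bytes {
    my ($ref, $sample) = @_;
    return length(compress($ref . $sample)) - length(compress($ref));
}

# Pick the reference language for which the sample is cheapest to append.
sub guess_language {
    my ($sample) = @_;
    my %cost = map { $_ => extra_bytes($reference{$_}, $sample) } keys %reference;
    my ($best) = sort { $cost{$a} <=> $cost{$b} || $a cmp $b } keys %cost;
    return $best;
}

print guess_language('the dog and the cat sat on the mat'), "\n";
```

The tuning problem remains: how large a difference in extra bytes is
needed before the answer can be trusted depends on sample and reference
sizes, and this sketch does not address that.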

Anno


------------------------------

Date: Thu, 22 Jan 2004 12:01:01 GMT
From: Eric Wilhelm <ewilhelm@somethinglike.sbcglobalDOTnet>
Subject: Re: Language detection module..
Message-Id: <pan.2004.01.22.06.03.46.693408.5383@somethinglike.sbcglobalDOTnet>

On Thu, 22 Jan 2004 00:35:12 -0600, J.B. Moreno wrote:

>> Start by adding all words from all the dictionaries in the world in a
>> file. Then using statistics you get the most likely one.
> 
> The phrases "100%" and "most likely one" aren't equivalent

This is true, but in the real world, something which gives a 99.9%
probability is about as good as we are going to get.  No sense in
refusing to use a circle simply because it is impossible to make a
perfect one.

IMO, 99.9% might be a low estimate even if the program takes a naive
approach.  If the dictionaries include "adopted" phrases (e.g. Latin
expressions which are often cited in English, etc.) and some kind of
best-fit spell check is used, you might push the probabilities into
99.99%.  Now feed some works of literature from each language into a
phrase-counter and use phrases as well, and you might find that a text of
100 words or more can be predicted correctly 99.9999% of the time.

If that isn't good enough (missing 1 of 10^6), you're going to be working
on the thing for so long that half of the languages in use at its
conception are out of use before you reach the prototype.
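
A rough sketch of the naive dictionary approach discussed above, to make
the idea concrete. The tiny inline word lists and the name coverage are
hypothetical; real dictionaries would be loaded from files, and nothing
here supports any particular accuracy figure. It only computes, per
language, the fraction of words in a text that the dictionary knows.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny, hypothetical word lists; real dictionaries would be loaded from files.
my %dictionary = (
    english => { map { $_ => 1 } qw(the hotel is near station and a taxi waits outside) },
    german  => { map { $_ => 1 } qw(das hotel ist am bahnhof und ein taxi wartet vor der tuer) },
);

# Fraction of the words in $text found in each language's dictionary.
sub coverage {
    my ($text) = @_;
    my @words = grep { length } split /[^a-z]+/, lc $text;
    my %score;
    for my $lang (keys %dictionary) {
        my $known = grep { $dictionary{$lang}{$_} } @words;
        $score{$lang} = @words ? $known / @words : 0;
    }
    return \%score;
}

for my $text ('the hotel is near the station', 'hotel') {
    my $score = coverage($text);
    printf "%-30s %s\n", $text,
        join ', ', map { sprintf '%s=%.2f', $_, $score->{$_} } sort keys %$score;
}
```

The second sample, the single word "hotel", scores the same in both
languages, which is the one-word problem raised at the start of the thread.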

--Eric


------------------------------

Date: 22 Jan 2004 21:44:29 +0100
From: Andreas Marcel Riechert <riechert@pobox.com>
Subject: Re: Language detection module..
Message-Id: <m3zncf67de.fsf@tairou.japanologie.kultur.uni-tuebingen.de>

anno4000@lublin.zrz.tu-berlin.de (Anno Siegel) writes:

> Entirely off topic, I have recently heard of an approach to text
> classification (with an eye to language recognition) that I found
> interesting.
> 
> Use a Ziv-Lempel-like method to compress your sample.  Then concatenate
> it with texts of similar lengths taken from known languages and compress
> again.  If the compression rate is similar or better than that of the
> original text, the appended text is similar to the original one.  If
> the compression deteriorates, the texts are dissimilar.
> 
> The source (some idle chat on IRC, sorry) said that this works for
> rather small samples of fewer than a hundred words.  I have always been
> meaning to play with it, but haven't got around.


Probably:
Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto
"Language Trees and Zipping". in: Physical Review Letters, January 28, 2002.
(http://link.aps.org/abstract/PRL/v88/e048702)

http://arxiv.org/abs/cond-mat/0108530   (The Paper)
http://arxiv.org/abs/cond-mat/0202383   (One very critical answer)

Cheers,

Andreas





------------------------------

Date: 22 Jan 2004 10:43:36 -0800
From: yf110@vtn1.victoria.tc.ca (Malcolm Dew-Jones)
Subject: Re: Language detection module..
Message-Id: <401019d8@news.victoria.tc.ca>

Ben Morrow (usenet@morrow.me.uk) wrote:

: AR <ar@e-mail.si> wrote:
: > Does exist any module/script that can 100% detect text language..
: > for example English, German, French, ... (European languages, at least
: > English...)

: 100%? No. What language is this string: "hotel"?

I can say with 100% certainty that that is an English word.


------------------------------

Date: 22 Jan 2004 10:45:10 -0800
From: yf110@vtn1.victoria.tc.ca (Malcolm Dew-Jones)
Subject: Re: Language detection module..
Message-Id: <40101a36@news.victoria.tc.ca>

J.B. Moreno (planB@newsreaders.com) wrote:
: Martin Quensel <martn.quensel@forumsyd.se> wrote:

: > J.B. Moreno wrote:
: > > Ben Morrow <usenet@morrow.me.uk> wrote:
: > > 
: > >>AR <ar@e-mail.si> wrote:
: > >>
: > >>>Does exist any module/script that can 100% detect text language..
: > >>>for example English, German, French, ... (European languages, at least
: > >>>English...)
: > >>
: > >>100%? No. What language is this string: "hotel"?
: > > 
: > > Swahili?
: >
: > Start by adding all words from all the dictionaries in the world in a
: > file. Then using statistics you get the most likely one.

: The phrases "100%" and "most likely one" aren't equivalent.

: And look up the James Nicoll quote on the purity of the english
: language.

Every language is 100% pure all the time - they are moving targets defined
by their own use.


------------------------------

Date: 22 Jan 2004 10:53:24 -0800
From: yf110@vtn1.victoria.tc.ca (Malcolm Dew-Jones)
Subject: Re: Language detection module..
Message-Id: <40101c24@news.victoria.tc.ca>

Anno Siegel (anno4000@lublin.zrz.tu-berlin.de) wrote:
: Ben Morrow  <usenet@morrow.me.uk> wrote in comp.lang.perl.misc:
: > 
: > AR <ar@e-mail.si> wrote:
: > > Does exist any module/script that can 100% detect text language..
: > > for example English, German, French, ... (European languages, at least
: > > English...)
: > 
: > 100%? No. What language is this string: "hotel"?

: Well, one-word-samples are hard, and 100% is unattainable.

: Entirely off topic, I have recently heard of an approach to text
: classification (with an eye to language recognition) that I found
: interesting.

: Use a Ziv-Lempel-like method to compress your sample.  Then concatenate
: it with texts of similar lengths taken from known languages and compress
: again.  If the compression rate is similar or better than that of the
: original text, the appended text is similar to the original one.  If
: the compression deteriorates, the texts are dissimilar.

: The source (some idle chat on IRC, sorry) said that this works for
: rather small samples of fewer than a hundred words.  I have always been
: meaning to play with it, but haven't got around.

: Anno

Sounds reasonable, basically it would be testing for similarity of letter 
sequences.

I might also suggest using a Bayesian filter such as ifile or similar.
Such filters try to file each message into the correct one of many folders.
(I've never used ifile, just read about it.)

You would provide samples in the languages you anticipate and then let the
filter categorize each document.
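
To illustrate the "provide samples, then categorize" idea, here is a small
naive Bayes sketch over character trigrams. It is not ifile and does not
use ifile's interface; the training samples, the function names, the
add-one smoothing and the equal priors are all assumptions made for the
sake of a short example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical training samples; real ones should be much larger.
my %samples = (
    english => 'this is the house where the children were playing with their friends',
    german  => 'das ist das haus in dem die kinder mit ihren freunden gespielt haben',
);

# Split a text into overlapping three-character sequences.
sub trigrams {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/\s+/ /g;
    return map { substr($text, $_, 3) } 0 .. length($text) - 3;
}

# Count trigrams per language.
my (%count, %total, %vocab);
for my $lang (keys %samples) {
    for my $t (trigrams($samples{$lang})) {
        $count{$lang}{$t}++;
        $total{$lang}++;
        $vocab{$t} = 1;
    }
}
my $vocab_size = keys %vocab;

# Sum of log-probabilities with add-one smoothing; equal priors assumed.
sub log_score {
    my ($lang, $text) = @_;
    my $score = 0;
    for my $t (trigrams($text)) {
        $score += log((($count{$lang}{$t} || 0) + 1) / ($total{$lang} + $vocab_size));
    }
    return $score;
}

# Return the language with the highest score.
sub classify {
    my ($text) = @_;
    my ($best) = sort { log_score($b, $text) <=> log_score($a, $text) || $a cmp $b }
                 keys %samples;
    return $best;
}

print classify('the children were with their friends'), "\n";
```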

$0.02


------------------------------

Date: Thu, 22 Jan 2004 19:41:56 GMT
From: Joe Smith <Joe.Smith@inwap.com>
Subject: Re: Language detection module..
Message-Id: <8IVPb.123505$I06.964886@attbi_s01>

Malcolm Dew-Jones wrote:

> Ben Morrow (usenet@morrow.me.uk) wrote:
> 
> : AR <ar@e-mail.si> wrote:
> : > Does exist any module/script that can 100% detect text language..
> : > for example English, German, French, ... (European languages, at least
> : > English...)
> 
> : 100%? No. What language is this string: "hotel"?
> 
> I can say with 100% certainty that that is an english word.

Taxi!


------------------------------

Date: Thu, 22 Jan 2004 20:59:48 GMT
From: "John W. Krahn" <krahnj@acm.org>
Subject: Re: Language detection module..
Message-Id: <401039A7.661FC7@acm.org>

Joe Smith wrote:
> 
> Malcolm Dew-Jones wrote:
> 
> > Ben Morrow (usenet@morrow.me.uk) wrote:
> >
> > : AR <ar@e-mail.si> wrote:
> > : > Does exist any module/script that can 100% detect text language..
> > : > for example English, German, French, ... (European languages, at least
> > : > English...)
> >
> > : 100%? No. What language is this string: "hotel"?
> >
> > I can say with 100% certainty that that is an english word.
> 
> Taxi!

Beer!


John
-- 
use Perl;
program
fulfillment


------------------------------

Date: 22 Jan 2004 21:02:35 GMT
From: anno4000@lublin.zrz.tu-berlin.de (Anno Siegel)
Subject: Re: Language detection module..
Message-Id: <bupdpb$2a0$1@mamenchi.zrz.TU-Berlin.DE>

Malcolm Dew-Jones <yf110@vtn1.victoria.tc.ca> wrote in comp.lang.perl.misc:
> Anno Siegel (anno4000@lublin.zrz.tu-berlin.de) wrote:
> : Ben Morrow  <usenet@morrow.me.uk> wrote in comp.lang.perl.misc:
> : > 
> : > AR <ar@e-mail.si> wrote:
> : > > Does exist any module/script that can 100% detect text language..
> : > > for example English, German, French, ... (European languages, at least
> : > > English...)
> : > 
> : > 100%? No. What language is this string: "hotel"?
> 
> : Well, one-word-samples are hard, and 100% is unattainable.
> 
> : Entirely off topic, I have recently heard of an approach to text
> : classification (with an eye to language recognition) that I found
> : interesting.
> 
> : Use a Ziv-Lempel-like method to compress your sample.  Then concatenate
> : it with texts of similar lengths taken from known languages and compress
> : again.  If the compression rate is similar or better than that of the
> : original text, the appended text is similar to the original one.  If
> : the compression deteriorates, the texts are dissimilar.
> 
> : The source (some idle chat on IRC, sorry) said that this works for
> : rather small samples of fewer than a hundred words.  I have always been
> : meaning to play with it, but haven't got around.
> 
> : Anno
> 
> Sounds reasonable, basically it would be testing for similarity of letter 
> sequences.

That's the idea.  Trouble is, it would take quite some research into how
co-compressibility actually varies with text samples to tune the parameters
you need to make decisions.  That's what's stopping me from "playing"
with it; I'll leave that for someone with a diploma in statistics (or
the need for one).

> I might also suggest using a bayesian filter such as ifile or similar.  

Yes, it came up as an alternative in a discussion of Bayesian spam filters.

> They try to file each message into the correct one (of many) folder. (I've
> nevr used ifile, just read of it.)
> 
> You would provide samples in the languages you anticipate and then let the
> filter categorize each document.

All these treat the problem of language identification as a case of
general text classification.  Specific methods may apply, such as testing
for frequent key words in each language.  For some (inflecting) languages,
an analysis of word endings may be highly distinctive.  And so on, since
we're off topic.  If these fail, one might decide not to decide, or fall
back on more expensive text classification, a la above.
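
The key-word test with a "decide not to decide" fallback could look
roughly like this. The word lists are short, hypothetical examples, the
name guess_by_keywords is made up for illustration, and the tokenizer
only understands unaccented a-z, so treat it as a sketch of the control
flow rather than a usable classifier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Short, hypothetical lists of very frequent function words.
my %keywords = (
    english => [qw(the and of to is that it was)],
    german  => [qw(der die das und ist nicht ein zu)],
    french  => [qw(le la les et est un une des)],
);

# Return the language whose key words occur most often, or nothing
# ("decide not to decide") when no key word matches or the top two tie.
sub guess_by_keywords {
    my ($text) = @_;
    my %seen;
    $seen{$_}++ for split /[^a-z]+/, lc $text;

    my %hits;
    for my $lang (keys %keywords) {
        $hits{$lang} = 0;
        $hits{$lang} += $seen{$_} || 0 for @{ $keywords{$lang} };
    }
    my ($first, $second) = sort { $hits{$b} <=> $hits{$a} || $a cmp $b } keys %hits;
    return if $hits{$first} == 0 || $hits{$first} == $hits{$second};
    return $first;
}

for my $text ('The cat is on the mat and it is happy', 'xyzzy') {
    my $guess = guess_by_keywords($text);
    print defined $guess ? $guess : 'undecided', "\n";
}
```

An undecided result is the point at which one would fall back on the
more expensive general text classification.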

Anno


------------------------------

Date: 22 Jan 2004 21:25:57 GMT
From: anno4000@lublin.zrz.tu-berlin.de (Anno Siegel)
Subject: Re: Language detection module..
Message-Id: <bupf55$2vs$1@mamenchi.zrz.TU-Berlin.DE>

Andreas Marcel Riechert  <riechert@pobox.com> wrote in comp.lang.perl.misc:
> anno4000@lublin.zrz.tu-berlin.de (Anno Siegel) writes:
> 
> > Entirely off topic, I have recently heard of an approach to text
> > classification (with an eye to language recognition) that I found
> > interesting.
> > 
> > Use a Ziv-Lempel-like method to compress your sample.  Then concatenate
> > it with texts of similar lengths taken from known languages and compress
> > again.  If the compression rate is similar or better than that of the
> > original text, the appended text is similar to the original one.  If
> > the compression deteriorates, the texts are dissimilar.
> > 
> > The source (some idle chat on IRC, sorry) said that this works for
> > rather small samples of fewer than a hundred words.  I have always been
> > meaning to play with it, but haven't got around.
> 
> 
> Probably:
> Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto
> "Language Trees and Zipping". in: Physical Review Letters, January 28, 2002.
> (http://link.aps.org/abstract/PRL/v88/e048702)
> 
> http://arxiv.org/abs/cond-mat/0108530   (The Paper)
> http://arxiv.org/abs/cond-mat/0202383   (One very critical answer)

I believe those names were mentioned, thanks for the reference.

I haven't read the papers yet, but I notice that part of the reply is
about the article being off topic in Physical Review Letters.  Some
things won't change, no matter what the medium...

Anno


------------------------------

Date: 22 Jan 2004 21:29:51 GMT
From: anno4000@lublin.zrz.tu-berlin.de (Anno Siegel)
Subject: Re: Language detection module..
Message-Id: <bupfcf$2vs$2@mamenchi.zrz.TU-Berlin.DE>

John W. Krahn <krahnj@acm.org> wrote in comp.lang.perl.misc:
> Joe Smith wrote:
> > Malcolm Dew-Jones wrote:
> > > Ben Morrow (usenet@morrow.me.uk) wrote:
> > >
> > > > "hotel"
> > 
> > Taxi!
> 
> Beer!

Now reverse.

Anno.


------------------------------

Date: Fri, 23 Jan 2004 03:27:32 GMT
From: "Jürgen Exner" <jurgenex@hotmail.com>
Subject: Re: Language detection module..
Message-Id: <Ew0Qb.14280$h77.2560@nwrddc02.gnilink.net>

Malcolm Dew-Jones wrote:
> Ben Morrow (usenet@morrow.me.uk) wrote:
>
>> AR <ar@e-mail.si> wrote:
>>> Does exist any module/script that can 100% detect text language..
>>> for example English, German, French, ... (European languages, at
>>> least English...)
>
>> 100%? No. What language is this string: "hotel"?
>
> I can say with 100% certainty that that is an english word.

If you had said "It is a word of the English language", then I would
have concurred.
However, "an English word"? No.

jue




------------------------------

Date: Thu, 22 Jan 2004 19:21:42 -0600
From: Tad McClellan <tadmc@augustmail.com>
Subject: Re: Language detection module..
Message-Id: <slrnc10tp6.bef.tadmc@magna.augustmail.com>

Anno Siegel <anno4000@lublin.zrz.tu-berlin.de> wrote:
> John W. Krahn <krahnj@acm.org> wrote in comp.lang.perl.misc:
>> Joe Smith wrote:
>> > Malcolm Dew-Jones wrote:
>> > > Ben Morrow (usenet@morrow.me.uk) wrote:
>> > >
>> > > > "hotel"
>> > 
>> > Taxi!
>> 
>> Beer!
> 
> Now reverse.


What is going on here?

I've heard of "scalar context".

I've heard of "list context".

What is this "silly context" that seems to have taken over this thread?


-- 
    Tad McClellan                          SGML consulting
    tadmc@augustmail.com                   Perl programming
    Fort Worth, Texas


------------------------------

Date: Fri, 23 Jan 2004 11:25:58 +0100
From: AR <ar@e-mail.si>
Subject: Re: Language detection module..
Message-Id: <pan.2004.01.23.10.25.49.506871@e-mail.si>

I was busy with some system maintenance..

The texts are long, from at least 15 KB up to several megabytes. It
shouldn't be so hard to find out which language a text is in, because it
is long enough. About 70% of the texts are labeled with their language;
the rest are not. Among the remaining 30%, at least 90% are English, some
are German, a few French, a few Italian, and maybe some Dutch, Chinese,
Russian... At the very least I need a tool that will filter out the
English texts as accurately as possible.

So I think that some kind of statistical approach would be fine...

About 100%.. yes.. I know it is impossible...

On Wed, 21 Jan 2004 17:21:24 +0100, AR wrote:

> Does exist any module/script that can 100% detect text language..
> for example English, German, French, ... (European languages, at least
> English...)



------------------------------

Date: Sat, 24 Jan 2004 13:39:20 +0000
From: pkent <pkent77tea@yahoo.com.tea>
Subject: Re: Language detection module..
Message-Id: <pkent77tea-70C3CD.13392024012004@pth-usenet-01.plus.net>

In article <pan.2004.01.23.10.25.49.506871@e-mail.si>,
 AR <ar@e-mail.si> wrote:

> The texts are long, from at least 15 KB up to several megabytes. It
> shouldn't be so hard to find out which language a text is in, because it
> is long enough. About 70% of the texts are labeled with their language;
> the rest are not. Among the remaining 30%, at least 90% are English, some
> are German, a few French, a few Italian, and maybe some Dutch, Chinese,
> Russian... At the very least I need a tool that will filter out the
> English texts as accurately as possible.

[Apologies if I missed a relevant posting but...] How are these texts 
encoded on disk? Are they pure text, are they in Unicode (utf8, utf16, 
whatever), are they in a variety of encodings, depending on what 
language they're representing (e.g. iso8859-1 for French, shift-jis for 
Japanese, Big5(?) for Chinese, etc)... or are they in some structured 
format like HTML?

Also do you know what languages might occur, or might these texts be in
any language? It would help if you've got a list and the text must
be in one of those listed languages.

Personally I'd convert the content and dictionaries to Unicode to start
with.
You might get some mileage from looking at the use of characters within
Unicode blocks first, before the expensive operation of searching for
words in dictionaries. E.g.
If there is very little use of accented characters then it's less likely
to be French or German.
If it uses the w-circumflex character a fair bit it could be Welsh.
The n-tilde might indicate Spanish.
C-cedilla might indicate French.
If it uses mainly characters from the Cyrillic block of Unicode then it
could be Russian, or one of the few languages that use Cyrillic.
If it uses mainly Greek characters then...

You could do this kind of analysis quite cheaply, and then end up with a 
load of scores for each language. In some cases you'd get a clear enough 
score for, say, Greek, so you'd class it as Greek. In other cases you 
might say "OK I think it's X _or_ Y, but I'll have to check the 
dictionaries now to get a better answer".
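
A cheap first pass along these lines might be sketched as follows. It
assumes the text has already been decoded into Perl character strings,
uses Perl's Unicode script properties (\p{Latin}, \p{Cyrillic}, ...) as a
stand-in for "blocks", and checks only three marker characters in their
lowercase forms. The names script_profile and marker_hints and the marker
table are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Marker characters mentioned above, written as code points (lowercase only).
my %marker = (
    "\x{F1}"  => 'Spanish',    # n with tilde
    "\x{E7}"  => 'French',     # c with cedilla
    "\x{175}" => 'Welsh',      # w with circumflex
);

# Fraction of the letters in $text that belong to each of a few scripts.
sub script_profile {
    my ($text) = @_;
    my $letters = () = $text =~ /\p{L}/g;
    return {} unless $letters;
    my %n;
    $n{Latin}    = () = $text =~ /\p{Latin}/g;
    $n{Cyrillic} = () = $text =~ /\p{Cyrillic}/g;
    $n{Greek}    = () = $text =~ /\p{Greek}/g;
    $n{Han}      = () = $text =~ /\p{Han}/g;
    return { map { $_ => $n{$_} / $letters } keys %n };
}

# Count occurrences of each marker character, keyed by the hinted language.
sub marker_hints {
    my ($text) = @_;
    my %hint;
    for my $ch (keys %marker) {
        my $count = () = $text =~ /\Q$ch\E/g;
        $hint{ $marker{$ch} } = $count if $count;
    }
    return \%hint;
}

# "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}" is a six-letter Cyrillic word.
my $profile = script_profile("\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}");
printf "%s: %.2f\n", $_, $profile->{$_} for sort keys %$profile;
```

A clear majority for Cyrillic or Greek could settle the question early;
a mostly Latin profile would still need the marker hints or a dictionary
check to separate the candidates.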

P

-- 
pkent 77 at yahoo dot, er... what's the last bit, oh yes, com
Remove the tea to reply


------------------------------

Date: Mon, 26 Jan 2004 08:58:07 +0000 (UTC)
From: Neil Shadrach <neil.shadrach@corryn.com>
Subject: Re: Language detection module..
Message-Id: <bv2kqv$m91$1@visp.bt.co.uk>

pkent:
[ other wise advice snipped ]
> If it uses the w-circumflex character a fair bit it could be Welsh

Though at this point in time the accent is often omitted because it has not
been well-supported in the past. I'm sure the same is true for other
languages (e.g. Poles using "l" in place of "l with stroke").
As far as Welsh is concerned the presence of the particles "yn" and "yr" is a
pretty good discriminator. You'd be pretty hard pressed to find a Welsh text
without them and they don't appear to occur in other languages [1]. Do other
languages have good discriminators? I presume groups of related languages
would be a problem.



[1] I'm basing this on Google rather than academic rigour :)
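
The particle test for Welsh can be sketched in a few lines. The function
names and the 0.02 threshold are placeholders chosen for illustration,
not measured values, and the tokenizer only treats unaccented a-z and the
apostrophe as word characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Share of the words in $text that are the particles "yn" or "yr".
sub particle_ratio {
    my ($text) = @_;
    my @words = grep { length } split /[^a-z']+/, lc $text;
    return 0 unless @words;
    my $hits = grep { $_ eq 'yn' || $_ eq 'yr' } @words;
    return $hits / @words;
}

# The 0.02 threshold is an arbitrary placeholder, not a measured value.
sub looks_welsh {
    my ($text) = @_;
    return particle_ratio($text) >= 0.02 ? 1 : 0;
}

for my $text ("Mae'r plant yn chwarae yn yr ardd", 'The children play in the garden') {
    print looks_welsh($text) ? "maybe Welsh\n" : "probably not Welsh\n";
}
```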



------------------------------

Date: Tue, 27 Jan 2004 22:52:15 +0000
From: pkent <pkent77tea@yahoo.com.tea>
Subject: Re: Language detection module..
Message-Id: <pkent77tea-3C9E46.22521527012004@pth-usenet-01.plus.net>

In article <bv2kqv$m91$1@visp.bt.co.uk>,
 Neil Shadrach <neil.shadrach@corryn.com> wrote:
> pkent:
> [ other wise advice snipped ]
> > If it uses the w-circumflex character a fair bit it could be Welsh
> 
> Though at this point in time the accent is often omitted because it has not 
> been well-supported in the past.

Also very true. I've noticed other chars that used to be better
supported on Windows than Macs - possibly thorn and/or eth, but I
haven't got the code charts to hand to check. Anyway, roll on more and
more use of Unicode, and possibly more and more characters _in_ Unicode
:-)

> [1] I'm basing this on Google rather than academic rigour :)

Aahh, that's good enough :-)

P

-- 
pkent 77 at yahoo dot, er... what's the last bit, oh yes, com
Remove the tea to reply


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc.  For subscription or unsubscription requests, send
#the single line:
#
#	subscribe perl-users
#or:
#	unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.  

NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice. 

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V10 Issue 6018
***************************************
