[32762] in Perl-Users-Digest

Perl-Users Digest, Issue: 4026 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Sep 5 16:09:39 2013

Date: Thu, 5 Sep 2013 13:09:04 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 5 Sep 2013     Volume: 11 Number: 4026

Today's topics:
    Re: Cannot have locale word characters in a variable <hjp-usenet3@hjp.at>
    Re: Cannot have locale word characters in a variable <hjp-usenet3@hjp.at>
    Re: File deduplication <gravitalsun@hotmail.foo>
    Re: File deduplication <mvdwege@mail.com>
    Re: File deduplication <gravitalsun@hotmail.foo>
    Re: File deduplication <rweikusat@mobileactivedefense.com>
    Re: File deduplication <gravitalsun@hotmail.foo>
    Re: whats the hardest part of making a web app? <nospam@lisse.NA>
    Re: whats the hardest part of making a web app? <vilain@NOspamcop.net>
    Re: whats the hardest part of making a web app? <cartercc@gmail.com>
    Re: whats the hardest part of making a web app? <xhoster@gmail.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Wed, 4 Sep 2013 12:43:50 +0200
From: "Peter J. Holzer" <hjp-usenet3@hjp.at>
Subject: Re: Cannot have locale word characters in a variable
Message-Id: <slrnl2e3n6.5fm.hjp-usenet3@hrunkner.hjp.at>

On 2013-09-02 23:50, Ben Morrow <ben@morrow.me.uk> wrote:
> Quoth klaus03 <klaus03@gmail.com>:
>> On 02/09/2013 22:40, Ben Morrow wrote:
>> >
>> > If you want de_DE rather than Unicode \w semantics
>> 
>> de_DE semantics is probably not needed, the usual Unicode semantics of 
>> \w should by default include all German umlauts + other special German 
>> characters.
>
> Yes. However, Unicode will include (for example) non-Latin letter
> characters as letters, which I would not expect a German locale to do.

Your expectation would be wrong on Linux (at least with glibc 2.11-2.13).
I've tested various locales and AFAICS all of them except C and POSIX
use the Unicode semantics for wide characters.

Here's a test program in C:

---8<------8<------8<------8<------8<------8<------8<------8<------8<---
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

int main(void) {
    /* Use the locale from the environment (LANG, LC_ALL, ...). */
    setlocale(LC_ALL, "");
    /* Test characters: '0', 'A', U+00D8, U+03B1, U+304B, U+65E0. */
    wint_t c[] = {
        0x30, 0x41, 0xD8, 0x03B1, 0x304B, 0x65e0
    };
    int n = sizeof(c) / sizeof(c[0]);
    for (int i = 0; i < n; i++) {
        printf("%04x", (unsigned)c[i]);
        printf(" %s", iswalpha(c[i]) ? "alpha" : "-----");
        printf(" %s", iswdigit(c[i]) ? "digit" : "-----");
        printf("\n");
    }
    return 0;
}
---8<------8<------8<------8<------8<------8<------8<------8<------8<---
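
For comparison, a rough Perl counterpart to the C program above. This is
only a sketch: the sub name classify is made up for illustration, it
assumes Perl 5.14 or later for the /u flag, and it forces Unicode rules
instead of consulting the locale. So it shows what Unicode semantics say
about the same code points, not what any particular locale does. Note
also that [[:digit:]] under Unicode rules covers decimal digits of other
scripts too, which is broader than iswdigit.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): classify one code point
# under Unicode rules, forced with the /u regex flag (Perl 5.14+).
sub classify {
    my ($cp) = @_;
    my $ch = chr($cp);
    return sprintf("%04x %s %s",
        $cp,
        ($ch =~ /^[[:alpha:]]$/u ? "alpha" : "-----"),
        ($ch =~ /^[[:digit:]]$/u ? "digit" : "-----"));
}

# Same code points as in the C program.
print classify($_), "\n" for 0x30, 0x41, 0xD8, 0x03B1, 0x304B, 0x65e0;
```
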

>> > and you need to call setlocale and either 'use locale' or use the
>> > /l regex flag.
>> 
>> That's not necessarily needed:
>> 
>> My understanding is that Unicode takes precedence over any locales.
>> 
>> However, you might have to call setlocale, 'use locale' or /l regex 
>> flag, but only if you don't have Unicode semantics (that is: only if 
>> your perl is older than 5.014)
>
> Your understanding is out of date. Up until 5.12, whether regexes
> matched with Unicode, ISO8859-1 or locale semantics was rather
> unpredictable, though in general if either the pattern or the string was
> Unicode then Unicode rules were used. In 5.12 the unpredictability was
> fixed, so Unicode semantics were (IIRC) always used. 

Really always or only if the unicode_strings feature is used? I would
expect such a change to break rather a lot of code.
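
One way to probe this, as a minimal sketch and not a statement about
every Perl version: match the same one-character string with and without
the feature in scope. The sub names are invented for the example, and it
assumes Perl 5.14 or later on an ASCII platform, with no 'use locale' and
no explicit /u, /l or /a flag in effect. Under those assumptions I would
expect it to print "default: 0, unicode_strings: 1".

```perl
use strict;
use warnings;

# A one-character string below U+0100, so Perl keeps it as a plain byte
# string (no internal UTF-8 flag).
my $char = "\xD8";    # U+00D8 LATIN CAPITAL LETTER O WITH STROKE

# Hypothetical helper: match without the feature in scope, so the old
# native-byte rules apply to this string.
sub word_char_default {
    my ($s) = @_;
    return $s =~ /^\w$/ ? 1 : 0;
}

# Hypothetical helper: the same match, compiled in the lexical scope of
# the unicode_strings feature, so Unicode rules apply.
sub word_char_unicode_strings {
    use feature 'unicode_strings';
    my ($s) = @_;
    return $s =~ /^\w$/ ? 1 : 0;
}

printf "default: %d, unicode_strings: %d\n",
    word_char_default($char), word_char_unicode_strings($char);
```
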

	hp


-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) |                    | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


------------------------------

Date: Wed, 4 Sep 2013 15:14:45 +0200
From: "Peter J. Holzer" <hjp-usenet3@hjp.at>
Subject: Re: Cannot have locale word characters in a variable
Message-Id: <slrnl2eci6.5fm.hjp-usenet3@hrunkner.hjp.at>

On 2013-09-03 05:21, Charles DeRykus <derykus@gmail.com> wrote:
> On 9/2/2013 1:08 PM, Peter J. Holzer wrote:
>> On 2013-09-02 19:45, Charles DeRykus <derykus@gmail.com> wrote:
>>> On 9/2/2013 10:34 AM, fmassion@web.de wrote:
>>>> My test file:
>>>>
>>>> höheneinstellbar 1234
>>>> bedienbar 5678
>>>> 1111 Müller
>>>> größer 8765
>>
>> Which character encoding does the file use?
>>
>>
>>>> My script:
>>>> #!/usr/bin/perl -w
>>>> use locale;
>>>> open(FILE,'test.txt') ;
>>>> @sentence = <FILE>;
>>>> foreach $sentence (@sentence) {
>>>> 	chomp $sentence;
>>>> if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
>>>>      print  "$1\n";
>>>> }}
[...]
> With 'use locale' plus 'binmode(STDOUT,":utf8")',  there is correct 
> output

Since you are using UTF-8 on the terminal I am assuming that your
test.txt is encoded in UTF-8, too (This may or may not be true for the
OP: AFAICS he hasn't answered that question yet).

I don't see how there can be correct output in this case. “use locale”
doesn't affect open, so the file will be read as a byte stream. 

The first line is then "h\303\266heneinstellbar 1234". "\266" isn't a
word character in any locale AFAIK, so the regexp will match
"heneinstellbar 1234", which is wrong.

Even if it did match the whole line, writing the string to a stream with
the utf8 layer results in encoding the already UTF-8-encoded string a
second time, so the result is "h\303\203\302\266heneinstellbar 1234" or
"hÃ¶heneinstellbar 1234", which is also not correct.

(This is for Perl 5.14. Maybe something changed after that, but I doubt
it.)
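
Both effects can be reproduced without a file or a terminal. The
following is a sketch under the same assumption as above (the input is
UTF-8): first_word is a made-up helper that applies the OP's pattern,
and Encode's encode() stands in for what a :utf8 output layer does to a
string that was never decoded.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The first line of test.txt as raw UTF-8 bytes, as read without a layer.
my $bytes = "h\303\266heneinstellbar 1234";

# Hypothetical helper: apply the OP's pattern, return the first capture.
sub first_word {
    my ($s) = @_;
    return $s =~ m/(\w+)(\s)(\d+)/ ? $1 : undef;
}

# Matching the undecoded bytes loses everything up to the "\266" byte.
my $from_bytes = first_word($bytes);

# Decoding first turns the two bytes into one character, U+00F6.
my $from_chars = first_word(decode('UTF-8', $bytes));

# Encoding the undecoded bytes treats each byte as a Latin-1 character
# and encodes it again: this is the double encoding.
my $double = encode('UTF-8', $bytes);

print "without decoding: $from_bytes\n";
print "with decoding:    ", encode('UTF-8', $from_chars), "\n";
printf "double encoded:   %d bytes instead of %d\n",
    length($double), length($bytes);
```
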

> IIUC, doesn't Perl internally store strings as, e.g., Latin-1 and
> seamlessly upgrade to Unicode as needed?

You shouldn't care about how perl stores strings internally.

> It seems clunky then to nail down the input encoding as well

You always[1] need to decode on input to convert from a sequence of
bytes to a sequence of characters. Only for Latin-1 is this an identity
mapping. If you don't specify the encoding, Perl can't know it (it can't
just assume that all files are text files in the current locale's
encoding: they might use a different one or not be text at all).

	hp

[1] Not quite: Sometimes it is better to process text files as a byte
    stream, but that's rare in my experience. As a rule of thumb,
    always decode on input and always encode on output.
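
Applied to the OP's script, that rule of thumb could look roughly like
this. It is a sketch, not the OP's code: first_words is an invented
name, the file is assumed to be UTF-8, and a temporary file stands in
for test.txt so the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a UTF-8 text file and return the first
# capture of the OP's pattern for every line that matches.
sub first_words {
    my ($path) = @_;
    # Decode on input: the layer turns UTF-8 bytes into characters.
    open(my $in, '<:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!";
    my @words;
    while (my $line = <$in>) {
        chomp $line;
        push @words, $1 if $line =~ m/(\w+)(\s)(\d+)/;
    }
    close($in);
    return @words;
}

# Build a stand-in for test.txt (assumed to be UTF-8 encoded).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':encoding(UTF-8)');
print {$tmp} "h\x{F6}heneinstellbar 1234\n", "bedienbar 5678\n",
             "1111 M\x{FC}ller\n", "gr\x{F6}\x{DF}er 8765\n";
close($tmp);

# Encode on output: characters become UTF-8 bytes again, exactly once.
binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for first_words($path);
```

The "1111 Müller" line is skipped because the pattern wants the digits
after the word, not before it.
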

-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) |                    | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


------------------------------

Date: Wed, 04 Sep 2013 02:24:08 +0300
From: George Mpouras <gravitalsun@hotmail.foo>
Subject: Re: File deduplication
Message-Id: <l05r34$15v9$1@news.ntua.gr>


>
> If the OP just drops links to his site, report him for spam. Otherwise I
> suggest you use a kill file.
>

What else can I say after that, "Please donate me 10 boxes" !


------------------------------

Date: Wed, 04 Sep 2013 08:56:42 +0200
From: Mart van de Wege <mvdwege@mail.com>
Subject: Re: File deduplication
Message-Id: <86sixlt6qd.fsf@gaheris.avalon.lan>

George Mpouras <gravitalsun@hotmail.foo> writes:

>
>  I do not "think" 

That much is obvious.

-- 
"We will need a longer wall when the revolution comes."
    --- AJS, quoting an uncertain source.


------------------------------

Date: Wed, 04 Sep 2013 11:59:26 +0300
From: George Mpouras <gravitalsun@hotmail.foo>
Subject: Re: File deduplication
Message-Id: <l06sps$t83$1@news.ntua.gr>

On 3/9/2013 17:36, Jürgen Exner wrote:
> if ($#dirs == -1)
>
> You must be kidding....




# which is the less funny ?


my @array;

print "Array is blank\n" if (
(0  == scalar @array) ||
(0  == @array) ||
(-1 == $#array)
);


------------------------------

Date: Wed, 04 Sep 2013 10:37:30 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: File deduplication
Message-Id: <87zjrtszad.fsf@sapphire.mobileactivedefense.com>

George Mpouras <gravitalsun@hotmail.foo> writes:
> On 3/9/2013 17:36, Jürgen Exner wrote:
>> if ($#dirs == -1)
>>
>> You must be kidding....
>
>
>
>
> # which is the less funny ?
>
> my @array;
>
> print "Array is blank\n" if (
> (0  == scalar @array) ||
> (0  == @array) ||
> (-1 == $#array)
> );

The least funny would be

print "Array is blank" unless @array;

which can also be written as

@array or print "Where did all the flowers go?";

The inverted comparisons are also just bizarre if none of the
operands is an lvalue because then, an accidental assignment will
result in an error either way.
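
To make the point about boolean context concrete, a small sketch (the
sub name is_empty is mine, nothing standard): an array in boolean
context yields its element count, so the test depends only on how many
elements there are, not on whether they are true.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the referenced array has no elements.
# In boolean (scalar) context an array yields its element count.
sub is_empty {
    my ($aref) = @_;
    return !@$aref;
}

# An array holding a single false value is still not empty.
for my $case ([], [0], [undef], [1, 2, 3]) {
    my @array = @$case;
    printf "%d element(s): %s\n", scalar(@array),
        is_empty(\@array) ? "empty" : "not empty";
}
```
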


------------------------------

Date: Wed, 04 Sep 2013 13:28:28 +0300
From: George Mpouras <gravitalsun@hotmail.foo>
Subject: Re: File deduplication
Message-Id: <l0720p$1cn5$1@news.ntua.gr>

>
> @array or print "Where did all the flowers go?";


A bare @array is also faster than scalar @array, interesting:





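use strict;
use warnings;
use Benchmark;

my @array;
# Compare an array in boolean context with an explicit scalar(@array).
my $results = Benchmark::timethese(5_000_000, {
    method1 => sub { @array        ? 1 : 0 },
    method2 => sub { scalar @array ? 1 : 0 },
});
Benchmark::cmpthese($results);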



------------------------------

Date: Wed, 04 Sep 2013 16:50:56 +0200
From: Dr Eberhard Lisse <nospam@lisse.NA>
Subject: Re: whats the hardest part of making a web app?
Message-Id: <522748D0.1080805@lisse.NA>

No, it doesn't :-)-O

The hardest part is to get it to work :-)-O

el


on 2013-08-31 09:25 Nick Wedd said the following:
[...]
> It depends what you want your web app to do.
> 
> Nick
> 



------------------------------

Date: Wed, 04 Sep 2013 11:40:48 -0700
From: Michael Vilain <vilain@NOspamcop.net>
Subject: Re: whats the hardest part of making a web app?
Message-Id: <vilain-C53F91.11404704092013@news.individual.net>

In article <522748D0.1080805@lisse.NA>,
 Dr Eberhard Lisse <nospam@lisse.NA> wrote:

> No, it doesn't :-)-O
> 
> The hardest part is to get it to work :-)-O
> 
> el
> 
> 
> on 2013-08-31 09:25 Nick Wedd said the following:
> [...]
> > It depends what you want your web app to do.
> > 
> > Nick
> > 

Isn't the hardest part getting it to work on Internet Explorer?

http://www.vilain.com/web-design.html

-- 
DeeDee, don't press that button!  DeeDee!  NO!  Dee...
[I filter all Goggle Groups posts, so any reply may be automatically ignored]




------------------------------

Date: Wed, 4 Sep 2013 19:19:52 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: whats the hardest part of making a web app?
Message-Id: <61a505dd-1650-439a-a61e-59a1ae40c438@googlegroups.com>

On Thursday, August 15, 2013 11:42:54 PM UTC-4, johannes falcone wrote:
> wondering?

It's either

(1) building the database, which includes designing the information architecture.
(2) building the user interface, which includes avoiding all the bells and whistles (like Flash and AJAX) that in many cases detract from the app.
(3) building the business rules, the glue that sticks the database to the UI.

Remember, getting a suitable requirements specification takes fifty percent of the time and effort in building any software, and testing the software takes the other fifty percent. If you get the requirements and testing down, the rest is a piece of cake!

CC.


------------------------------

Date: Wed, 04 Sep 2013 20:00:09 -0700
From: Xho Jingleheimerschmidt <xhoster@gmail.com>
Subject: Re: whats the hardest part of making a web app?
Message-Id: <l08s48$4ke$1@dont-email.me>

On 09/04/13 19:19, ccc31807 wrote:
> On Thursday, August 15, 2013 11:42:54 PM UTC-4, johannes falcone
> wrote:
>> wondering?
>
> It's either
>
> (1) building the database, which includes designing the information
> architecture. (2) building the user interface, which includes
> avoiding all the bells and whistles (like Flash and AJAX) that in
> many cases detract from the app. (3) building the business rules, the
> glue that sticks the database to the UI.

In the case of the current originator, I think the hardest part is to 
stop trolling usenet for long enough to actually sit down and do some work.

>
> Remember, getting a suitable requirements specification takes fifty
> percent of the time and effort in building any software, and testing
> the software takes the other fifty percent. If you get the
> requirements and testing down, the rest is a piece of cake!

And the other-other fifty percent is figuring out which of the 7
competing requirement specifications is the correct one.

Xho
-- 
90% of the work takes up 90% of the time.  The other 10% of the work 
takes up the other 90% of the time.


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 4026
***************************************

