[30247] in Perl-Users-Digest

Perl-Users Digest, Issue: 1490 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Mon Apr 28 16:25:08 2008

Date: Mon, 28 Apr 2008 13:24:56 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Mon, 28 Apr 2008     Volume: 11 Number: 1490

Today's topics:
    Re: WWW::Mechanize doesn't always follow_link(text <RedGrittyBrick@SpamWeary.foo>
    Re: WWW::Mechanize doesn't always follow_link(text <szrRE@szromanMO.comVE>
    Re: WWW::Mechanize doesn't always follow_link(text <m@rtij.nl.invlalid>
    Re: WWW::Mechanize doesn't always follow_link(text <szrRE@szromanMO.comVE>
    Re: WWW::Mechanize doesn't always follow_link(text <m@rtij.nl.invlalid>
    Re: WWW::Mechanize doesn't always follow_link(text <spamtrap@dot-app.org>
    Re: WWW::Mechanize doesn't always follow_link(text <spamtrap@dot-app.org>
    Re: WWW::Mechanize doesn't always follow_link(text <RedGrittyBrick@SpamWeary.foo>
    Re: WWW::Mechanize doesn't always follow_link(text <szrRE@szromanMO.comVE>
    Re: WWW::Mechanize doesn't always follow_link(text <john@castleamber.com>
    Re: WWW::Mechanize doesn't always follow_link(text <szrRE@szromanMO.comVE>
    Re: WWW::Mechanize doesn't always follow_link(text <rvtol+news@isolution.nl>
    Re: WWW::Mechanize doesn't always follow_link(text <rvtol+news@isolution.nl>
    Re: WWW::Mechanize doesn't always follow_link(text <m@rtij.nl.invlalid>
    Re: WWW::Mechanize doesn't always follow_link(text <rvtol+news@isolution.nl>
    Re: WWW::Mechanize doesn't always follow_link(text <rvtol+news@isolution.nl>
    Re: WWW::Mechanize doesn't always follow_link(text <rvtol+news@isolution.nl>
    Re: WWW::Mechanize doesn't always follow_link(text <ben@morrow.me.uk>
    Re: WWW::Mechanize doesn't always follow_link(text <rvtol+news@isolution.nl>
    Re: WWW::Mechanize doesn't always follow_link(text <ben@morrow.me.uk>
    Re: WWW::Mechanize doesn't always follow_link(text <m@rtij.nl.invlalid>
    Re: WWW::Mechanize doesn't always follow_link(text <m@rtij.nl.invlalid>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Sat, 26 Apr 2008 19:15:22 +0100
From: RedGrittyBrick <RedGrittyBrick@SpamWeary.foo>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <wdqdnW8qfIam7I7VRVnyvAA@bt.com>

szr wrote:
> 
> He's after a '&nbsp;', which us a non-breaking space, which is ASCII 
> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;' .
> 

s/ASCII/Unicode/

-- 
RGB


------------------------------

Date: Sat, 26 Apr 2008 11:59:10 -0700
From: "szr" <szrRE@szromanMO.comVE>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <fuvu1u0sgi@news4.newsguy.com>

RedGrittyBrick wrote:
> szr wrote:
>>
>> He's after a '&nbsp;', which us a non-breaking space, which is ASCII
>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;' .
>>
>
> s/ASCII/Unicode/

No, it's ASCII. Extended Ascii to be precise.

My ascii chart (an old printed out list I have) lists DEC 225 as 
"Lowercase 'a' with acute accent" and DEC 160 as being reserved or a 
blank (which is used as a non breaking space.)

These links show the same:
http://www.ascii-code.com/
http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml


-- 
szr 




------------------------------

Date: Sat, 26 Apr 2008 22:00:19 +0200
From: Martijn Lievaart <m@rtij.nl.invlalid>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <pan.2008.04.26.20.00.19@rtij.nl.invlalid>

On Sat, 26 Apr 2008 11:59:10 -0700, szr wrote:

> RedGrittyBrick wrote:
>> szr wrote:
>>>
>>> He's after a '&nbsp;', which us a non-breaking space, which is ASCII
>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;' .
>>>
>>>
>> s/ASCII/Unicode/
> 
> No, it's ASCII. Extended Ascii to be precise.

Extended ASCII is a general name for several incompatible extensions to 
ASCII. They are NOT ASCII.

But the above IS Unicode, which is itself also an extension of ASCII, 
BTW.

> 
> My ascii chart (an old printed out list I have) lists DEC 225 as
> "Lowercase 'a' with acute accent" and DEC 160 as being reserved or a
> blank (which is used as a non breaking space.)
> 
> These links show the same:
> http://www.ascii-code.com/

The "extended" ASCII shown here is the Windows extension, which is itself 
an extension of ISO-Latin-1, which is an extension of ASCII. The site 
notes this, and is in itself correct. And it does not support your idea 
of extended ASCII.

> http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml

This site is plain wrong. Don't believe everything on tha Intuhnet.

M4


------------------------------

Date: Sat, 26 Apr 2008 13:43:16 -0700
From: "szr" <szrRE@szromanMO.comVE>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <fv0455018bg@news4.newsguy.com>

Martijn Lievaart wrote:
> On Sat, 26 Apr 2008 11:59:10 -0700, szr wrote:
>
>> RedGrittyBrick wrote:
>>> szr wrote:
>>>>
>>>> He's after a '&nbsp;', which us a non-breaking space, which is
>>>> ASCII 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as
>>>> '&#160;' .
>>>>
>>>>
>>> s/ASCII/Unicode/
>>
>> No, it's ASCII. Extended Ascii to be precise.
>
> Extended ASCII is a general name for several incompatible extensions
> to ASCII. They are NOT ASCII.
>
> But The above IS Unicode. Which is in itself also an extension of
> ASCII, BTW.


The old printed out list I have doesn't make this distinction, but you 
are right that Unicode is -an- extension.

>> My ascii chart (an old printed out list I have) lists DEC 225 as
>> "Lowercase 'a' with acute accent" and DEC 160 as being reserved or a
>> blank (which is used as a non breaking space.)
>>
>> These links show the same:
>> http://www.ascii-code.com/
>
> The "extended" ASCII shown here is the Windows extension, which in
> itself is an extension of ISO-Latin-1 which is an extension of ASCII.
> The site notes this, and is in itself correct. And it does not
> support your idea of extended ASCII.

I got the same output on my Linux system in its xterm launched from KDE 
as I did in Secure CRT in Windows, which matches up with the output used 
in Windows.

This extended ASCII set I'm referring to is what HTML (such as &nbsp; aka 
&#160;) is based on, or perhaps more precisely based on ISO-Latin-1.

>> http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml
>
> This site is plain wrong.

In what way? It's the same list in my O'Reilly HTML Pocket Reference, as 
is the previous link.

> Don't believe everything on tha Intuhnet.

I don't, but it matches up with what things like HTML go by (again, 
ISO-Latin-1 unless otherwise specified in the HEAD, META tags in the 
case of HTML.)

-- 
szr 




------------------------------

Date: Sat, 26 Apr 2008 23:26:26 +0200
From: Martijn Lievaart <m@rtij.nl.invlalid>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <pan.2008.04.26.21.26.26@rtij.nl.invlalid>

On Sat, 26 Apr 2008 13:43:16 -0700, szr wrote:

>>> http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml
>>
>> This site is plain wrong.
> 
> In what way? It's the same list in my O'Reilly HTML Pocket Reference, as
> is the previous link.

Welcome to the wonderful world of character sets. Or how to lose your 
sanity in a day. Read 
http://en.wikipedia.org/wiki/Western_Latin_character_sets_%28computing%29 
as a good introduction.

It is wrong because it says that the table is "extended ASCII". There is 
no such thing. There's ISO-Latin-1, 2, 3, etc, the Windows character 
set, the Macintosh character set, the IBM extended ASCII set, etc. And 
those are actually used today (except possibly the Mac set, did they 
switch?); there are many, many more that are not frequently used today.

In fact, that table seems to show the Windows character set 
(Windows-1252). A character set which is actually used very little, 
Windows NT and derivatives use UCS16 by preference and the Internet uses 
mainly ISO-Latin-1 or UCS32, although ISO-Latin-15 is used too (it 
contains the Euro sign, which ISO-Latin-1 does not).

My workstation uses ISO-Latin-15. In Windows I can enter characters by 
holding down alt and typing their IBM Extended ASCII code on the numeric 
keypad. So even saying ISO-Latin-1 is by default "the extended character 
set" doesn't hold water, although it probably is the widest used character 
set besides UCS16 and UCS32.

Extended ASCII is a concept, a character set that uses the ASCII codes 
for the first 128 characters (codes 0 through 127). There are many 
extended ASCII sets. Calling 
one THE extended ASCII set is just plain wrong. And calling the Windows 
character set THE extended ASCII set is just ludicrous.
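
[Editor's sketch, not part of the original post.] The point about
incompatible 8-bit sets can be checked with the core Encode module. The
helper name code_point_of_byte is made up for this illustration; the byte
values chosen are the ones where the Euro sign lives in ISO-8859-15 and
in Windows-1252.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode one raw byte under a named encoding and return the Unicode
# code point it maps to. (Hypothetical helper, for illustration only.)
sub code_point_of_byte {
    my ($encoding, $byte) = @_;
    return ord decode($encoding, chr $byte);
}

# Same byte value, different character depending on the set in use.
for my $case (['iso-8859-1', 0xA4], ['iso-8859-15', 0xA4], ['cp1252', 0x80]) {
    my ($encoding, $byte) = @$case;
    printf "%-12s byte 0x%02X -> U+%04X\n",
        $encoding, $byte, code_point_of_byte($encoding, $byte);
}
```

Byte 0xA4 comes out as U+00A4 (currency sign) under ISO-8859-1 but as
U+20AC (Euro sign) under ISO-8859-15, while Windows-1252 keeps the Euro
sign at byte 0x80.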

That is why the world is switching to Unicode. One character set to rule 
them all. But even with Unicode, which one? :-)

M4
  -- I believe in standards. Everyone should have one. --


------------------------------

Date: Sat, 26 Apr 2008 17:40:51 -0400
From: Sherman Pendley <spamtrap@dot-app.org>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <m1iqy45m70.fsf@dot-app.org>

"szr" <szrRE@szromanMO.comVE> writes:

> RedGrittyBrick wrote:
>> szr wrote:
>>>
>>> He's after a '&nbsp;', which us a non-breaking space, which is ASCII
>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;' .
>>>
>>
>> s/ASCII/Unicode/
>
> No, it's ASCII. Extended Ascii to be precise.

There's no such encoding as "extended ASCII." ISO/ANSI standard ASCII is
seven bits. Besides which, the document character set for HTML is clearly
stated to be Unicode in the HTML spec:

    <http://www.w3.org/TR/REC-html40/charset.html#h-5.1>

sherm--

-- 
My blog: http://shermspace.blogspot.com
Cocoa programming in Perl: http://camelbones.sourceforge.net


------------------------------

Date: Sat, 26 Apr 2008 17:55:58 -0400
From: Sherman Pendley <spamtrap@dot-app.org>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <m1ej8s5lht.fsf@dot-app.org>

"szr" <szrRE@szromanMO.comVE> writes:

> This extended ASCII set I'm referring to is what HTML (such as &nbsp; aka 
> &#160;) is based on, or perhaps more precisely based on ISO-Latin-1.

It's clearly documented by the W3C that numeric entities in HTML refer to
Unicode code points:

    <http://www.w3.org/TR/REC-html40/charset.html#h-5.1>

> I don't, but it matches up with what things like HTML go by (again, 
> ISO-Latin-1 unless otherwise specified in the HEAD, META tags in the 
> case of HTML.)

For one thing, document encoding is an entirely different animal; numeric
entities always refer to Unicode, even when the document encoding is not
Unicode.
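
[Editor's sketch, not part of the original post.] A minimal illustration
of that distinction, assuming only the core Encode module. The function
expand_numeric_refs is a simplified stand-in written for this note (no
named entities, no validation); real code would more likely use a parser
module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Replace decimal (&#160;) and hexadecimal (&#xA0;) references with the
# Unicode character they name. Simplified on purpose.
sub expand_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-Fa-f]+);/chr hex $1/ge;
    $text =~ s/&#([0-9]+);/chr $1/ge;
    return $text;
}

my $link_text = expand_numeric_refs('Edit&#160;Librarians');

# The character is U+00A0 either way; only the serialized bytes differ
# with the document encoding.
for my $encoding ('iso-8859-1', 'UTF-8') {
    my $bytes = encode($encoding, $link_text);
    printf "%-10s %s\n", $encoding,
        join ' ', map { sprintf '%02X', ord } split //, $bytes;
}
```

The reference always expands to code point 160; the no-break space is
then written as the single byte A0 in ISO-8859-1 and as the two bytes
C2 A0 in UTF-8.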

For another, the *correct* way to communicate document encoding, whether
it's for an HTML, XML, or some other ML document, is to include it as part
of the content-type HTTP header.
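
[Editor's sketch, not part of the original post.] Reading the declared
encoding out of such a header could look like the following. The function
name and the sample header values are invented for the example, and the
pattern is deliberately loose rather than a full header parser.

```perl
use strict;
use warnings;

# Pull the charset parameter out of a Content-Type header value.
# Returns undef when the header carries no charset parameter.
sub charset_from_content_type {
    my ($header) = @_;
    return $header =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i ? lc $1 : undef;
}

for my $header ('text/html; charset=ISO-8859-1',
                'text/html; charset="utf-8"',
                'text/html') {
    my $charset = charset_from_content_type($header);
    printf "%-32s => %s\n", $header,
        defined $charset ? $charset : '(none given)';
}
```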

sherm--

-- 
My blog: http://shermspace.blogspot.com
Cocoa programming in Perl: http://camelbones.sourceforge.net


------------------------------

Date: Sat, 26 Apr 2008 23:20:42 +0100
From: RedGrittyBrick <RedGrittyBrick@SpamWeary.foo>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <m5mdnTKv16hdN47VnZ2dnUVZ8j6dnZ2d@bt.com>

szr wrote:
> RedGrittyBrick wrote:
>> szr wrote:
>>> He's after a '&nbsp;', which us a non-breaking space, which is ASCII
>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;' .
>>>
>> s/ASCII/Unicode/
> 
> No, it's ASCII. 

Lots of people make this mistake. As your first reference says, ASCII is 
a 7-bit character set and does not define a character at code-point 160.

> Extended Ascii to be precise.

To be imprecise!

There are many different incompatible character sets and encodings that 
claim to be "Extended ASCII"

Read http://en.wikipedia.org/wiki/Extended_ascii
Especially 
http://en.wikipedia.org/wiki/Extended_ascii#Character_set_confusion

See 160 = "lowercase a acute" in these "Extended ASCII" tables:

http://www.webopedia.com/TERM/E/extended_ASCII.html
http://www.telacommunications.com/nutshell/extascii.htm
http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm
http://telecom.tbi.net/asc-ibm.html
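
[Editor's sketch, not part of the original post.] Those tables appear to
show the IBM PC code page; assuming that is code page 437, the core
Encode module reproduces the "160 = lowercase a acute" reading. The
helper name byte_as_code_point is made up for this note.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Interpret a single byte under the given encoding and return the
# Unicode code point it stands for. (Hypothetical helper.)
sub byte_as_code_point {
    my ($encoding, $byte) = @_;
    return ord decode($encoding, chr $byte);
}

my $as_latin1 = byte_as_code_point('iso-8859-1', 160);
my $as_cp437  = byte_as_code_point('cp437', 160);

printf "byte 160 as iso-8859-1: U+%04X (decimal %d)\n", $as_latin1, $as_latin1;
printf "byte 160 as cp437:      U+%04X (decimal %d)\n", $as_cp437,  $as_cp437;
```

Read as ISO-8859-1, byte 160 is U+00A0 (no-break space); read as cp437 it
is U+00E1, decimal 225, the a with acute accent. A mix-up of this kind is
one possible source of the "char 225" reported elsewhere in this thread,
though nobody in the thread confirms that.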

-- 
RGB


------------------------------

Date: Sat, 26 Apr 2008 23:05:56 -0700
From: "szr" <szrRE@szromanMO.comVE>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <fv154502hrv@news4.newsguy.com>

Sherman Pendley wrote:
> "szr" <szrRE@szromanMO.comVE> writes:
>
>> RedGrittyBrick wrote:
>>> szr wrote:
>>>>
>>>> He's after a '&nbsp;', which us a non-breaking space, which is
>>>> ASCII 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as
>>>> '&#160;' .
>>>>
>>>
>>> s/ASCII/Unicode/
>>
>> No, it's ASCII. Extended Ascii to be precise.
>
> There's no such encoding as "extended ASCII." ISO/ANSI standard ASCII
> is seven bits. Besides which, the document character set for HTML is
> clearly stated to be Unicode in the HTML spec:
>
>    http://www.w3.org/TR/REC-html40/charset.html#h-5.1

You're right. Perhaps too many things competing for brain-time and 
somehow that got by me when I should have known better. Thanks :-)

-- 
szr 




------------------------------

Date: 28 Apr 2008 04:30:56 GMT
From: John Bokma <john@castleamber.com>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <Xns9A8DEF37BDF9castleamber@130.133.1.4>

"szr" <szrRE@szromanMO.comVE> wrote:

> John Bokma wrote:
>> "M.O.B. i L." <mikaelb@df.lth.se> wrote:
>>
>>> John Bokma wrote:
>>
>> [..]
>>
>>>> HTML::TreeBuilder, or a module it's using, returns &nbsp; as a
>>>> single character, it might be that you have to
>>>> use the code instead.
>>>>
>>>> Comment on
>>>> http://johnbokma.com/perl/search-term-suggestion-tool.html says:
>>>> (&nbsp;, stored as char 225)
>>>>
>>>> So you might want to try: "Edit\xe1Librarians".
>>>>
>>>> Wild guess.
>>>>
>>> Thanks! But it should be \xa0.
>>
>> Yeah, but HTML::TreeBuilder returns it as 225 :-D.
> 
> He's after a '&nbsp;',

Yes, I am aware of that. And somehow HTML::TreeBuilder or a module it uses 
returns &nbsp; as \xe1.
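
[Editor's sketch, not part of the original post.] Rather than guessing
which single character the parser hands back, the link text can be matched
with a pattern that accepts either kind of space. The pattern and the
helper below are illustrative only; WWW::Mechanize's follow_link does
accept a text_regex argument, but nothing here fetches a page.

```perl
use strict;
use warnings;

# Accept a plain space (U+0020) or a no-break space (U+00A0) between the
# two words. &nbsp; and &#160; both expand to U+00A0.
my $link_text_re = qr/\AEdit[\x20\xA0]Librarians\z/;

# Hypothetical helper so the pattern can be tried on plain strings.
sub matches_link_text {
    my ($text) = @_;
    return $text =~ $link_text_re ? 1 : 0;
}

printf "%-24s %s\n", $_->[0], matches_link_text($_->[1]) ? 'match' : 'no match'
    for ['plain space',    "Edit Librarians"],
        ['no-break space', "Edit\xA0Librarians"],
        ['no space',       "EditLibrarians"];

# With a live page this might become, untested here:
#   $mech->follow_link( text_regex => $link_text_re );
```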

-- 
John

http://johnbokma.com/perl/


------------------------------

Date: Sun, 27 Apr 2008 22:28:06 -0700
From: "szr" <szrRE@szromanMO.comVE>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <fv3n9802unq@news4.newsguy.com>

John Bokma wrote:
> "szr" <szrRE@szromanMO.comVE> wrote:
>
>> John Bokma wrote:
>>> "M.O.B. i L." <mikaelb@df.lth.se> wrote:
>>>
>>>> John Bokma wrote:
>>>
>>> [..]
>>>
>>>>> HTML::TreeBuilder, or a module it's using, returns &nbsp; as a
>>>>> single character, it might be that you have to
>>>>> use the code instead.
>>>>>
>>>>> Comment on
>>>>> http://johnbokma.com/perl/search-term-suggestion-tool.html says:
>>>>> (&nbsp;, stored as char 225)
>>>>>
>>>>> So you might want to try: "Edit\xe1Librarians".
>>>>>
>>>>> Wild guess.
>>>>>
>>>> Thanks! But it should be \xa0.
>>>
>>> Yeah, but HTML::TreeBuilder returns it as 225 :-D.
>>
>> He's after a '&nbsp;',
>
> Yes, I am aware of that. And somehow HTML::TreeBuilder or a module it
> uses returns &nbsp; as \xe1.

Yes. The question is whether this is a bug in HTML::TreeBuilder or 
whether there is a logical reason for it. DEC 225 doesn't seem to be a 
space of any kind in any ascii list I've checked, but I don't doubt I've 
missed one somewhere :-)

-- 
szr 




------------------------------

Date: Mon, 28 Apr 2008 09:15:20 +0200
From: "Dr.Ruud" <rvtol+news@isolution.nl>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <fv44p5.1ho.1@news.isolution.nl>

RedGrittyBrick schreef:
> szr:

>> He's after a '&nbsp;', which us a non-breaking space, which is ASCII
>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;' .
> 
> s/ASCII/Unicode/

Exactly. ISO-8859-* too. 

-- 
Affijn, Ruud

"Gewoon is een tijger."


------------------------------

Date: Mon, 28 Apr 2008 09:23:16 +0200
From: "Dr.Ruud" <rvtol+news@isolution.nl>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <fv452i.1c0.1@news.isolution.nl>

szr schreef:
> RedGrittyBrick:
>> szr:

>>> He's after a '&nbsp;', which us a non-breaking space, which is ASCII
>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;' .
>>
>> s/ASCII/Unicode/
>
> No, it's ASCII. Extended Ascii to be precise.

The ASCII character set is a 7-bit code and it contains 128 characters,
not more.
See also `man ascii`.
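
[Editor's sketch, not part of the original post.] The 7-bit limit is easy
to state in code. The function name is invented for this note.

```perl
use strict;
use warnings;

# ASCII defines code points 0 through 127 only (7 bits, 128 characters).
sub is_ascii_code_point {
    my ($code_point) = @_;
    return $code_point >= 0 && $code_point <= 0x7F ? 1 : 0;
}

printf "%3d is %s\n", $_, is_ascii_code_point($_) ? 'ASCII' : 'outside ASCII'
    for 65, 127, 160, 225;
```

Both 160 and 225, the two values argued over in this thread, fall outside
ASCII.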

-- 
Affijn, Ruud

"Gewoon is een tijger."



------------------------------

Date: Mon, 28 Apr 2008 10:55:20 +0200
From: Martijn Lievaart <m@rtij.nl.invlalid>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <pan.2008.04.28.08.55.20@rtij.nl.invlalid>

On Mon, 28 Apr 2008 09:15:20 +0200, Dr.Ruud wrote:

> RedGrittyBrick schreef:
>> szr:
> 
>>> He's after a '&nbsp;', which us a non-breaking space, which is ASCII
>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;' .
>> 
>> s/ASCII/Unicode/
> 
> Exactly. ISO-8859-* too.

No, no, HTML uses Unicode codepoints (which in this case coincide, but 
that's beside the (code)point).

M4


------------------------------

Date: Mon, 28 Apr 2008 11:47:42 +0200
From: "Dr.Ruud" <rvtol+news@isolution.nl>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <fv4dko.1hk.1@news.isolution.nl>

Martijn Lievaart schreef:

> ISO-Latin-1

Normally called "ISO 8859-1" or "ISO Latin 1" or just "Latin-1". 

Though ITYM: "ISO-8859-1". (the real one with two hyphens) 

-- 
Affijn, Ruud

"Gewoon is een tijger."


------------------------------

Date: Mon, 28 Apr 2008 11:40:18 +0200
From: "Dr.Ruud" <rvtol+news@isolution.nl>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <fv4dba.1h8.1@news.isolution.nl>

Martijn Lievaart schreef:
> Dr.Ruud:
>> RedGrittyBrick:
>>> szr:

>>>> He's after a '&nbsp;', which us a non-breaking space, which is
>>>> ASCII 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as
>>>> '&#160;' .
>>>
>>> s/ASCII/Unicode/
>>
>> Exactly. ISO-8859-* too.
>
> No, no, HTML uses Unicode codepoints (which in this case coincide, but
> that's beside the (code)point).

No, no, no, no, that depends on the encoding being used. Yes, numeric
references always refer to Universal Character Set code points,
regardless of the page's encoding, but HTML is not "limited" to that.

See also http://www.xs4all.nl/~rvtol/htmlcods.html which has been
rendered in many different (so non-"standard") ways in the past 10+
years. :)

-- 
Affijn, Ruud

"Gewoon is een tijger."



------------------------------

Date: Mon, 28 Apr 2008 11:59:40 +0200
From: "Dr.Ruud" <rvtol+news@isolution.nl>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <fv4e7k.1f0.1@news.isolution.nl>

Martijn Lievaart schreef:

> A character set which is actually used very little,
> Windows NT and derivatives use UCS16 by preference and the Internet
> uses mainly ISO-Latin-1 or UCS32, although ISO-Latin-15 is used too
> (it contains the Euro sign, which ISO-Latin-1 does not).
>
> My workstation uses ISO-Latin-15. In Windows I can enter characters
> by holding down alt and typing their IBM Extended ASCII code on
> the numeric keypad. So even saying ISO-Latin-1 is by default "the
> extended character set" doesn't hold water, although it probably is
> the widest used character set besides UCS16 and UCS32.

s/ISO-Latin/ISO Latin/g

With UCS16 you probably mean "UCS-2", or "UTF-16" (which is an extension
of UCS-2).
With UCS32 you probably mean "UCS-4" (which is also called "UTF-32").
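
[Editor's sketch, not part of the original post.] The practical difference
between the 16-bit and 32-bit forms shows up in byte counts. This uses the
core Encode module; encoded_length is a name made up for the example, and
U+1D11E was picked only because it lies outside the 16-bit range.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes a single code point occupies in the given encoding.
# (Hypothetical helper, for illustration only.)
sub encoded_length {
    my ($encoding, $code_point) = @_;
    return length encode($encoding, chr $code_point);
}

for my $code_point (0x00A0, 0x20AC, 0x1D11E) {
    printf "U+%05X  UTF-16BE: %d bytes  UTF-32BE: %d bytes\n",
        $code_point,
        encoded_length('UTF-16BE', $code_point),
        encoded_length('UTF-32BE', $code_point);
}
```

UTF-32 always takes four bytes. UTF-16 takes two bytes up to U+FFFF and
four bytes (a surrogate pair) beyond that, which is the extension over
UCS-2 mentioned above.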

-- 
Affijn, Ruud

"Gewoon is een tijger."



------------------------------

Date: Mon, 28 Apr 2008 16:55:26 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <ek8he5-443.ln1@osiris.mauzo.dyndns.org>


Quoth "Dr.Ruud" <rvtol+news@isolution.nl>:
> Martijn Lievaart schreef:
> 
> > ISO-Latin-1
> 
> Normally called "ISO 8859-1" or "ISO Latin 1" or just "Latin-1". 

They are different... ISO Latin 1 is a character set (an unordered
collection of characters). ISO-8859-1 is a particular encoding of that
character set as 8-bit integers. There are others; in particular some
EBCDIC codepages.
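
[Editor's sketch, not part of the original post.] One character, several
encodings: the following assumes the Encode distribution's EBCDIC tables
are installed (they ship with it) and uses cp37 as an example EBCDIC code
page. The helper name encoded_hex is invented for this note.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hex dump of one character as serialized by the given encoding.
# (Hypothetical helper, for illustration only.)
sub encoded_hex {
    my ($encoding, $char) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, encode($encoding, $char);
}

my $nbsp = "\x{A0}";    # NO-BREAK SPACE, part of the Latin 1 repertoire

printf "%-11s %s\n", $_, encoded_hex($_, $nbsp)
    for 'iso-8859-1', 'cp37', 'UTF-8';

# The character survives a round trip through the EBCDIC code page, even
# though its byte value there differs from the ISO-8859-1 one.
print "cp37 round trip ok\n"
    if decode('cp37', encode('cp37', $nbsp)) eq $nbsp;
```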

Ben

-- 
  Joy and Woe are woven fine,
  A Clothing for the Soul divine       William Blake
  Under every grief and pine          'Auguries of Innocence'
  Runs a joy with silken twine.                                ben@morrow.me.uk
-- 
For far more marvellous is the truth than any artists of the past imagined!
Why do the poets of the present not speak of it? What men are poets who can
speak of Jupiter if he were like a man, but if he is an immense spinning
sphere of methane and ammonia must be silent? [Feynmann]     ben@morrow.me.uk


------------------------------

Date: Mon, 28 Apr 2008 19:38:03 +0200
From: "Dr.Ruud" <rvtol+news@isolution.nl>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <fv5953.1n0.1@news.isolution.nl>

Ben Morrow schreef:
> Dr.Ruud:
>> Martijn Lievaart:

>>> ISO-Latin-1
>>
>> Normally called "ISO 8859-1" or "ISO Latin 1" or just "Latin-1".
>
> They are different... ISO Latin 1 is a character set (an unordered
> collection of characters). ISO-8859-1 is a particular encoding of that
> character set as 8-bit integers. There are others; in particular some
> EBCDIC codepages.

"ISO-8859-1" wasn't mentioned in the part that you quote, so I don't see
what you mean with "They".

-- 
Affijn, Ruud

"Gewoon is een tijger."



------------------------------

Date: Mon, 28 Apr 2008 19:16:39 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <7tghe5-kp9.ln1@osiris.mauzo.dyndns.org>

[sorry about the doubled .sig in my previous post. I'll try not to let
it happen again... :(]

Quoth "Dr.Ruud" <rvtol+news@isolution.nl>:
> Ben Morrow schreef:
> > Dr.Ruud:
> >> Martijn Lievaart:
> 
> >>> ISO-Latin-1
> >>
> >> Normally called "ISO 8859-1" or "ISO Latin 1" or just "Latin-1".
> >
> > They are different... ISO Latin 1 is a character set (an unordered
> > collection of characters). ISO-8859-1 is a particular encoding of that
> > character set as 8-bit integers. There are others; in particular some
> > EBCDIC codepages.
> 
> "ISO-8859-1" wasn't mentioned in the part that you quote, so I don't see
> what you mean with "They".

I'm confused. You said "ISO 8859-1" and "ISO Latin 1" as though they
were equivalent, which they aren't. If you're trying to make 
"ISO 8859-1" (sans hyphen) equivalent to "ISO Latin 1" but "ISO-8859-1"
(with hyphen) not, then I'd call that more than a little confusing. For
a start, how would you interpret "ISO 8859-9"? As the Latin-9 character
set used by ISO-8859-15, or as the ISO-8859-9 encoding of the Latin-5
character set?

FWIW, Perl agrees with me:

~% perl -MEncode -le'print Encode::resolve_alias "ISO 8859-9"'
iso-8859-9
~% perl -MEncode -le'print Encode::resolve_alias "ISO Latin-9"'
iso-8859-15

though allowing 'Latin-N' to mean 'the usual ISO-8859 encoding of the
Latin-N character set' is arguably only increasing the confusion between
the two.

Ben

-- 
For the last month, a large number of PSNs in the Arpa[Inter-]net have been
reporting symptoms of congestion ... These reports have been accompanied by an
increasing number of user complaints ... As of June,... the Arpanet contained
47 nodes and 63 links. [ftp://rtfm.mit.edu/pub/arpaprob.txt] * ben@morrow.me.uk


------------------------------

Date: Mon, 28 Apr 2008 22:04:10 +0200
From: Martijn Lievaart <m@rtij.nl.invlalid>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <pan.2008.04.28.20.04.10@rtij.nl.invlalid>

On Mon, 28 Apr 2008 11:59:40 +0200, Dr.Ruud wrote:

> s/ISO-Latin/ISO Latin/g
> 
> With UCS16 you probably mean "UCS-2", or "UTF-16" (which is an extension
> of UCS-2).
> With UCS32 you probably mean "UCS-4" (which is also called "UTF-32").

I stand corrected, I meant UCS-2 and -4. I was indeed confused by the UTF 
encodings.

M4


------------------------------

Date: Mon, 28 Apr 2008 22:07:51 +0200
From: Martijn Lievaart <m@rtij.nl.invlalid>
Subject: Re: WWW::Mechanize doesn't always follow_link(text
Message-Id: <pan.2008.04.28.20.07.51@rtij.nl.invlalid>

On Mon, 28 Apr 2008 11:40:18 +0200, Dr.Ruud wrote:

> Martijn Lievaart schreef:
>> Dr.Ruud:
>>> RedGrittyBrick:
>>>> szr:
> 
>>>>> He's after a '&nbsp;', which us a non-breaking space, which is ASCII
>>>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;' .
>>>>
>>>> s/ASCII/Unicode/
>>>
>>> Exactly. ISO-8859-* too.
>>
>> No, no, HTML uses Unicode codepoints (which in this case coincide, but
>> that's beside the (code)point).
> 
> No, no, no, no, that depends on the encoding being used. Yes, numeric
> references always refer to Universal Character Set code points,
> regardless of the page's encoding, but HTML is not "limited" to that.

No, no, no, no, no :-) You already said it yourself, numeric references 
always refer to Unicode codepoints. That's the only point I was trying to 
make, and why you cannot substitute ISO-8859-* above.

M4



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc.  For subscription or unsubscription requests, send
#the single line:
#
#	subscribe perl-users
#or:
#	unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.  

NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice. 

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 1490
***************************************

