[6012] in www-talk@info.cern.ch


Re: Putting the "World" back in WWW...

daemon@ATHENA.MIT.EDU ((Frank Rojas ))
Mon Oct 3 22:14:08 1994

Date: Tue, 4 Oct 1994 03:09:13 +0100
Errors-To: listmaster@www0.cern.ch
Reply-To: fxrojas@nlsarch.austin.ibm.com
From: (Frank Rojas  ) <fxrojas@nlsarch.austin.ibm.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>

    From:  hallam@dxal18.cern.ch (HALLAM-BAKER Phillip)
    Date: Mon, 03 Oct 94 20:36:56 +0100

    It is simply another content encoding to deal with.

    A charset module can easily be written to convert fairly arbitrary encodings
    into UNICODE tokens. This can also do UTS, ASCII, ISO-8859, JIS, and whacky
    Russian etc. encodings.

I'm not sure I follow... excuse me if I missed the point ... but it sounds like
you are suggesting we put "ANY ENCODING" in the document and have each viewer
convert into UNICODE... 

If so, this will cause MAJOR interoperability problems across the network.
Expecting every client to be able to convert to and from every possible
encoding will never work. Consider that Latin-1 alone has: PC 437, PC 850,
EBCDIC, ISO8859, UTF, UCS, other PC national code pages...

Rather, the document should be supplied in a canonical encoding, i.e. UCS,
so that each client has to provide at most 1 conversion.
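
To make the "convert once at the supplier" idea concrete, here is a minimal
sketch in Python. The function names are hypothetical, and "utf-16-be" is used
only as a stand-in for UCS-2 (an assumption; the two agree for characters that
fit in a single 16-bit unit). It is an illustration, not a proposed
implementation.

```python
def to_canonical(data: bytes, source_encoding: str) -> bytes:
    """Supplier side: convert a document once into 16-bit big-endian units.

    `source_encoding` is a Python codec name such as "cp437", "cp850",
    or "latin-1" (hypothetical choice of labels for this sketch).
    """
    text = data.decode(source_encoding)
    return text.encode("utf-16-be")


def from_canonical(data: bytes) -> str:
    """Client side: the single conversion every viewer has to support."""
    return data.decode("utf-16-be")


# The same text stored under three different local code pages ...
local_copies = {
    "cp437": "café".encode("cp437"),
    "cp850": "café".encode("cp850"),
    "latin-1": "café".encode("latin-1"),
}

# ... ends up as one identical byte stream after supplier-side conversion.
canonical = {to_canonical(raw, enc) for enc, raw in local_copies.items()}
```

With this arrangement the client needs only from_canonical(), whatever code
page the supplier started from.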

    On the other side I am looking into a scheme of `multifonts' which allows 
    several X11 fonts to be compounded into a single UNICODE mapping. 

If this is so, then storing them with UNICODE makes more sense since such
fonts will exist... and there is no conversion at view time.

    Because the
    display module is directly engaged we can translate into the target font
    character by character. This scheme means that the UNICODE stuff does not cause 
    increased internal storage requirements.

But this causes a nightmare for system administrators, who need to provide
conversions from every other encoding to UNICODE... and it puts the burden of
conversion on the clients each time the document is accessed rather than on the
supplier one time.

I fully realize we can't convert over to a single canonical form overnight.
But we should provide conventions that reinforce simple administration
and enhance interoperability for all systems.

    So the content type is

    text/html                   (default to ISO-8859-1)
    text/html; charset=UNICODE
    text/html; charset=UTF
    text/html; charset=ISO..
    text/html; charset=JIS
    etc.

Here is a proposal that would help to converge on a uniform 
canonical encoding....

    text/html; charset=charset_name

where

charset_name            := UCS2_ID'_'UCS_plane 
                           | DCE_ID'_'DCE_encoding
                           | private_encoding

UCS2_ID                 := 'UCS2'

UCS_plane               := '0x'<4_hexdigits>

DCE_ID                  := 'DCE'

DCE_encoding            := '0x'<8_hexdigits>

private_encoding        := portable character string
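
As a rough illustration of the grammar above, a recognizer could look like the
following Python sketch. The names Charset and parse_charset_name are
hypothetical, and treating anything that matches neither tagged form as a
private_encoding is an assumption of this sketch, since "portable character
string" is not defined further here.

```python
import re
from typing import NamedTuple, Optional


class Charset(NamedTuple):
    kind: str             # "UCS2", "DCE", or "private"
    value: Optional[int]  # numeric UCS_plane or DCE_encoding, None if private
    name: str             # the charset_name exactly as received


# UCS2_ID '_' UCS_plane, where UCS_plane := '0x' <4_hexdigits>
_UCS2 = re.compile(r"UCS2_0x([0-9A-Fa-f]{4})")
# DCE_ID '_' DCE_encoding, where DCE_encoding := '0x' <8_hexdigits>
_DCE = re.compile(r"DCE_0x([0-9A-Fa-f]{8})")


def parse_charset_name(name: str) -> Charset:
    """Classify a charset_name according to the three alternatives."""
    m = _UCS2.fullmatch(name)
    if m:
        return Charset("UCS2", int(m.group(1), 16), name)
    m = _DCE.fullmatch(name)
    if m:
        return Charset("DCE", int(m.group(1), 16), name)
    # Assumption: everything else is taken as a private_encoding.
    return Charset("private", None, name)
```

Note that a malformed tagged name such as UCS2_0x00 falls through to
"private" in this sketch; a stricter parser might reject it instead.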

For UCS2_ID, the characters will be encoded using 1-byte values corresponding to
its UCS_plane within UCS-2. The value of any given character can be combined with
the UCS_plane value to obtain a UNICODE value. For example, UCS2_0x0000 is Latin-1!
For CJK, the DCE_ID of UCS-2 is recommended (see below).
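
A small Python sketch of how a viewer might combine each byte with the
UCS_plane value. It assumes that "combined" means adding the 1-byte value to
the 4-hex-digit UCS_plane value, which is one possible reading of the
paragraph above; the function name is hypothetical.

```python
def decode_ucs2_plane(data: bytes, plane: int) -> str:
    """Map each 1-byte value to a 16-bit value by adding the plane value.

    `plane` is the integer behind the '0x'<4_hexdigits> in the charset name,
    e.g. 0x0000 for charset=UCS2_0x0000. Addition is an assumption here.
    """
    if not 0 <= plane <= 0xFFFF:
        raise ValueError("UCS_plane must fit in 4 hex digits")
    return "".join(chr(plane + byte) for byte in data)


# With plane 0x0000 the result is the same as a Latin-1 decode.
latin1_text = decode_ucs2_plane(b"caf\xe9", 0x0000)

# Under the same assumption, plane 0x0400 would select the Cyrillic block.
cyrillic_a = decode_ucs2_plane(b"\x10", 0x0400)
```

The point of the scheme is that the stored text stays at 1 byte per character
while the charset name alone is enough to recover full UNICODE values.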

For any DCE value, the characters will be encoded using the encoding registry
developed for DCE.  Refer to DCE RFA 41.1 and the X/Open Federated Naming 
Specification.  The latter relies on the DCE encoding registry
to tag names of different encodings.  The nice thing is that the DCE
registry includes UCS-2 as one of the encodings. For example, the DCE
registered value for UCS-2 level 3 is: charset=DCE_0x00010102

Frank

PS. I really don't like the use of hexdigits in the names but it is more 
precise...
