[5937] in www-talk@info.cern.ch

Putting the "World" back in WWW...

daemon@ATHENA.MIT.EDU (John Ludeman)
Fri Sep 30 01:34:38 1994

Date: Fri, 30 Sep 1994 06:28:32 +0100
Errors-To: listmaster@www0.cern.ch
Reply-To: johnl@microsoft.com
From: John Ludeman <johnl@microsoft.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>


Microsoft has discussed the issue of internationalizing the Web with
some web-related vendors and would like to make a proposal, but first,
let me give a few (unofficial) definitions:

Unicode - A 16-bit character encoding scheme that has defined code
points for every character in virtually every language in the world.
All common languages can be expressed in this scheme.  It includes
18,000 Han characters set by industry standards in China, Japan, Korea
and Taiwan; other supported languages include Greek, Hebrew, Latin,
Pali, Sanskrit and literary Chinese.  In addition, several hundred
common math symbols, geometric shapes, and basic dingbats are defined.
At the beginning of a Unicode document is a two-byte signature called
the Byte Order Mark (BOM), 0xFEFF, so clients can resolve big
endian/little endian differences.
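
As a rough illustration of how a client might use the BOM, here is a
minimal sketch in Python.  The function name decode_unicode_document is
hypothetical, and the sketch assumes Python's "utf-16-be" and
"utf-16-le" codecs are an acceptable stand-in for the 16-bit encoding
described above.

```python
import codecs

def decode_unicode_document(data: bytes) -> str:
    """Decode a 16-bit Unicode document using its leading byte order mark.

    Hypothetical helper for illustration only.
    """
    if data[:2] == codecs.BOM_UTF16_BE:      # bytes FE FF: big endian
        return data[2:].decode("utf-16-be")
    if data[:2] == codecs.BOM_UTF16_LE:      # bytes FF FE: little endian
        return data[2:].decode("utf-16-le")
    # Without a BOM the byte order cannot be determined from the signature.
    raise ValueError("no byte order mark found")
```

A reader that sees FF FE instead of FE FF knows the sender used the
opposite byte order and swaps each pair of bytes accordingly.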

UTF-8 - A special byte encoding that makes full 8-bit transfer over
7-bit gateways safe.  Since HTTP is defined to be 8-bit clean, this is
not needed.


The HTTP protocol (since it is a protocol) will always be in Latin-1
using 8-bit characters (and hopefully this will move to a binary scheme).

A good solution for the multi-lingual problem would be to use Unicode
as the character encoding.  Issues of left-to-right and right-to-left
text are simply resolved by the Unicode character code (no language or
locale information needed).  A document can contain any language at any
place in the document without special delimiters.  The only limitation
is whether the browser has the font for the language and supports
languages that are not written left to right (which a real
international browser should).
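
To illustrate how direction follows from the character code alone, here
is a small sketch using Python's standard unicodedata module.  The
helper name direction_classes is hypothetical; the class names it
returns ("L", "R", "AL", and so on) come from the Unicode character
database shipped with Python, not from anything in this proposal.

```python
import unicodedata

def direction_classes(text: str) -> list:
    """Return the bidirectional class of each character in text.

    Hypothetical helper: no language or locale is consulted, only the
    character codes themselves.
    """
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin "a", Hebrew alef (U+05D0), Arabic alef (U+0627)
print(direction_classes("a\u05d0\u0627"))   # ['L', 'R', 'AL']
```

A browser would still have to apply a layout algorithm on top of these
classes, but no markup in the document is required to obtain them.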

To go along with this, I would suggest adding a new MIME type,
"text/uni-html", that indicates the document is using Unicode.  This
allows sites that don't need the extra information to decline the
Unicode versions of documents (which saves on bandwidth).  This also
means that existing servers work and don't care that they are serving
up Unicode docs.  The HTML definition doesn't change, except that a
remark needs to be added: if the document is a Unicode HTML document,
then the BOM will be the first two bytes of the document.
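
One way a server could act on this is sketched below in Python.  This
is only an illustration under stated assumptions: the function names
choose_variant and encode_body are hypothetical, the Accept header
parsing ignores quality values, and big-endian byte order is an
arbitrary choice for the example.

```python
import codecs

UNICODE_HTML = "text/uni-html"   # MIME type proposed in this message
PLAIN_HTML = "text/html"

def choose_variant(accept_header: str) -> str:
    """Pick the MIME type to serve from the client's Accept header.

    Simplified on purpose: parameters such as q-values are dropped.
    """
    offered = [item.split(";")[0].strip().lower()
               for item in accept_header.split(",")]
    return UNICODE_HTML if UNICODE_HTML in offered else PLAIN_HTML

def encode_body(html: str, mime_type: str) -> bytes:
    """Encode the document body for the chosen MIME type."""
    if mime_type == UNICODE_HTML:
        # The BOM must be the first two bytes of a Unicode HTML document.
        return codecs.BOM_UTF16_BE + html.encode("utf-16-be")
    # Clients that did not ask for Unicode get Latin-1; characters
    # outside Latin-1 are replaced with "?".
    return html.encode("latin-1", errors="replace")
```

A client that never lists "text/uni-html" in its Accept header keeps
receiving the 8-bit version and pays no bandwidth penalty.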

I have been writing Unicode networking applications for many years and
I'm sold on its simplicity and elegance.  There is simply no better
way to internationalize a product.  If other vendors are interested in
pursuing this with Microsoft, please let me know.

John
