[6547] in www-talk@info.cern.ch


RE: WWW support for Cyrillic (and UNICODE)

daemon@ATHENA.MIT.EDU (Vladimir Sukonnik, Process Softwar)
Thu Nov 3 08:23:30 1994

Date: Thu, 3 Nov 1994 14:22:00 +0100
Errors-To: listmaster@www0.cern.ch
Reply-To: sukonnik@elnath.process.com
From: sukonnik@elnath.process.com (Vladimir Sukonnik, Process Software Corp)
To: Multiple recipients of list <www-talk@www0.cern.ch>

>Date: Wed, 2 Nov 94 16:19:11 CST
>From: "Richard L. Goerwitz" <goer@midway.uchicago.edu>
>Subject: RE: WWW support for Cyrillic (and UNICODE)

>I've heard the Microsoft hoopla, but so far can't determine what it
>is all about.  From what I can tell, Microsoft is using Unicode ac-
>cording to the old, internationalization/localization model.  It is
>not using Unicode as it should be used, namely as a multilingual en-
>coding standard.  Take a look at Apple's WorldScript for an example
>of that.

I apologize for taking up bandwidth with something that may be
trivial to the list (please let me know if it is), but I need a little education.

Would you please explain 

1. What the old internationalization/localization model is,
2. What the multilingual encoding standard is, and
3. How UNICODE fits into all of this.

>If you have the specs on hand, tell me what the 32-bit GUI does with
>characters in the Arabic and Hebrew code blocks, by the way.  I am
>not fishing for a particular response here, but I suspect that the
>system is nonconformant in that it will recognize the codes, but yet
>fail to show the behavior outlined in appendix A of the standard.

Below, I am including a write-up on UNICODE that I found in the Microsoft
Development Library (the October '94 issue). I would very much like to understand
what the issues are. Thank you for taking the time to explain it.


>Richard Goerwitz

		Best regards,
		Vladimir.

Since code pages are different for each script and operating environment, attempts to 
standardize and consolidate all code pages onto one code page are now underway. One 
standard is called Unicode, an effort driven by Apple, Borland, Digital, 
Hewlett-Packard, IBM, Lotus, Metaphor, Microsoft, Next, Novell, Research Libraries 
Group, Sun, WordPerfect, and Xerox. 

Unicode is a pure 16-bit character encoding that encompasses all characters used in 
general text interchange. The two-volume standard is available from Addison-Wesley as 
The Unicode Standard: Worldwide Character Encoding, Volume 1 and The Unicode Standard: 
Worldwide Character Encoding, Volume 2 (ISBN 0-201-56788-1 and ISBN 0-201-60845-6, 
respectively).

The ISO DIS (Draft International Standard) 10646.2 was merged with Unicode version 1.0 to 
form Unicode's current version, 1.1.

Differences between Unicode and Existing Code Pages

The main differences between Unicode and existing code pages are as follows:

	All characters are 16 bits wide. Unlike DBCS, there are no lead bytes. Random 
access to character strings is therefore possible, and programs generally don't need to 
maintain state information when parsing strings.

	A Unicode index refers unequivocally to a given character; for example, the 
symbol happy face and the control code Ctrl-A are two different characters in Unicode.

	Unicode is a character encoding. Ligatures are not characters, but glyphs. Text 
in Unicode is always in character form. Only in the final output stage does an application 
or graphics engine combine f and i and substitute the glyph for the fi ligature.

	There are non-spacing accents in Unicode that can be combined with base 
characters to create composite letters.

Unicode provides mappings to and from all the important single-byte code pages in use on 
computers today.
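
To make the "no lead bytes, random access" point concrete, here is a minimal sketch in Python. It is an illustration only, not part of the Microsoft write-up. It assumes Shift-JIS as a representative DBCS code page and little-endian 16-bit storage for Unicode, and the helper names (char_at_fixed16, char_at_dbcs, is_sjis_lead_byte) are hypothetical.

```python
# Sketch: indexing into fixed-width 16-bit text versus a DBCS code page.
# Assumptions (not from the write-up): Shift-JIS stands in for "DBCS",
# and the 16-bit text is stored little-endian.

def char_at_fixed16(buf: bytes, i: int) -> str:
    """Return character i from a buffer of 16-bit code units.

    Every character is assumed to occupy exactly one 16-bit unit, as in
    the model described above, so the byte offset is simply 2 * i.
    """
    return buf[2 * i: 2 * i + 2].decode("utf-16-le")


def is_sjis_lead_byte(b: int) -> bool:
    """True if b starts a two-byte Shift-JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def char_at_dbcs(buf: bytes, i: int) -> str:
    """Return character i from a Shift-JIS buffer.

    There is no way to jump straight to character i: the code has to
    walk from the start, checking each lead byte to learn the width.
    """
    pos = 0
    for _ in range(i):
        pos += 2 if is_sjis_lead_byte(buf[pos]) else 1
    width = 2 if is_sjis_lead_byte(buf[pos]) else 1
    return buf[pos: pos + width].decode("shift_jis")


text = "A\u65e5\u672cB"               # "A", two kanji, "B"
wide = text.encode("utf-16-le")        # 4 characters, 8 bytes
dbcs = text.encode("shift_jis")        # 4 characters, 6 bytes

print(char_at_fixed16(wide, 2) == char_at_dbcs(dbcs, 2))
```

Both helpers return the same character, but only the 16-bit version gets there with a single multiplication; the DBCS version must inspect every preceding byte.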

Unicode Goals

The following are the basic goals of Unicode:

	Eliminate special-case Systems and Applications code for multiple character 
sets, thus speeding up localization and reducing testing time.

	Make a larger range of characters available than will fit in a single-byte code 
page.

	Ensure that character code is independent of compression and text formatting 
considerations.

	Make code more efficient when used as an internal processing code.

	Ensure that Unicode is complete when used as data interchange or reference code.

Unicode provides a model that has much of the simplicity and efficiency of the 
plain-text model, but with greater international capabilities. Microsoft is supporting 
Unicode in the 32-bit API for Windows (Win32 API), and several other companies are 
working on Unicode implementation.

Unicode is a Character Encoding

Unlike DBCS, Unicode is not a variable-length encoding. Text in Unicode cannot be passed 
to functions that expect zero-terminated ASCII strings. The first 256 characters 
in Unicode have the same layout as the international standard ISO 8859/1, but each 
code is zero-extended to 16 bits. The terminator in a Unicode string is 0x0000, 
because many Unicode characters contain one null byte.
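
The terminator issue can be sketched in a few lines of Python. This is an illustration under stated assumptions, not something from the write-up: the 16-bit text is stored little-endian, and byte_strlen and wide_strlen are hypothetical stand-ins for a byte-oriented and a 16-bit-oriented string-length routine.

```python
# Sketch: why 16-bit text needs a 16-bit terminator.
# Assumption (not from the write-up): little-endian storage.

def byte_strlen(buf: bytes) -> int:
    """Length as a zero-terminated byte-string routine would see it:
    count bytes up to the first 0x00 byte."""
    return buf.index(0)


def wide_strlen(buf: bytes) -> int:
    """Count 16-bit units up to the first 0x0000 unit."""
    n = 0
    while 2 * n + 2 <= len(buf):
        if buf[2 * n: 2 * n + 2] == b"\x00\x00":
            return n
        n += 1
    raise ValueError("no 0x0000 terminator found")


text = "H\u00e9llo"                               # all within ISO 8859/1
wide = text.encode("utf-16-le") + b"\x00\x00"    # add the 0x0000 terminator

# Zero extension: the low bytes are the ISO 8859/1 codes, the high bytes are 0.
low_bytes = wide[0:2 * len(text):2]
high_bytes = wide[1:2 * len(text):2]

print(low_bytes == text.encode("latin-1"))   # True
print(high_bytes == bytes(len(text)))        # True
print(byte_strlen(wide), wide_strlen(wide))  # 1 5
```

The byte-oriented routine stops after the first character because the high byte of "H" is already zero, which is the reason such text cannot be handed to functions expecting zero-terminated ASCII.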

ISO 8859/1 was used as the source for the Windows ANSI character set. Except for the 80H 
through 9FH range, which is the C1 control-code range, there is a one-to-one 
correspondence between Windows ANSI and Unicode. The few additional characters defined 
in the Windows ANSI set in this range have corresponding characters elsewhere in 
Unicode.
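
The Windows ANSI correspondence can be checked with a short Python sketch. One loud assumption: it uses Python's "cp1252" codec as a stand-in for the Windows ANSI set. That codec reflects the present-day code page, which may contain characters in the 80H through 9FH range that the 1994 set did not. The function name ansi_to_unicode is hypothetical.

```python
# Sketch: Windows ANSI versus Unicode, using Python's cp1252 codec
# as an approximation of the Windows ANSI character set (assumption).

def ansi_to_unicode(byte_value: int) -> str:
    """Map one Windows ANSI byte to its Unicode character."""
    return bytes([byte_value]).decode("cp1252")


# Outside 80H..9FH the mapping is one-to-one: the Unicode code point
# is just the byte value, zero-extended.
outside_c1 = list(range(0x00, 0x80)) + list(range(0xA0, 0x100))
one_to_one = all(ansi_to_unicode(b) == chr(b) for b in outside_c1)
print(one_to_one)  # True

# Inside 80H..9FH the extra ANSI characters live elsewhere in Unicode.
# 93H is the left double quotation mark, which Unicode places at U+201C.
print(hex(ord(ansi_to_unicode(0x93))))  # 0x201c
```

Some bytes in the 80H through 9FH range are unassigned in this codec and raise UnicodeDecodeError, so the sketch only probes one that is known to be defined.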

Volume 1 of the Unicode 1.0 standard was published in the fall of 1991. This volume 
contains assigned codes for every major alphabet and symbol in the world except for the 
Han characters of the Korean, Chinese, and Japanese writing systems. Volume 2, 
containing a unified Han character set, became available in early 1992.

 



+---------------------------------------------------------------+
| Vladimir Sukonnik		Voice: 1-508-879-6994		|
| Principal Software Engineer	http://www.process.com		|
| Process Software Corp		Fax:   1-508-879-0042		|
| 959 Concord Street		E-mail: sukonnik@process.com or |
| Framingham, MA 01760 USA		sukonnik@bumetb.bu.edu	|
+---------------------------------------------------------------+

