Date: Wed, 30 Nov 1994 04:04:47 +0100
Errors-To: listmaster@www0.cern.ch
Reply-To: gtn@ebt.com
From: Gavin Nicol <gtn@ebt.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>

>BTW, you might want to check with some Japanese users before
>deciding that the UTF-8 encoding of Unicode characters is the
>right solution for Japanese. Common encodings in Japan are
>Shift-JIS and EUC, and there is a *far* greater installed
>base of users of these encodings than there is for UTF-8.
>Therefore, users will probably want something that works
>with their existing software rather than an encoding that
>has been dictated by a group that has few Asian representatives.

First, let me say that I live in Japan, and have done for over 8 years now. I think I know the industry here reasonably well. Most people here who have something against Unicode are either pushing their own standards, or have some *political* reason for not agreeing. One can waffle about the fact that the Kanji Unification leads to cases where the character displayed is not what one would expect, but that is an *application* and *display* problem, not a codeset problem.

Due to the line of work I'm in, I have to talk to many people here who are in the publishing, printing, and computer industries. When I ask them how we will deal with large scale international document exchange (and actually, this is relevant to OMG too), they always start by saying "well, let's have the browsers understand EUC and SJIS". I then point out that there are *dozens* of local character encoding systems, and that more are being invented every day. I ask them if it is reasonable to expect every browser to be able to handle a potentially huge number of encoding systems. I think not.

Rather, I think a more reasonable approach is to use UTF-8 or UTF-7 for character encoding while the document is *in the process of being transferred*. It is then converted to a local encoding by the browser.
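The transfer scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `to_wire` and `from_wire` are hypothetical, the codec names are the ones Python's standard library happens to use, and the sample text is made up. It is not a description of how any particular browser or server works.

```python
# Minimal sketch: local encoding -> UTF-8 on the wire -> local encoding.
# to_wire / from_wire are hypothetical helper names used for illustration.

def to_wire(local_bytes: bytes, local_encoding: str) -> bytes:
    """Sender side: convert a locally encoded document to UTF-8 for transfer."""
    return local_bytes.decode(local_encoding).encode("utf-8")


def from_wire(wire_bytes: bytes, local_encoding: str) -> bytes:
    """Receiver side: convert the UTF-8 transfer form to the local encoding."""
    return wire_bytes.decode("utf-8").encode(local_encoding)


# A document authored in Shift-JIS (e.g. on a Windows machine) ...
text = "日本語"
sjis_doc = text.encode("shift_jis")

# ... travels as UTF-8 ...
wire = to_wire(sjis_doc, "shift_jis")

# ... and is stored or displayed as EUC-JP (e.g. on a Unix machine).
euc_doc = from_wire(wire, "euc_jp")
```

Each side only needs to know its own local encoding and the transfer encoding; neither needs to know what the other end uses.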
With such a system we might end up with many cases like:

  Windows      "The Ether where electrons flow"      Unix
  SJIS -----------------> UTF-8 -----------------> EUC

And this *vastly* simplifies the problem. Instead of having to potentially understand hundreds of encodings, and being able to convert them all into the local encoding (and UTF would be one of them), one now only has to understand *two* encodings, and be able to convert between them.

Many people will now be saying "yes, but what about the cases where you *can't* display the document, and the case where the Kanji Unification rears its head". Well, in the first case, one can offer to save the document in its UTF form, which would be the same even if UTF-8 wasn't being used (except you'd have to save it in some other format). In addition, I think that as computational linguistics improves, it will be possible to convert the text into a local encoding of the phonetic representation of the text (for example, the Kanji "hara" would be converted from the Unicode encoding to the word "hara" on ASCII-only systems). This will probably happen *after* most systems become internationalised enough to handle the most common cases :-(.

For the Unification problem, I think it should be possible to mark up the UTF-8 document such that the display characteristics will be preserved as part of the translation from the local encoding to the UTF-8 encoding. For example, if one sends a Kanji, one could mark the document language as "Japanese" or "Chinese", or even do it with some kind of inline escape sequence (using the area reserved for extensions).

When I discuss these ideas with people here, they are very receptive. Many people say that the more conservative and governmental thinkers are very stubborn, so things will not change quickly. I think that by implementing such systems *now*, we can force them to accept what I think is a very practical solution to handling the myriad character encoding schemes extant in the world.
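The "can't display" case mentioned above might look roughly like the following on the receiving side. This is a sketch only: the function name `deliver` and the `"display"`/`"save"` labels are hypothetical, and a real client would ask the user rather than return a tuple.

```python
# Sketch of the receiver-side fallback: convert to the local encoding if
# possible, otherwise keep the document in its UTF-8 transfer form.
# `deliver` and the action labels are hypothetical names for illustration.

def deliver(wire_bytes: bytes, local_encoding: str) -> tuple[str, bytes]:
    text = wire_bytes.decode("utf-8")
    try:
        # Normal case: the local encoding can represent every character.
        return ("display", text.encode(local_encoding))
    except UnicodeEncodeError:
        # The local encoding cannot represent the text, so offer the
        # unmodified UTF-8 bytes for saving instead.
        return ("save", wire_bytes)


kanji_wire = "原".encode("utf-8")      # the Kanji "hara"

on_euc = deliver(kanji_wire, "euc_jp")   # an EUC system can display it
on_ascii = deliver(kanji_wire, "ascii")  # an ASCII-only system cannot
```

The phonetic conversion suggested above (Kanji to "hara") would slot into the `except` branch, but it needs linguistic knowledge that no simple codec provides, so it is left out here.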

I should note that the stylesheet format which is being defined by SGML-Open *will* be able to handle international display characteristics like bidirectional text, though it will be an optional feature in the lowest functionality version. The upgrade path will be seamless, or close to it.