[3710] in BarnOwl Developers

Re: utf8proc: case-folded NFD needs extra normalization for

daemon@ATHENA.MIT.EDU (Jan Behrens)
Sun Jan 12 16:12:08 2014

Date: Sun, 12 Jan 2014 22:12:02 +0100
From: Jan Behrens <info2012@public-software-group.org>
To: Anders Kaseorg <andersk@mit.edu>
Cc: barnowl-dev@mit.edu
In-Reply-To: <52C4F53D.1010001@mit.edu>

Dear Anders,

On Thu, 02 Jan 2014 00:12:29 -0500
Anders Kaseorg <andersk@MIT.EDU> wrote:

> utf8proc_map(…, UTF8PROC_CASEFOLD | UTF8PROC_DECOMPOSE) maps
>    U+03C9 U+0345 U+0342 (ω+◌ͅ+◌͂)
> to
>    U+03C9 U+03B9 U+0342 (ω ι+◌͂).
> 
> However, according to the Unicode specification for canonical
> caseless matching, this operation requires a normalization step at
> _both_ the beginning and the end:
>    NFD(toCasefold(NFD(X)))
> and similarly for compatibility caseless matching:
>    NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
> “The invocations of normalization before case folding in the
> preceding definitions are to catch very infrequent edge cases.
> Normalization is not required before case folding, except for the
> character U+0345 ◌ͅ COMBINING GREEK YPOGEGRAMMENI and any characters
> that have it as part of their decomposition, such as U+1FC3 ῃ GREEK
> SMALL LETTER ETA WITH YPOGEGRAMMENI.”
> http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
> 
> In this case, the NFD form of the input string has the two combining 
> characters swapped:
>    U+03C9 U+0342 U+0345 (ω+◌͂+◌ͅ)
> which toCasefold with NFD now maps to
>    U+03C9 U+0342 U+03B9 (ω+◌͂ ι).
> 
> Anders

Thank you for the report; I will have a look at it.
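
For readers following along, here is a minimal sketch of the two
definitions quoted above. It uses Python's standard unicodedata module
rather than utf8proc, and it assumes that str.casefold() is an acceptable
stand-in for toCasefold. The function names are made up for illustration.

```python
import unicodedata


def canonical_caseless(s: str) -> str:
    """NFD(toCasefold(NFD(s))); str.casefold() stands in for toCasefold (assumption)."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


def compatibility_caseless(s: str) -> str:
    """NFKD(toCasefold(NFKD(toCasefold(NFD(s))))), same assumption as above."""
    step = unicodedata.normalize("NFD", s).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)


def codepoints(s: str) -> str:
    """Hypothetical helper: render a string as 'U+XXXX U+XXXX ...'."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# The string from the report: omega, ypogegrammeni, perispomeni.
x = "\u03c9\u0345\u0342"

# Fold first and normalize only afterwards (no leading NFD).
without_pre_nfd = unicodedata.normalize("NFD", x.casefold())

# Normalize, fold, normalize again, as in the quoted definition.
with_pre_nfd = canonical_caseless(x)

print(codepoints(without_pre_nfd))  # U+03C9 U+03B9 U+0342
print(codepoints(with_pre_nfd))     # U+03C9 U+0342 U+03B9
```

The two results differ only in where the iota ends up. Once U+0345 has
been folded to U+03B9 it is a base letter, so a trailing NFD can no longer
move the perispomeni in front of it; the reordering has to happen before
folding.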

Kind regards
Jan Behrens

-- 
Public Software Group e. V.
Johannisstr. 12, 10117 Berlin, Germany

www.public-software-group.org
vorstand at public-software-group.org

Registered in the register of associations
of the Charlottenburg District Court
Registration number: VR 28873 B

Board members (each with sole power of representation):
Jan Behrens
Axel Kistner
Andreas Nitsche
Björn Swierczek