[3710] in BarnOwl Developers
Re: utf8proc: case-folded NFD needs extra normalization for
daemon@ATHENA.MIT.EDU (Jan Behrens)
Sun Jan 12 16:12:08 2014
Date: Sun, 12 Jan 2014 22:12:02 +0100
From: Jan Behrens <info2012@public-software-group.org>
To: Anders Kaseorg <andersk@mit.edu>
Cc: barnowl-dev@mit.edu
In-Reply-To: <52C4F53D.1010001@mit.edu>
Dear Anders,
On Thu, 02 Jan 2014 00:12:29 -0500
Anders Kaseorg <andersk@MIT.EDU> wrote:
> utf8proc_map(…, UTF8PROC_CASEFOLD | UTF8PROC_DECOMPOSE) maps
> U+03C9 U+0345 U+0342 (ω+◌ͅ+◌͂)
> to
> U+03C9 U+03B9 U+0342 (ω ι+◌͂).
>
> However, according to the Unicode specification for canonical
> caseless matching, this operation requires a normalization step at
> _both_ the beginning and the end:
> NFD(toCasefold(NFD(X)))
> and similarly for compatibility caseless matching:
> NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
> “The invocations of normalization before case folding in the
> preceding definitions are to catch very infrequent edge cases.
> Normalization is not required before case folding, except for the
> character U+0345 ◌ͅ COMBINING GREEK YPOGEGRAMMENI and any characters
> that have it as part of their decomposition, such as U+1FC3 ῃ GREEK
> SMALL LETTER ETA WITH YPOGEGRAMMENI.”
> http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
>
> In this case, the NFD form of the input string has the two combining
> characters swapped:
> U+03C9 U+0342 U+0345 (ω+◌͂+◌ͅ)
> which toCasefold with NFD now maps to
> U+03C9 U+0342 U+03B9 (ω+◌͂ ι).
>
> Anders
Thank you for the report. I will have a look at it.
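
For reference, the difference described in the quoted report can be reproduced outside utf8proc. The following is only a minimal sketch, assuming that Python's str.casefold() and unicodedata.normalize() are acceptable stand-ins for toCasefold and NFD/NFKD; it does not call utf8proc, and the helper names (naive_fold, canonical_caseless, compat_caseless, codepoints) are made up for illustration.

```python
import unicodedata


def naive_fold(s: str) -> str:
    """Fold first, then decompose: NFD(toCasefold(X))."""
    return unicodedata.normalize("NFD", s.casefold())


def canonical_caseless(s: str) -> str:
    """Canonical caseless form: NFD(toCasefold(NFD(X)))."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


def compat_caseless(s: str) -> str:
    """Compatibility caseless form:
    NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))."""
    step = unicodedata.normalize("NFD", s).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)


def codepoints(s: str) -> str:
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Input from the report: omega, ypogegrammeni, perispomeni.
X = "\u03c9\u0345\u0342"

print(codepoints(naive_fold(X)))          # U+03C9 U+03B9 U+0342
print(codepoints(canonical_caseless(X)))  # U+03C9 U+0342 U+03B9

# X and NFD(X) are canonically equivalent, yet the naive order of
# operations folds them to different strings.
X_nfd = unicodedata.normalize("NFD", X)
print(naive_fold(X) == naive_fold(X_nfd))                  # False
print(canonical_caseless(X) == canonical_caseless(X_nfd))  # True
```

The initial NFD step moves U+0345 (combining class 240) behind U+0342 (combining class 230) before folding turns it into the non-combining U+03B9, after which no reordering is possible.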
Kind regards
Jan Behrens
--
Public Software Group e. V.
Johannisstr. 12, 10117 Berlin, Germany
www.public-software-group.org
vorstand at public-software-group.org
registered in the register of associations
of the Charlottenburg District Court (Amtsgericht Charlottenburg)
Registration number: VR 28873 B
Board members (each authorized to act alone):
Jan Behrens
Axel Kistner
Andreas Nitsche
Björn Swierczek