[3680] in BarnOwl Developers
utf8proc: case-folded NFD needs extra normalization for combining
daemon@ATHENA.MIT.EDU (Anders Kaseorg)
Thu Jan 2 00:12:36 2014
Date: Thu, 02 Jan 2014 00:12:29 -0500
From: Anders Kaseorg <andersk@MIT.EDU>
To: info2012@public-software-group.org
CC: barnowl-dev@mit.edu
utf8proc_map(…, UTF8PROC_CASEFOLD | UTF8PROC_DECOMPOSE) maps
U+03C9 U+0345 U+0342 (ω+◌ͅ+◌͂)
to
U+03C9 U+03B9 U+0342 (ω ι+◌͂).
However, according to the Unicode specification for canonical caseless
matching, this operation requires a normalization step at _both_ the
beginning and the end:
NFD(toCasefold(NFD(X)))
and similarly for compatibility caseless matching:
NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
“The invocations of normalization before case folding in the preceding
definitions are to catch very infrequent edge cases. Normalization is
not required before case folding, except for the character U+0345 ◌ͅ
COMBINING GREEK YPOGEGRAMMENI and any characters that have it as part of
their decomposition, such as U+1FC3 ῃ GREEK SMALL LETTER ETA WITH
YPOGEGRAMMENI.”
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
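For illustration only, the two definitions above can be sketched in Python,
assuming that str.casefold() and unicodedata.normalize() are acceptable
stand-ins for toCasefold and NFD/NFKD. The function names are made up for
this sketch and are not part of utf8proc.

```python
import unicodedata


def canonical_caseless_key(s: str) -> str:
    """Sketch of NFD(toCasefold(NFD(X)))."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


def compat_caseless_key(s: str) -> str:
    """Sketch of NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))."""
    step = unicodedata.normalize("NFD", s).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)


# U+212B ANGSTROM SIGN and U+00E5 are a canonical caseless match.
print(canonical_caseless_key("\u212B") == canonical_caseless_key("\u00E5"))

# U+3391 SQUARE KHZ matches "KHZ" only under the compatibility definition,
# because its "kHz" decomposition has to be case-folded a second time.
print(canonical_caseless_key("\u3391") == canonical_caseless_key("KHZ"))
print(compat_caseless_key("\u3391") == compat_caseless_key("KHZ"))
```

Two strings are then a caseless match when their keys compare equal.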
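In this case, the NFD form of the input string has the two combining
characters swapped, because canonical ordering places U+0342 (combining
class 230) before U+0345 (combining class 240):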
U+03C9 U+0342 U+0345 (ω+◌͂+◌ͅ)
which toCasefold followed by NFD then maps to
U+03C9 U+0342 U+03B9 (ω+◌͂ ι).
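
The difference can be reproduced outside utf8proc. The following is a small
sketch using Python's unicodedata module, on the assumption that
str.casefold() behaves like toCasefold; code_points and nfd are helper names
invented here.

```python
import unicodedata


def code_points(s: str) -> str:
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


x = "\u03C9\u0345\u0342"  # omega, ypogegrammeni, perispomeni

# Case folding first, then NFD: the folded iota lands before U+0342.
fold_then_nfd = nfd(x.casefold())
print(code_points(fold_then_nfd))  # U+03C9 U+03B9 U+0342

# NFD first reorders the combining marks by combining class ...
print(code_points(nfd(x)))         # U+03C9 U+0342 U+0345

# ... so NFD(toCasefold(NFD(X))) puts the folded iota after U+0342.
nfd_fold_nfd = nfd(nfd(x).casefold())
print(code_points(nfd_fold_nfd))   # U+03C9 U+0342 U+03B9
```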
Anders