[3680] in BarnOwl Developers

utf8proc: case-folded NFD needs extra normalization for combining

daemon@ATHENA.MIT.EDU (Anders Kaseorg)
Thu Jan 2 00:12:36 2014

Date: Thu, 02 Jan 2014 00:12:29 -0500
From: Anders Kaseorg <andersk@MIT.EDU>
To: info2012@public-software-group.org
CC: barnowl-dev@mit.edu

utf8proc_map(…, UTF8PROC_CASEFOLD | UTF8PROC_DECOMPOSE) maps
   U+03C9 U+0345 U+0342 (ω+◌ͅ+◌͂)
to
   U+03C9 U+03B9 U+0342 (ω ι+◌͂).

However, according to the Unicode specification for canonical caseless 
matching, this operation requires a normalization step at _both_ the 
beginning and the end:
   NFD(toCasefold(NFD(X)))
and similarly for compatibility caseless matching:
   NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
“The invocations of normalization before case folding in the preceding 
definitions are to catch very infrequent edge cases.  Normalization is 
not required before case folding, except for the character U+0345 ◌ͅ 
COMBINING GREEK YPOGEGRAMMENI and any characters that have it as part of 
their decomposition, such as U+1FC3 ῃ GREEK SMALL LETTER ETA WITH 
YPOGEGRAMMENI.”
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
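
For illustration only, the two definitions above can be sketched with
Python's standard unicodedata module and str.casefold() standing in for
toCasefold.  This is not utf8proc code and says nothing about how utf8proc
is implemented; the function names are made up for this example.

```python
import unicodedata


def canonical_casefold(s: str) -> str:
    """NFD(toCasefold(NFD(X))), with str.casefold() standing in for toCasefold."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


def compat_casefold(s: str) -> str:
    """NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))."""
    step = unicodedata.normalize("NFD", s).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)


def canonical_caseless_match(a: str, b: str) -> bool:
    """Two strings match if their canonical case-folded forms are identical."""
    return canonical_casefold(a) == canonical_casefold(b)


# U+1FC3 (eta with ypogegrammeni) against capital eta + U+0345.
print(canonical_caseless_match("\u1fc3", "\u0397\u0345"))
# Fullwidth "A" only matches "a" under the compatibility definition.
print(compat_casefold("\uff21") == compat_casefold("a"))
```

Normalizing both before and after folding means the result does not depend
on the order in which the caller happened to supply the combining marks.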

In this case, the NFD form of the input string has the two combining 
characters swapped:
   U+03C9 U+0342 U+0345 (ω+◌͂+◌ͅ)
which toCasefold followed by NFD then maps to
   U+03C9 U+0342 U+03B9 (ω+◌͂ ι).
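
The difference can be reproduced outside utf8proc.  The following is a
minimal sketch using Python's unicodedata module, with str.casefold()
assumed to play the role of toCasefold; the helper codepoints() is
hypothetical and exists only to print the results readably.

```python
import unicodedata


def codepoints(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


x = "\u03c9\u0345\u0342"  # omega, ypogegrammeni, perispomeni

# Case-fold first, normalize only at the end.
fold_then_nfd = unicodedata.normalize("NFD", x.casefold())

# Normalize first (U+0342 has combining class 230, U+0345 has 240,
# so canonical ordering swaps them), then fold, then normalize again.
nfd_x = unicodedata.normalize("NFD", x)
nfd_fold_nfd = unicodedata.normalize("NFD", nfd_x.casefold())

print(codepoints(fold_then_nfd))  # U+03C9 U+03B9 U+0342
print(codepoints(nfd_x))          # U+03C9 U+0342 U+0345
print(codepoints(nfd_fold_nfd))   # U+03C9 U+0342 U+03B9
```

Once U+0345 has been folded to U+03B9 it is a starter, so the final NFD can
no longer move U+0342 across it; the reordering has to happen before the
fold.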

Anders
