[3724] in BarnOwl Developers
Re: utf8proc: case-folded NFD needs extra normalization for combining
daemon@ATHENA.MIT.EDU (Anders Kaseorg)
Tue Feb 4 07:05:46 2014
Date: Tue, 4 Feb 2014 07:05:39 -0500 (EST)
From: Anders Kaseorg <andersk@MIT.EDU>
To: Jan Behrens <info2012@public-software-group.org>
cc: barnowl-dev@mit.edu
In-Reply-To: <20140204123752.dc1e68a66f236c53a418dbb9@public-software-group.org>
On Tue, 4 Feb 2014, Jan Behrens wrote:
> I just updated the project page to mention this issue. Unfortunately,
> utf8proc does the UTF-decoding and the Unicode processing in one step,
> thus fixing this bug efficiently(!) seems to be non-trivial. If you know
> of any implementation guidelines how to efficiently handle this corner
> case without having to apply normalization several times, I would
> appreciate any references.
Well, more recent versions of Unicode offer a little bit more guidance.
See Default Caseless Matching in
http://www.unicode.org/versions/Unicode6.3.0/ch03.pdf:
“Normalization is not required before case folding, except for the
character U+0345 ◌ͅ COMBINING GREEK YPOGEGRAMMENI and any characters that
have it as part of their decomposition, such as U+1FC3 ῃ GREEK SMALL
LETTER ETA WITH YPOGEGRAMMENI. In practice, optimized versions of
canonical caseless matching can catch these special cases, thereby
avoiding an extra normalization step for each comparison.”
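To make the special case concrete, here is a rough Python sketch (not
utf8proc code, and only one possible reading of the passage above). It
assumes Python's str.casefold() and unicodedata.normalize() as stand-ins
for toCasefold and NFD; the helper names are made up for illustration. It
pre-normalizes only when a string contains U+0345 or a character whose
canonical decomposition contains it.

```python
import unicodedata
from functools import lru_cache

YPOGEGRAMMENI = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI


def nfd(s):
    return unicodedata.normalize("NFD", s)


def naive_key(s):
    # Case-fold first, normalize once afterwards (no pre-normalization).
    return nfd(s.casefold())


def canonical_key(s):
    # Unoptimized form: NFD(toCasefold(NFD(s))).
    return nfd(nfd(s).casefold())


@lru_cache(maxsize=None)
def involves_ypogegrammeni(ch):
    # True for U+0345 itself and for any character whose full canonical
    # decomposition contains it (for example U+1FC3).
    return YPOGEGRAMMENI in nfd(ch)


def optimized_key(s):
    # Hypothetical optimization: pay for the first NFD pass only when
    # one of the special characters is actually present.
    if any(involves_ypogegrammeni(ch) for ch in s):
        s = nfd(s)
    return nfd(s.casefold())


# Two canonically equivalent spellings of alpha + oxia + ypogegrammeni.
x = "\u1fb3\u0301"  # U+1FB3 followed by COMBINING ACUTE ACCENT
y = "\u1fb4"        # precomposed U+1FB4

print(naive_key(x) == naive_key(y))          # False
print(canonical_key(x) == canonical_key(y))  # True
print(optimized_key(x) == optimized_key(y))  # True
```

The naive key fails because folding U+1FB3 yields alpha + iota, so the
following accent ends up attached to the iota instead of the alpha.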
“Caseless matching for identifiers can be simplified and optimized by
using the NFKC_Casefold mapping. That mapping incorporates internally the
derived results of iterating case folding and NFKD normalization.”
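As an illustration of why the iteration matters, here is a small Python
sketch of the iterated form, NFKD(toCasefold(NFKD(toCasefold(NFD(X))))),
which is how the same chapter defines compatibility caseless matching.
Python's standard library does not expose NFKC_Casefold, so this is not
that mapping itself, only the kind of repeated folding it is said to
precompute; the function names are made up for illustration.

```python
import unicodedata


def compat_caseless_key(s):
    # Start from NFD, then apply (casefold, NFKD) twice. The second
    # round folds letters that only appear after the first NFKD pass.
    s = unicodedata.normalize("NFD", s)
    for _ in range(2):
        s = unicodedata.normalize("NFKD", s.casefold())
    return s


def compat_caseless_match(x, y):
    return compat_caseless_key(x) == compat_caseless_key(y)


# U+2120 SERVICE MARK has no case folding of its own, but its
# compatibility decomposition is the uppercase pair "SM".
single_pass = unicodedata.normalize("NFKD", "\u2120".casefold())
print(single_pass)                    # SM
print(compat_caseless_key("\u2120"))  # sm
```

A single fold-then-normalize pass leaves "SM" uppercase, which is the
sort of case a precomputed mapping lets you handle in one step.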
> I would like to get back to care for utf8proc as soon as possible, but
> after having been very busy during the last weeks, I got the flu and
> need to recover for a while.
Feel better,
Anders