[3724] in BarnOwl Developers


Re: utf8proc: case-folded NFD needs extra normalization for combining

daemon@ATHENA.MIT.EDU (Anders Kaseorg)
Tue Feb 4 07:05:46 2014

Date: Tue, 4 Feb 2014 07:05:39 -0500 (EST)
From: Anders Kaseorg <andersk@MIT.EDU>
To: Jan Behrens <info2012@public-software-group.org>
cc: barnowl-dev@mit.edu
In-Reply-To: <20140204123752.dc1e68a66f236c53a418dbb9@public-software-group.org>

On Tue, 4 Feb 2014, Jan Behrens wrote:
> I just updated the project page to mention this issue. Unfortunately,
> utf8proc does the UTF-decoding and the Unicode processing in one step,
> thus fixing this bug efficiently(!) seems to be non-trivial. If you know
> of any implementation guidelines how to efficiently handle this corner
> case without having to apply normalization several times, I would
> appreciate any references.

Well, more recent versions of Unicode offer a little bit more guidance.
See Default Caseless Matching in
http://www.unicode.org/versions/Unicode6.3.0/ch03.pdf:

“Normalization is not required before case folding, except for the
character U+0345 ◌ͅ COMBINING GREEK YPOGEGRAMMENI and any characters that
have it as part of their decomposition, such as U+1FC3 ῃ GREEK SMALL
LETTER ETA WITH YPOGEGRAMMENI.  In practice, optimized versions of
canonical caseless matching can catch these special cases, thereby
avoiding an extra normalization step for each comparison.”

“Caseless matching for identifiers can be simplified and optimized by
using the NFKC_Casefold mapping.  That mapping incorporates internally the
derived results of iterating case folding and NFKD normalization.”
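
To make the first quoted passage concrete, here is a rough Python sketch
(not utf8proc code) of canonical caseless matching,
NFD(toCasefold(NFD(X))), next to one possible reading of the "catch these
special cases" optimization. The helper names (canonical_caseless_key,
naive_key, fast_key, SPECIAL) are made up for illustration, and the sketch
assumes that Python's str.casefold() is an acceptable stand-in for
toCasefold.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI


def canonical_caseless_key(s: str) -> str:
    """Full form: NFD, then case fold, then NFD again."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


def naive_key(s: str) -> str:
    """Skips the first NFD; goes wrong when U+0345 is involved."""
    return unicodedata.normalize("NFD", s.casefold())


def _build_special_set() -> frozenset:
    """Collect every character whose NFD form contains U+0345.

    Hypothetical helper: scans all code points once (surrogates skipped).
    """
    special = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        ch = chr(cp)
        if YPOGEGRAMMENI in unicodedata.normalize("NFD", ch):
            special.add(ch)
    return frozenset(special)


SPECIAL = _build_special_set()


def fast_key(s: str) -> str:
    """Only pay for the first NFD when a special character is present."""
    if any(ch in SPECIAL for ch in s):
        s = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", s.casefold())


# Two canonically equivalent spellings: alpha + ypogegrammeni + acute,
# and alpha + acute + ypogegrammeni.
a = "\u03b1\u0345\u0301"
b = "\u03b1\u0301\u0345"
print(canonical_caseless_key(a) == canonical_caseless_key(b))
print(naive_key(a) == naive_key(b))
print(fast_key(a) == canonical_caseless_key(a))
```

The naive key fails on this pair because case folding turns U+0345
(combining class 240) into U+03B9, a base letter, so the acute accent ends
up attached to a different character depending on the input order. The
fast path only avoids the first normalization pass; the final NFD is still
applied.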

> I would like to get back to care for utf8proc as soon as possible, but
> after having been very busy during the last weeks, I got the flu and
> need to recover for a while.

Feel better,
Anders
