[3723] in BarnOwl Developers
Re: utf8proc: case-folded NFD needs extra normalization for
daemon@ATHENA.MIT.EDU (Jan Behrens)
Tue Feb 4 06:37:58 2014
Date: Tue, 4 Feb 2014 12:37:52 +0100
From: Jan Behrens <info2012@public-software-group.org>
To: Anders Kaseorg <andersk@mit.edu>
Cc: barnowl-dev@mit.edu
In-Reply-To: <20140112221202.e036c2b6bf65bdad64bc266d@public-software-group.org>
Dear Anders,
I have just updated the project page to mention this issue. Unfortunately,
utf8proc does the UTF-8 decoding and the Unicode processing in one step,
so fixing this bug efficiently(!) seems to be non-trivial. If you know
of any implementation guidelines on how to handle this corner case
efficiently without having to apply normalization several times, I would
appreciate any references.
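One possible direction, going only by the passage of the Unicode standard quoted further down: the extra normalization is said to be needed only for U+0345 and for characters that have it in their decomposition, so the first pass could be skipped for all other input. The following is a rough sketch of that idea in Python, using the standard unicodedata module rather than utf8proc itself. The function names are made up for illustration, and a real implementation would presumably use a precomputed table instead of normalizing each character.

```python
import unicodedata

# U+0345 COMBINING GREEK YPOGEGRAMMENI, the character singled out
# in the Unicode passage quoted in this thread.
YPOGEGRAMMENI = "\u0345"


def needs_prenormalization(text):
    """Return True if text contains U+0345, either directly or inside
    the canonical decomposition of one of its characters.

    Hypothetical helper: normalizing every character is slow and only
    meant to show the condition, not how to implement it efficiently.
    """
    for ch in text:
        if ch == YPOGEGRAMMENI:
            return True
        if YPOGEGRAMMENI in unicodedata.normalize("NFD", ch):
            return True
    return False


def canonical_caseless(text):
    """Compute NFD(toCasefold(NFD(X))), skipping the inner NFD when the
    input cannot be affected by the U+0345 edge case (assumption based
    on the quoted passage of the standard)."""
    if needs_prenormalization(text):
        text = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", text.casefold())
```

With this guard, ordinary text is still normalized only once, and only strings containing the iota subscript pay for the second pass.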
I would like to get back to maintaining utf8proc as soon as possible, but
after having been very busy during the last few weeks, I caught the flu and
need to recover for a while.
I will keep you updated once I get back to addressing the issue in utf8proc.
Thank you again for reporting this bug.
Kind regards
Jan Behrens
On Sun, 12 Jan 2014 22:12:02 +0100
Jan Behrens <info2012@public-software-group.org> wrote:
> Dear Anders,
>
> On Thu, 02 Jan 2014 00:12:29 -0500
> Anders Kaseorg <andersk@MIT.EDU> wrote:
>
> > utf8proc_map(…, UTF8PROC_CASEFOLD | UTF8PROC_DECOMPOSE) maps
> > U+03C9 U+0345 U+0342 (ω+◌ͅ+◌͂)
> > to
> > U+03C9 U+03B9 U+0342 (ω ι+◌͂).
> >
> > However, according to the Unicode specification for canonical
> > caseless matching, this operation requires a normalization step at
> > _both_ the beginning and the end:
> > NFD(toCasefold(NFD(X)))
> > and similarly for compatibility caseless matching:
> > NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
> > “The invocations of normalization before case folding in the
> > preceding definitions are to catch very infrequent edge cases.
> > Normalization is not required before case folding, except for the
> > character U+0345 ◌ͅ COMBINING GREEK YPOGEGRAMMENI and any characters
> > that have it as part of their decomposition, such as U+1FC3 ῃ GREEK
> > SMALL LETTER ETA WITH YPOGEGRAMMENI.”
> > http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
> >
> > In this case, the NFD form of the input string has the two
> > combining characters swapped:
> > U+03C9 U+0342 U+0345 (ω+◌͂+◌ͅ)
> > which toCasefold with NFD now maps to
> > U+03C9 U+0342 U+03B9 (ω+◌͂ ι).
> >
> > Anders
>
> Thank you for your notice, I will have a look at that.
>
> Kind regards
> Jan Behrens
>
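The example quoted above can be reproduced without utf8proc. This is a minimal sketch, assuming Python's standard unicodedata module and str.casefold() as stand-ins for NFD and toCasefold; the helper names are illustrative only. It compares folding followed by a single normalization with the form given in the standard, NFD(toCasefold(NFD(X))).

```python
import unicodedata


def nfd(s):
    """Canonical decomposition (NFD) via the standard library."""
    return unicodedata.normalize("NFD", s)


def codepoints(s):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Input from the bug report: omega, ypogegrammeni, perispomeni.
x = "\u03c9\u0345\u0342"

# Case folding followed by one normalization (the result reported
# for utf8proc in the quoted message).
single_pass = nfd(x.casefold())

# Normalization before and after case folding, as the standard defines
# canonical caseless matching.
two_pass = nfd(nfd(x).casefold())

print(codepoints(single_pass))  # U+03C9 U+03B9 U+0342
print(codepoints(two_pass))     # U+03C9 U+0342 U+03B9
```

The two results differ because U+0345 (combining class 240) is reordered after U+0342 (combining class 230) by NFD, but once it has been folded to the non-combining U+03B9 no later normalization can move it.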
--
Public Software Group e. V.
Johannisstr. 12, 10117 Berlin, Germany
www.public-software-group.org
vorstand at public-software-group.org
registered in the register of associations
of the Charlottenburg District Court
Registration number: VR 28873 B
Board members (each with sole power of representation):
Jan Behrens
Axel Kistner
Andreas Nitsche
Björn Swierczek