[3723] in BarnOwl Developers

Re: utf8proc: case-folded NFD needs extra normalization for

daemon@ATHENA.MIT.EDU (Jan Behrens)
Tue Feb 4 06:37:58 2014

Date: Tue, 4 Feb 2014 12:37:52 +0100
From: Jan Behrens <info2012@public-software-group.org>
To: Anders Kaseorg <andersk@mit.edu>
Cc: barnowl-dev@mit.edu
In-Reply-To: <20140112221202.e036c2b6bf65bdad64bc266d@public-software-group.org>

Dear Anders,

I just updated the project page to mention this issue. Unfortunately,
utf8proc does the UTF-8 decoding and the Unicode processing in one step,
so fixing this bug efficiently(!) seems to be non-trivial. If you know
of any implementation guidelines on how to handle this corner case
efficiently without applying normalization several times, I would
appreciate any references.

I would like to get back to maintaining utf8proc as soon as possible, but
after being very busy over the last few weeks, I caught the flu and
need some time to recover.

I will let you know once I get back to working on this issue in utf8proc.
Thank you again for reporting this bug.


Kind regards
Jan Behrens


On Sun, 12 Jan 2014 22:12:02 +0100
Jan Behrens <info2012@public-software-group.org> wrote:

> Dear Anders,
> 
> On Thu, 02 Jan 2014 00:12:29 -0500
> Anders Kaseorg <andersk@MIT.EDU> wrote:
> 
> > utf8proc_map(…, UTF8PROC_CASEFOLD | UTF8PROC_DECOMPOSE) maps
> >    U+03C9 U+0345 U+0342 (ω+◌ͅ+◌͂)
> > to
> >    U+03C9 U+03B9 U+0342 (ω ι+◌͂).
> > 
> > However, according to the Unicode specification for canonical
> > caseless matching, this operation requires a normalization step at
> > _both_ the beginning and the end:
> >    NFD(toCasefold(NFD(X)))
> > and similarly for compatibility caseless matching:
> >    NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
> > “The invocations of normalization before case folding in the
> > preceding definitions are to catch very infrequent edge cases.
> > Normalization is not required before case folding, except for the
> > character U+0345 ◌ͅ COMBINING GREEK YPOGEGRAMMENI and any characters
> > that have it as part of their decomposition, such as U+1FC3 ῃ GREEK
> > SMALL LETTER ETA WITH YPOGEGRAMMENI.”
> > http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
> > 
> > In this case, the NFD form of the input string has the two
> > combining characters swapped:
> >    U+03C9 U+0342 U+0345 (ω+◌͂+◌ͅ)
> > which toCasefold with NFD now maps to
> >    U+03C9 U+0342 U+03B9 (ω+◌͂ ι).
> > 
> > Anders
> 
> Thank you for your notice, I will have a look at that.
> 
> Kind regards
> Jan Behrens
> 
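
The difference between the single-pass result and the two-pass
definition quoted above can be reproduced with a small sketch. It does
not call utf8proc; it assumes that Python's standard
unicodedata.normalize and str.casefold are acceptable stand-ins for NFD
and toCasefold, and the function names are hypothetical, chosen only
for this illustration.

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition, including canonical reordering of marks."""
    return unicodedata.normalize("NFD", s)


def single_pass_fold(s: str) -> str:
    # Assumed model of the reported behaviour: case-fold the input as
    # given, then decompose once at the end.
    return nfd(s.casefold())


def canonical_caseless_fold(s: str) -> str:
    # NFD(toCasefold(NFD(X))), the definition quoted from the standard.
    return nfd(nfd(s).casefold())


def codepoints(s: str) -> str:
    return " ".join(f"U+{ord(c):04X}" for c in s)


x = "\u03c9\u0345\u0342"  # omega, ypogegrammeni, perispomeni
y = "\u03c9\u0342\u0345"  # same string with the marks in NFD order

print(codepoints(single_pass_fold(x)))         # U+03C9 U+03B9 U+0342
print(codepoints(canonical_caseless_fold(x)))  # U+03C9 U+0342 U+03B9
print(canonical_caseless_fold(x) == canonical_caseless_fold(y))  # True
print(single_pass_fold(x) == single_pass_fold(y))                # False
```

The last two lines show why the extra step matters: x and y are
canonically equivalent, yet only the two-pass form gives them the same
folded key.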

-- 
Public Software Group e. V.
Johannisstr. 12, 10117 Berlin, Germany

www.public-software-group.org
vorstand at public-software-group.org

Registered in the register of associations
of the Charlottenburg District Court (Amtsgericht Charlottenburg)
Registration number: VR 28873 B

Board members (each authorized to act as sole representative):
Jan Behrens
Axel Kistner
Andreas Nitsche
Björn Swierczek
