A Note on Unicode Case-Folding and Normalization

How to handle line input

With all of these Unicode transformations hovering about, an author might reasonably ask about the right way to handle line input. Our recommendation is: call glk_buffer_to_lower_case_uni(), followed by glk_buffer_canon_normalize_uni(), and then parse the result. The parser should of course match against strings (dictionary words, for example) that have been put through the same two steps.
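The recommendation can be sketched in C roughly as follows. The helper name prepare_line_input() is hypothetical. The two ASCII-only stand-in functions are included only so that the sketch compiles without a Glk library; a real program would include glk.h and delete them. The clamping of the return value assumes that the Glk calls report the full converted length even when the result had to be truncated to fit the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* glk.h normally supplies this typedef and the two functions below.
   The ASCII-only stand-ins exist only so this sketch compiles on its
   own; they are NOT the real Glk implementations. */
typedef uint32_t glui32;

glui32 glk_buffer_to_lower_case_uni(glui32 *buf, glui32 len, glui32 numchars)
{
    glui32 i;
    (void)len; /* ASCII lower-casing never lengthens the text */
    for (i = 0; i < numchars; i++) {
        if (buf[i] >= 'A' && buf[i] <= 'Z')
            buf[i] += 'a' - 'A';
    }
    return numchars;
}

glui32 glk_buffer_canon_normalize_uni(glui32 *buf, glui32 len, glui32 numchars)
{
    (void)buf;
    (void)len; /* pure ASCII text is already in composed form */
    return numchars;
}

/* Hypothetical helper: lower-case, then normalize, in place.
   buf holds numchars characters and has room for len characters.
   Returns the number of characters left in buf for the parser. */
glui32 prepare_line_input(glui32 *buf, glui32 len, glui32 numchars)
{
    assert(buf != NULL && numchars <= len);

    numchars = glk_buffer_to_lower_case_uni(buf, len, numchars);
    if (numchars > len)
        numchars = len; /* assumed: result was truncated to fit */

    numchars = glk_buffer_canon_normalize_uni(buf, len, numchars);
    if (numchars > len)
        numchars = len;

    return numchars;
}
```

The same helper would be applied to any string the parser compares against, so that both sides of the comparison have gone through identical steps.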

The Unicode spec (section 3.13) gives a different, three-step process: decomposition, case-folding, and decomposition again. Our recommendation comes through a series of practical compromises:

  • The initial decomposition is only necessary because of a historical error in the Unicode spec: character 0x0345 (COMBINING GREEK YPOGEGRAMMENI) behaves inconsistently. We ignore this case, and skip this step.

  • Case-folding is a slightly different operation from lower-casing. (Case-folding splits some combined characters, so that, for example, “ß” can match both “ss” and “SS”.) However, Glk does not currently offer a case-folding function. We substitute glk_buffer_to_lower_case_uni().

  • I'm not sure why the spec recommends decomposition (glk_buffer_canon_decompose_uni()) rather than glk_buffer_canon_normalize_uni(). However, composed characters are the norm in source code, and therefore in compiled Inform game files. If we specified decomposition, the compiler would have to do extra work; also, the standard Inform dictionary table (with its fixed word length) would store fewer useful characters. Therefore, we substitute glk_buffer_canon_normalize_uni().
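The point about fixed word length can be illustrated with a small, self-contained C sketch. Everything here is an assumption made for illustration: DICT_WORD_LEN is a made-up width (not the value any particular compiler uses), letters_stored() is a hypothetical helper, and the combining-mark test only covers the range U+0300 to U+036F.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t glui32; /* normally supplied by glk.h */

/* Hypothetical dictionary width, chosen only for illustration. */
#define DICT_WORD_LEN 6

/* "theatre" with an acute e and a circumflex a, composed: 7 code points. */
const glui32 word_composed[] = {
    't', 'h', 0x00E9, 0x00E2, 't', 'r', 'e'
};

/* The same word, canonically decomposed: 9 code points. */
const glui32 word_decomposed[] = {
    't', 'h', 'e', 0x0301, 'a', 0x0302, 't', 'r', 'e'
};

/* Count the letters (non-combining characters) that survive when a
   word is cut off at DICT_WORD_LEN code points. Only the Combining
   Diacritical Marks block is treated as combining. */
size_t letters_stored(const glui32 *word, size_t numchars)
{
    size_t stored = numchars < DICT_WORD_LEN ? numchars : DICT_WORD_LEN;
    size_t letters = 0;
    size_t i;

    for (i = 0; i < stored; i++) {
        if (!(word[i] >= 0x0300 && word[i] <= 0x036F))
            letters++;
    }
    return letters;
}
```

With these inputs, letters_stored(word_composed, 7) is 6 and letters_stored(word_decomposed, 9) is 4: the decomposed form spends two of its six slots on combining marks.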

We may revisit these recommendations in future versions of the spec.