A Note on Unicode Case-Folding and Normalization
With all of these Unicode transformations hovering about, an author might reasonably ask about the right way to handle line input.
Our recommendation is: call glk_buffer_to_lower_case_uni(), followed by glk_buffer_canon_normalize_uni(), and then parse the result.
The parsing process should of course match against strings that have been put through the same process.
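The recommended sequence can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: prepare_for_parsing() and words_match() are hypothetical helper names, and USE_GLK is a made-up build flag. When that flag is not defined, the sketch falls back on ASCII-only stand-ins for the two Glk calls so that it compiles without a Glk library; those stand-ins are placeholders and do not perform real Unicode lower-casing or normalization. The clamping step assumes that each call returns the full converted length, which can exceed the buffer length if the result was truncated.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#ifdef USE_GLK               /* hypothetical build flag, not part of Glk */
#include "glk.h"
#else
/* ASCII-only stand-ins so the sketch builds without a Glk library.
   These are NOT the real implementations. */
typedef uint32_t glui32;

static glui32 glk_buffer_to_lower_case_uni(glui32 *buf, glui32 len,
                                           glui32 numchars)
{
    (void)len;
    for (glui32 i = 0; i < numchars; i++)
        if (buf[i] >= 'A' && buf[i] <= 'Z')
            buf[i] += 'a' - 'A';
    return numchars;
}

static glui32 glk_buffer_canon_normalize_uni(glui32 *buf, glui32 len,
                                             glui32 numchars)
{
    (void)buf;
    (void)len;
    return numchars;         /* ASCII text is already normalized */
}
#endif

/* Hypothetical helper: lower-case, then normalize, a buffer in place.
   Returns the number of valid characters left in buf. */
static glui32 prepare_for_parsing(glui32 *buf, glui32 len, glui32 numchars)
{
    assert(numchars <= len);

    numchars = glk_buffer_to_lower_case_uni(buf, len, numchars);
    if (numchars > len)
        numchars = len;      /* result was truncated to fit the buffer */

    numchars = glk_buffer_canon_normalize_uni(buf, len, numchars);
    if (numchars > len)
        numchars = len;

    return numchars;
}

/* Hypothetical helper: compare two buffers that have BOTH been through
   prepare_for_parsing(). */
static int words_match(const glui32 *a, glui32 alen,
                       const glui32 *b, glui32 blen)
{
    return alen == blen && memcmp(a, b, alen * sizeof(glui32)) == 0;
}
```

In use, the player's line and every word it is matched against would each pass through prepare_for_parsing() before words_match() is called, so that both sides of the comparison have had the same treatment.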
The Unicode spec (chapter 3.13) gives a different, three-step process: decomposition, case-folding, and decomposition again. Our recommendation comes through a series of practical compromises:
The initial decomposition is only necessary because of a historical error in the Unicode spec: character 0x0345 (COMBINING GREEK YPOGEGRAMMENI) behaves inconsistently. We ignore this case, and skip this step.
Case-folding is a slightly different operation from lower-casing.
(Case-folding expands some single characters into sequences; for example, “ß” folds to “ss”, so that it can match both “ss” and “SS”.)
However, Glk does not currently offer a case-folding function.
We substitute glk_buffer_to_lower_case_uni().
I'm not sure why the spec recommends decomposition (glk_buffer_canon_decompose_uni()) rather than glk_buffer_canon_normalize_uni().
However, composed characters are the norm in source code, and therefore in compiled Inform game files.
If we specified decomposition, the compiler would have to do extra work; also, the standard Inform dictionary table (with its fixed word length) would store fewer useful characters.
Therefore, we substitute glk_buffer_canon_normalize_uni().
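The point about fixed word length can be illustrated with a small stand-alone sketch. It assumes a dictionary entry width of 9 code points (DICT_WORD_LEN is a value chosen for illustration) and uses a hypothetical helper, base_chars_stored(), that counts how many non-combining characters survive truncation. Only the U+0300 to U+036F combining block is recognized, which is enough for this example but not a general test for combining marks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DICT_WORD_LEN 9      /* assumed entry width, for illustration only */

/* "éléphants" in composed form: 9 code points. */
static const uint32_t composed[] = {
    0x00E9, 'l', 0x00E9, 'p', 'h', 'a', 'n', 't', 's'
};

/* The same word decomposed: each "é" becomes 'e' + U+0301,
   giving 11 code points. */
static const uint32_t decomposed[] = {
    'e', 0x0301, 'l', 'e', 0x0301, 'p', 'h', 'a', 'n', 't', 's'
};

/* Hypothetical helper: cut the word to DICT_WORD_LEN code points and
   count how many of the stored code points are base characters rather
   than combining marks in U+0300..U+036F. */
static size_t base_chars_stored(const uint32_t *word, size_t len)
{
    assert(word != NULL);

    size_t stored = len < DICT_WORD_LEN ? len : DICT_WORD_LEN;
    size_t count = 0;

    for (size_t i = 0; i < stored; i++)
        if (word[i] < 0x0300 || word[i] > 0x036F)
            count++;

    return count;
}
```

Under these assumptions, the composed spelling fits all nine of its letters into the entry, while the decomposed spelling spends two slots on combining accents and keeps only seven letters ("elephan" plus accents), so two words that differ only near the end become harder to tell apart.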
We may revisit these recommendations in future versions of the spec.