Unicode String Normalization

Unicode String Normalization — Combining characters

Includes

#include <libchimara/glk.h>

Description

Comparing Unicode strings is difficult, because there can be several ways to represent a piece of text as a Unicode string. For example, the one-character string “è” (an accented “e”) will be displayed the same as the two-character string containing “e” followed by Unicode character 0x0300 (COMBINING GRAVE ACCENT). These strings should be considered equal.
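The problem can be made concrete with a small sketch. It is only an illustration: the glui32 typedef is written out by hand as an assumption (a real program would get it from the Glk header), and same_code_points is a hypothetical helper, not part of the Glk API. It shows that a plain code-point comparison treats the two spellings of “è” as different strings.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t glui32;  /* assumption: stands in for the typedef in glk.h */

/* Two spellings of the same text, as UCS-4 arrays. */
const glui32 composed[]   = { 0x00E8 };          /* LATIN SMALL LETTER E WITH GRAVE */
const glui32 decomposed[] = { 0x0065, 0x0300 };  /* "e" + COMBINING GRAVE ACCENT */

/* Hypothetical helper: returns 1 if both arrays hold exactly the same
   code points in the same order, 0 otherwise. No normalization is done. */
int same_code_points(const glui32 *a, glui32 alen,
                     const glui32 *b, glui32 blen)
{
    assert(a != NULL && b != NULL);
    return alen == blen && memcmp(a, b, alen * sizeof *a) == 0;
}
```

Calling same_code_points(composed, 1, decomposed, 2) returns 0 even though both arrays display as the same text, which is why both strings need to be brought into the same normalized form before they are compared.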

Therefore, a Glk program that accepts line input should convert its text to a normalized form before parsing it. These functions offer those conversions. The algorithms are defined by the Unicode spec (chapter 3.7) and Unicode Standard Annex #15.

Functions

glk_buffer_canon_decompose_uni ()

glui32
glk_buffer_canon_decompose_uni (glui32 *buf,
                                glui32 len,
                                glui32 numchars);

This transforms a string into its canonical decomposition (“Normalization Form D”). Effectively, this takes apart multipart characters into their individual parts. For example, it would convert “è” (character 0xE8, an accented “e”) into the two-character string containing “e” followed by Unicode character 0x0300 (COMBINING GRAVE ACCENT). If a single character has multiple accent marks, they are also rearranged into a standard order.

Parameters

buf

A character array in UCS-4.

 

len

Available length of buf.

 

numchars

Number of characters in buf.

 

Returns

The number of characters in buf after decomposition.


glk_buffer_canon_normalize_uni ()

glui32
glk_buffer_canon_normalize_uni (glui32 *buf,
                                glui32 len,
                                glui32 numchars);

This transforms a string into its canonical decomposition and recomposition (“Normalization Form C”). Effectively, this takes apart multipart characters, and then puts them back together in a standard way. For example, this would convert the two-character string containing “e” followed by Unicode character 0x0300 (COMBINING GRAVE ACCENT) into the one-character string “è” (character 0xE8, an accented “e”).

The canon_normalize function includes decomposition as part of its implementation. You never have to call both functions on the same string.

Both of these functions are idempotent: applying one of them a second time to its own output leaves the string unchanged.

These functions provide two length arguments because a string of Unicode characters may expand when it is transformed. The len argument is the available length of the buffer; numchars is the number of characters in the buffer initially. (So numchars must be less than or equal to len. The contents of the buffer after numchars do not affect the operation.)

The functions return the number of characters after transformation. If this is greater than len, the characters in the array will be safely truncated at len, but the true count will be returned. (The contents of the buffer after the returned count are undefined.)
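One way to use the returned count is to retry with a larger buffer whenever the result did not fit. The following is a minimal sketch of that pattern, not a prescribed usage. The names normalize_fn, normalize_alloc and toy_decompose are hypothetical, and the glui32 typedef is an assumption standing in for the Glk header. toy_decompose is not a real normalizer: it only knows that character 0xE8 decomposes to 0x65 followed by 0x0300, and exists so that the snippet compiles without a Glk library while following the contract described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t glui32;  /* assumption: stands in for the typedef in glk.h */

/* Both normalization functions share this signature. */
typedef glui32 (*normalize_fn)(glui32 *buf, glui32 len, glui32 numchars);

/* Toy stand-in, NOT a real normalizer: only 0xE8 is decomposed.
   It mimics the documented contract: write at most len characters,
   return the untruncated count. */
glui32 toy_decompose(glui32 *buf, glui32 len, glui32 numchars)
{
    assert(numchars <= len);

    /* First pass: compute the full length of the result. */
    glui32 total = numchars;
    for (glui32 i = 0; i < numchars; i++)
        if (buf[i] == 0x00E8)
            total++;

    /* Second pass: fill back to front so unread input is never
       overwritten; positions at or beyond len are skipped. */
    glui32 out = total;
    for (glui32 i = numchars; i-- > 0; ) {
        glui32 ch = buf[i];
        if (ch == 0x00E8) {
            out--; if (out < len) buf[out] = 0x0300;
            out--; if (out < len) buf[out] = 0x0065;
        } else {
            out--; if (out < len) buf[out] = ch;
        }
    }
    return total;
}

/* Hypothetical helper: copies text into a malloc'd buffer, applies fn,
   and retries once with a buffer of the reported size if the result was
   truncated. Returns the buffer (caller frees) or NULL on allocation
   failure; *count receives the number of characters in the result. */
glui32 *normalize_alloc(normalize_fn fn, const glui32 *text,
                        glui32 numchars, glui32 *count)
{
    assert(fn != NULL && count != NULL);
    glui32 len = numchars ? numchars : 1;

    for (;;) {
        glui32 *buf = malloc(len * sizeof *buf);
        if (buf == NULL)
            return NULL;
        if (numchars > 0)
            memcpy(buf, text, numchars * sizeof *buf);

        glui32 n = fn(buf, len, numchars);
        if (n <= len) {          /* everything fit */
            *count = n;
            return buf;
        }
        free(buf);               /* truncated: start over with room for n */
        len = n;
    }
}
```

In a real Glk program, glk_buffer_canon_decompose_uni or glk_buffer_canon_normalize_uni would be passed as fn in place of the toy function. Starting over from the original text, rather than reusing the truncated buffer, avoids depending on the contents of a buffer that was cut off at len.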

The Unicode spec also defines stronger forms of these functions, called “compatibility decomposition and recomposition” (“Normalization Form KD” and “Normalization Form KC”). These do all of the accent-mangling described above, but they also transform many other obscure Unicode characters into more familiar forms. For example, they split ligatures apart into separate letters. They also convert Unicode display variations such as script letters, circled letters, and half-width letters into their common forms.

The Glk spec does not currently provide these stronger transformations. Glk's expected use of Unicode normalization is for line input, and an OS facility for line input will generally not produce these alternate character forms (unless the user goes out of their way to type them). Therefore, the need for these transformations does not seem to be worth the extra data table space.

Parameters

buf

A character array in UCS-4.

 

len

Available length of buf.

 

numchars

Number of characters in buf.

 

Returns

The number of characters in buf after normalization.