Variable-width Encoding - General Structure

General Structure

Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters are encoded as several units. As a result, a variable-width encoding has three sorts of units: singletons, each of which encodes a character on its own; lead units, which come first in a multiunit sequence; and trail units, which come afterwards in a multiunit sequence. Input and display software needs to know the structure of the multibyte encoding scheme, but other software generally does not need to know whether a pair of bytes represents two separate characters or just one.

For example, the four-character string "I♥NY" is encoded in UTF-8 like this (shown as hexadecimal byte values): 49 E2 99 A5 4E 59. Of the six units in that sequence, 49, 4E, and 59 are singletons (for I, N, and Y), E2 is a lead unit, and 99 and A5 are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.
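The classification above can be checked with a short Python sketch. The helper name `classify_utf8_unit` is made up for this illustration, and it looks only at the high bits of each byte; it does not validate that the byte is legal in UTF-8.

```python
def classify_utf8_unit(byte):
    """Classify one UTF-8 unit by its high bits (no validity checking)."""
    if byte < 0x80:
        return "singleton"  # bit pattern 0xxxxxxx
    if byte < 0xC0:
        return "trail"      # bit pattern 10xxxxxx
    return "lead"           # bit pattern 11xxxxxx


data = "I♥NY".encode("utf-8")
print(data.hex(" ").upper())  # the six units as hexadecimal byte values

for unit in data:
    print(f"{unit:02X} {classify_utf8_unit(unit)}")
```

Because the three ranges do not overlap, each unit can be classified without looking at its neighbours.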

UTF-8 makes it easy for a program to identify the three sorts of units, because each sort uses its own range of values. Older variable-width encodings are typically not so well designed: in them the trail and lead units may use the same values, and in some all three sorts use overlapping values. Where there is overlap, a text-processing application that deals with the variable-width encoding must scan the text from a point where the unit boundaries are known for certain, such as the beginning of the text, in order to identify the various units properly and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if the hexadecimal values DE, DF, E0, and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the two consecutive two-unit sequences DE DF and E0 E1, because the match straddles the boundary between them. There is also the danger that a single corrupted or lost unit may change the interpretation of a long run of multiunit sequences that follow it. In a variable-width encoding where the three sorts of units are disjoint, string searching always works without false positives, and (provided the decoder is well written) the corruption or loss of one unit corrupts only one character.
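The false positive and the damage-containment property can both be sketched in Python. The first part assumes a made-up encoding, not any real one: every byte below 80 is a singleton, and every byte from 80 upwards starts a two-unit sequence whose second unit may be any byte. The function names are likewise hypothetical. The second part uses Python's built-in UTF-8 decoder with `errors="replace"`.

```python
def char_boundaries(data):
    """Offsets at which a character starts, found by scanning from the start.

    Hypothetical encoding: a byte < 0x80 is a singleton; a byte >= 0x80 is a
    lead unit followed by exactly one trail unit of any value.
    """
    boundaries = set()
    i = 0
    while i < len(data):
        boundaries.add(i)
        i += 1 if data[i] < 0x80 else 2
    boundaries.add(len(data))  # the end of the text also counts as a boundary
    return boundaries


def naive_find_all(data, needle):
    """Every byte offset where needle occurs, ignoring character boundaries."""
    last = len(data) - len(needle)
    return [i for i in range(last + 1) if data[i:i + len(needle)] == needle]


def aligned_find_all(data, needle):
    """Only the matches that start and end on character boundaries."""
    bounds = char_boundaries(data)
    return [i for i in naive_find_all(data, needle)
            if i in bounds and i + len(needle) in bounds]


text = bytes.fromhex("DE DF E0 E1")    # two characters: DE DF and E0 E1
needle = bytes.fromhex("DF E0")        # a different character entirely

print(naive_find_all(text, needle))    # reports offset 1: a false positive
print(aligned_find_all(text, needle))  # reports no match

# Losing one unit in UTF-8: drop the first trail unit (99) of the heart.
damaged = bytes.fromhex("49 E2 A5 4E 59")
recovered = damaged.decode("utf-8", errors="replace")
print(recovered)  # I, N and Y survive; only the heart is replaced
```

The boundary-aware search has to walk the text from the start, which is the cost the overlapping design imposes. In UTF-8 the naive byte search is already safe, and the decoder can resynchronise at the next singleton or lead unit.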
