Universal Character Set - Encoding Forms of The Universal Character Set

Encoding Forms of The Universal Character Set

ISO 10646 defines several character encoding forms for the Universal Character Set. The simplest, UCS-2, uses a single code value (defined as one or more numbers representing a code point) between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP, as long as the code point represents a character. UCS-2 cannot represent code points outside the BMP. Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". UCS-16 does not exist; the authors who make this error usually intend to refer to UCS-2 or to UTF-16.

The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 disallows use of code values for these code points, but UTF-16 allows their use in pairs. Unicode also adopted UTF-16, but in Unicode terminology, the high-half zone elements become "high surrogates" and the low-half zone elements become "low surrogates".

Another encoding, UCS-4, uses a single code value between 0 and (theoretically) hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future assignments of characters will also take place in that range). UCS-4 allows representation of each value as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes it simple to manipulate, but of course it requires twice as much storage as UCS-2.

Currently, the dominant UCS encoding is UTF-8, which is a variable-width encoding designed for backward compatibility with ASCII, and for avoiding the complications of endianness and byte-order marks in UTF-16 and UTF-32. More than half of all Web pages are encoded in UTF-8. The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8. It is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.

See also Comparison of Unicode encodings.

Read more about this topic:  Universal Character Set

Famous quotes containing the words forms, universal, character and/or set:

    There is a continual exchange of ideas between all minds of a generation. Journalists, popular novelists, illustrators, and cartoonists adapt the truths discovered by the powerful intellects for the multitude. It is like a spiritual flood, like a gush that pours into multiple cascades until it forms the great moving sheet of water that stands for the mentality of a period.
    Auguste Rodin (1849–1917)

    So having said, a while he stood, expecting
    Their universal shout and high applause
    To fill his ear; when contrary, he hears,
    On all sides, from innumerable tongues
    A dismal universal hiss, the sound
    Of public scorn.
    John Milton (1608–1674)

    Eccentricity: strength of character doubling back on itself.
    Mason Cooley (b. 1927)

    Every woman is supposed to have the same set of motives, or else to be a monster.
    George Eliot [Mary Ann (or Marian)