Encoding Forms of The Universal Character Set
ISO 10646 defines several character encoding forms for the Universal Character Set. The simplest, UCS-2, uses a single code value (defined as one or more numbers representing a code point) between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP, as long as the code point represents a character. UCS-2 cannot represent code points outside the BMP. Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". UCS-16 does not exist; the authors who make this error usually intend to refer to UCS-2 or to UTF-16.
The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 disallows use of code values for these code points, but UTF-16 allows their use in pairs. Unicode also adopted UTF-16, but in Unicode terminology, the high-half zone elements become "high surrogates" and the low-half zone elements become "low surrogates".
Another encoding, UCS-4, uses a single code value between 0 and (theoretically) hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future assignments of characters will also take place in that range). UCS-4 allows representation of each value as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes it simple to manipulate, but of course it requires twice as much storage as UCS-2.
Currently, the dominant UCS encoding is UTF-8, which is a variable-width encoding designed for backward compatibility with ASCII, and for avoiding the complications of endianness and byte-order marks in UTF-16 and UTF-32. More than half of all Web pages are encoded in UTF-8. The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8. It is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.
See also Comparison of Unicode encodings.
Read more about this topic: Universal Character Set
Famous quotes containing the words forms, universal, character and/or set:
“There is a continual exchange of ideas between all minds of a generation. Journalists, popular novelists, illustrators, and cartoonists adapt the truths discovered by the powerful intellects for the multitude. It is like a spiritual flood, like a gush that pours into multiple cascades until it forms the great moving sheet of water that stands for the mentality of a period.”
—Auguste Rodin (18491917)
“So having said, a while he stood, expecting
Their universal shout and high applause
To fill his ear; when contrary, he hears,
On all sides, from innumerable tongues
A dismal universal hiss, the sound
Of public scorn.”
—John Milton (16081674)
“Eccentricity: strength of character doubling back on itself.”
—Mason Cooley (b. 1927)
“Every woman is supposed to have the same set of motives, or else to be a monster.”
—George Eliot [Mary Ann (or Marian)