Mapping of Unicode Characters

Unicode’s Universal Character Set (UCS) has the capacity to support over 1 million characters. Each UCS character is mapped to a code point, an integer between 0 and 1,114,111 that represents the character within the internal logic of text-processing software. The code space therefore holds 1,114,112 code points (2^20 + 2^16, or 17 × 2^16, or hexadecimal 110000).
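The arithmetic behind the size of the code space can be checked directly. The following is a minimal Python sketch; the constant names are illustrative and do not come from any Unicode specification.

```python
import sys

# Highest code point and total size of the code space (illustrative names).
MAX_CODE_POINT = 0x10FFFF
CODE_SPACE_SIZE = MAX_CODE_POINT + 1

# The equivalent ways of writing the size given in the text.
assert CODE_SPACE_SIZE == 1_114_112
assert CODE_SPACE_SIZE == 2**20 + 2**16
assert CODE_SPACE_SIZE == 17 * 2**16
assert CODE_SPACE_SIZE == 0x110000

# CPython 3.3 and later uses the same upper bound for its str type.
assert sys.maxunicode == MAX_CODE_POINT

# ord() maps a character to its code point; chr() is the inverse.
euro = ord("\u20ac")      # EURO SIGN
print(hex(euro))          # prints 0x20ac
assert chr(euro) == "\u20ac"
```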

As of Unicode 6.2, released in September 2012, 249,764 (22.4%) of these code points are assigned, including 110,182 (9.9%) encoded characters, 137,468 (12.3%) reserved for private use, 2,048 for surrogates, and 66 designated noncharacters, leaving 864,348 (77.6%) unassigned. The number of encoded characters is made up as follows:

  • 109,976 graphical characters (some of which are invisible, but are still counted as graphical)
  • 206 special purpose characters for control and formatting.

(See the summary table for a more detailed breakdown).
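The Unicode 6.2 figures quoted above are internally consistent, which a short tally confirms. This is only a sketch that re-adds the numbers from the text; the dictionary keys and helper name are hypothetical labels chosen for the example.

```python
# Total number of code points in the code space.
TOTAL = 0x110000  # 1,114,112

# Counts as quoted in the text for Unicode 6.2 (keys are illustrative labels).
counts = {
    "graphic": 109_976,
    "special_purpose": 206,
    "private_use": 137_468,
    "surrogates": 2_048,
    "noncharacters": 66,
}

encoded = counts["graphic"] + counts["special_purpose"]
assigned = sum(counts.values())
unassigned = TOTAL - assigned


def percent(n: int) -> float:
    """Share of the whole code space, rounded to one decimal place."""
    return round(100 * n / TOTAL, 1)


print(encoded, percent(encoded))        # prints 110182 9.9
print(assigned, percent(assigned))      # prints 249764 22.4
print(unassigned, percent(unassigned))  # prints 864348 77.6
```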

Unicode characters can be categorized in many ways. Every character is assigned to a script or classed as a symbol, though many are assigned to the Common script, or to the Inherited script, in which case they take on the script of the adjacent character. In Unicode, a script is a coherent writing system that includes letters and may also include script-specific punctuation, diacritics and other marks, numerals, and symbols. A single script supports one or more languages. Symbols, including control characters, are relevant for their meaning rather than for any sound they represent.

Characters are assigned in blocks. A block is a single contiguous range of code points. Every character is also assigned a general category and a subcategory. The general categories are letter, mark, number, punctuation, symbol, separator, and other; the last of these includes control and formatting (in other words, non-graphical) characters.
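To illustrate, Python's standard `unicodedata` module reports the general category as a two-letter code: the first letter gives the major class and the second the subcategory. The sketch below uses it to classify a few sample characters. The `MAJOR_CLASSES` table and the `describe` helper are hypothetical names written for this example, and the results depend on the Unicode database bundled with the Python build in use.

```python
import unicodedata

# First letter of the two-letter code -> major class (illustrative table).
MAJOR_CLASSES = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "other (control, format, unassigned, ...)",
}


def describe(ch: str) -> tuple[str, str]:
    """Return (two-letter category code, major class name) for one character."""
    code = unicodedata.category(ch)
    return code, MAJOR_CLASSES[code[0]]


# Letter, combining acute accent, digit, comma, plus sign, space, line feed.
for ch in ["A", "\u0301", "7", ",", "+", " ", "\n"]:
    code, major = describe(ch)
    print(f"U+{ord(ch):04X}  {code}  {major}")
```

For these samples the codes are `Lu`, `Mn`, `Nd`, `Po`, `Sm`, `Zs`, and `Cc` respectively.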

Blocks are in turn grouped into planes. Most characters are currently assigned to the first plane, the Basic Multilingual Plane. This eases the transition for legacy software, since any code point in the Basic Multilingual Plane can be addressed with just two octets (16 bits). Characters outside the first plane usually have very specialized or rare uses.
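Because each plane spans 2^16 code points, the plane number is simply the code point shifted right by 16 bits. The following sketch shows this, along with the two-octet property of the first plane as seen in UTF-16; the helper names `plane_of` and `in_bmp` are hypothetical and used only for illustration.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # each plane holds 0x10000 code points


def in_bmp(code_point: int) -> bool:
    """True when the code point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


# Latin letter, euro sign, and an emoji from outside the first plane.
for ch in ["A", "\u20ac", "\U0001F600"]:
    cp = ord(ch)
    utf16_octets = len(ch.encode("utf-16-be"))
    print(f"U+{cp:04X}  plane {plane_of(cp)}  {utf16_octets} octets in UTF-16")
```

The first two samples sit in plane 0 and take two octets each in UTF-16, while U+1F600 sits in plane 1 and needs four octets, encoded as a surrogate pair.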

The first 256 code points correspond to those of ISO 8859-1, the most popular 8-bit character encoding in the Western world. Because ISO 8859-1 itself extends ASCII, the first 128 code points are also identical to ASCII. Although Unicode treats these two ranges as Latin script blocks, they contain many characters that are commonly useful outside the Latin script. In general, not all characters in a given block need be of the same script, and a given script can occur in several different blocks.
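This correspondence can be observed with Python's built-in `latin-1` and `ascii` codecs, as in the minimal sketch below. The variable names are illustrative, and the character names printed come from whatever Unicode database ships with the Python build in use.

```python
import unicodedata

# Decoding every possible byte as ISO 8859-1 yields code points 0 to 255 in order.
latin1_text = bytes(range(256)).decode("latin-1")
assert [ord(c) for c in latin1_text] == list(range(256))

# The lower half is identical to ASCII.
ascii_text = bytes(range(128)).decode("ascii")
assert ascii_text == latin1_text[:128]

# Not every character in these two ranges is a Latin letter.
for ch in ["A", "\u00e9", "\u00d7", "\u00a7"]:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

The loop prints the names LATIN CAPITAL LETTER A, LATIN SMALL LETTER E WITH ACUTE, MULTIPLICATION SIGN, and SECTION SIGN. The last two are not specific to the Latin script.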
