Bi-directional Text - Unicode Support

Unicode Support

Bidirectional script support is the capability of a computer system to correctly display bi-directional text. The term is often shortened to the jargon term BiDi or bidi.

Early computer installations were designed only to support a single writing system, typically for left-to-right scripts based on the Latin alphabet only. Adding new character sets and character encodings enabled a number of other left-to-right scripts to be supported, but did not easily support right-to-left scripts such as Arabic or Hebrew, and mixing the two was not practical. Right-to-left scripts were introduced through encodings like ISO/IEC 8859-6 and ISO/IEC 8859-8, storing the letters (usually) in writing and reading order. It is possible to simply flip the left-to-right display order to a right-to-left display order, but doing this sacrifices the ability to correctly display left-to-right scripts. With bidirectional script support, it is possible to mix scripts from different scripts on the same page, regardless of writing direction.

In particular, the Unicode standard provides foundations for complete BiDi support, with detailed rules as to how mixtures of left-to-right and right-to-left scripts are to be encoded and displayed.

In Unicode encoding, all non-punctuation characters are stored in writing order. This means that the writing direction of characters is stored within the characters. If this is the case, the character is called "strong". Punctuation characters however, can appear in both LTR and RTL scripts. They are called "weak" characters because they do not contain any directional information. So it is up to the software to decide in which direction these "weak" characters will be placed. Sometimes (in mixed-directions text) this leads to display errors caused by the BiDi algorithm, which examines the text, identifies LTR and RTL strong characters and assigns a direction to weak characters, according to certain rules.

In the algorithm, each sequence of concatenated strong characters is called a "run". A weak character that is located between two strong characters with the same orientation will inherit their orientation. A weak character that is located between two strong characters with a different writing direction, will inherit the main context's writing direction (in an LTR document the character will become LTR, in an RTL document, it will become RTL). If a "weak" character is followed by another "weak" character, the algorithm will look at the first neighbouring "strong" character. Sometimes this leads to unintentional display errors. These errors are corrected or prevented with "pseudo-strong" characters. Such Unicode control characters are called marks. The mark (U+200E left-to-right mark (HTML: ‎ ‎ LRM) or U+200F right-to-left mark (HTML: ‏ ‏ RLM)) is to be inserted into a location to make an enclosed weak character inherit its writing direction.

For example, to correctly display the U+2122 ™ trade mark sign for an English name brand (LTR) in an Arabic (RTL) passage, an LRM mark is inserted after the trademark symbol if the symbol is not followed by LTR text. If the LRM mark is not added, the weak character ™ will be neighbored by a strong LTR character and a strong RTL character. Hence, in an RTL context, it will be considered to be RTL, and displayed in an incorrect order.

Possible BiDi-types of a character, to be used by the BiDi algorithm, are: collapsed

Read more about this topic:  Bi-directional Text

Famous quotes containing the word support:

    There are so many intellectual and moral angels battling for rationalism, good citizenship, and pure spirituality; so many and such eminent ones, so very vocal and authoritative! The poor devil in man needs all the support and advocacy he can get. The artist is his natural champion. When an artist deserts to the side of the angels, it is the most odious of treasons.
    Aldous Huxley (1894–1963)