Unicode and HTML - HTML Document Characters

Web pages are typically HTML or XHTML documents. Both types of documents consist, at a fundamental level, of characters, which are graphemes and grapheme-like units, independent of how they manifest in computer storage systems and networks.

An HTML document is a sequence of Unicode characters. More specifically, HTML 4.0 documents are required to consist of characters in the HTML document character set: a character repertoire wherein each character is assigned a unique, non-negative integer code point. This set is defined in the HTML 4.0 DTD, which also establishes the syntax (allowable sequences of characters) that can produce a valid HTML document. The HTML document character set for HTML 4.0 consists of most, but not all, of the characters jointly defined by Unicode and ISO/IEC 10646: the Universal Character Set (UCS).

Like an HTML document, an XHTML document is a sequence of Unicode characters. However, an XHTML document is an XML document, which, while not having an explicit "document character" layer of abstraction, nevertheless relies upon a similar definition of permissible characters that covers most, but not all, of the Unicode/UCS character definitions. The sets used by HTML and XHTML/XML are slightly different, but these differences have little effect on the average document author.
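
To make "most, but not all" concrete for the XML side, the following is a minimal sketch of a check against the Char production of the XML 1.0 specification, which excludes most C0 control characters, the surrogate code points, and U+FFFE/U+FFFF. The function names is_xml_char and disallowed_chars are illustrative assumptions, not part of any standard library, and the sketch does not cover the separate HTML 4.0 document character set.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # excludes the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def disallowed_chars(text: str) -> list[str]:
    """List the code points in text that XML 1.0 does not permit."""
    return [f"U+{ord(c):04X}" for c in text if not is_xml_char(c)]


if __name__ == "__main__":
    # A vertical tab (U+000B) is a Unicode character but not an XML 1.0 Char.
    print(disallowed_chars("a\x0bb"))
```

Ordinary text never contains the excluded code points, which is consistent with the point that the differences rarely matter to authors.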

Regardless of whether the document is HTML or XHTML, when stored on a file system or transmitted over a network, the document's characters are encoded as a sequence of octets (bytes) according to a particular character encoding. This encoding may either be a Unicode Transformation Format, like UTF-8, that can directly encode any Unicode character, or a legacy encoding, like Windows-1252, that cannot. However, even when using encodings that do not support all Unicode characters, the encoded document may make use of numeric character references. For example, the numeric character reference &#x263A; stands for the smiling face character ☺ (U+263A) in the Unicode character set.
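
The difference between the two kinds of encoding can be sketched in Python using only the standard library. The helper names encode_with_references and decode_with_references are assumptions made for this illustration. The "xmlcharrefreplace" error handler writes decimal references (&#9786;), which are equivalent to the hexadecimal form &#x263A;. Note that html.unescape also expands named references such as &amp;, so this is a rough stand-in for what a browser does, not a full HTML parser.

```python
import html


def encode_with_references(text: str, encoding: str) -> bytes:
    """Encode text, replacing unencodable characters with numeric references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


def decode_with_references(data: bytes, encoding: str) -> str:
    """Decode bytes, then resolve character references back to characters."""
    return html.unescape(data.decode(encoding))


page = "Smile: \u263A"

# UTF-8 can represent U+263A directly, as three bytes.
utf8_bytes = encode_with_references(page, "utf-8")

# Windows-1252 has no smiling face, so a numeric reference is emitted instead.
legacy_bytes = encode_with_references(page, "windows-1252")

if __name__ == "__main__":
    print(utf8_bytes)    # b'Smile: \xe2\x98\xba'
    print(legacy_bytes)  # b'Smile: &#9786;'
    print(decode_with_references(legacy_bytes, "windows-1252") == page)  # True
```

Either byte sequence yields the same character once decoded and parsed; only the stored representation differs.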
