Unicode and HTML - HTML Document Characters

Web pages are typically HTML or XHTML documents. Both types of documents consist, at a fundamental level, of characters, which are graphemes and grapheme-like units, independent of how they manifest in computer storage systems and networks.

An HTML document is a sequence of Unicode characters. More specifically, HTML 4.0 documents are required to consist of characters in the HTML document character set: a character repertoire wherein each character is assigned a unique, non-negative integer code point. This set is defined in the HTML 4.0 DTD, which also establishes the syntax (allowable sequences of characters) that can produce a valid HTML document. The HTML document character set for HTML 4.0 consists of most, but not all, of the characters jointly defined by Unicode and ISO/IEC 10646: the Universal Character Set (UCS).
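The idea that each character in the repertoire is assigned a unique, non-negative integer code point can be illustrated with a minimal Python sketch. The helper name `describe` is hypothetical, and the names come from Python's bundled Unicode database rather than from the HTML 4.0 specification itself.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    # ord() gives the character's code point as a non-negative integer.
    code_point = ord(ch)
    # unicodedata.name() looks the character up in Python's Unicode database;
    # the second argument is a fallback for characters that have no name.
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{code_point:04X} {name}"


print(describe("A"))
print(describe("☺"))
```

The code point is the same no matter how the character is later stored or transmitted, which is the sense in which the document character set is independent of any byte-level encoding.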

Like an HTML document, an XHTML document is a sequence of Unicode characters. However, an XHTML document is an XML document, and XML, while it has no explicit "document character" layer of abstraction, nevertheless relies on a similar definition of permissible characters that covers most, but not all, of the Unicode/UCS character definitions. The sets used by HTML and XHTML/XML are slightly different, but these differences have little effect on the average document author.
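As a rough illustration of what "most, but not all" means on the XML side, the sketch below checks a character against the `Char` production of XML 1.0, which excludes most C0 control characters, the surrogate range, and the code points U+FFFE and U+FFFF. The function name `is_xml10_char` is hypothetical, and the sketch does not model the slightly different HTML 4.0 set.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if ch is permitted by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # excludes the remaining C0 controls
        or 0xE000 <= cp <= 0xFFFD      # skips surrogates, U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


for sample in ("☺", "\t", "\x00", "\ufffe"):
    print(f"U+{ord(sample):04X}", is_xml10_char(sample))
```

Ordinary text, including the smiling face, passes the check, while the null character and U+FFFE do not. This is why the excluded code points rarely matter to a typical author.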

Regardless of whether the document is HTML or XHTML, when it is stored on a file system or transmitted over a network, its characters are encoded as a sequence of octets (bytes) according to a particular character encoding. This encoding may be either a Unicode Transformation Format, like UTF-8, that can directly encode any Unicode character, or a legacy encoding, like Windows-1252, that cannot. However, even when the encoding does not support all Unicode characters, the encoded document may still represent them with numeric character references. For example, the reference &#x263A; stands for the smiling face character (☺), whose Unicode code point is U+263A.
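The difference between the two kinds of encoding can be sketched in Python. The example assumes Python's `cp1252` codec as a stand-in for Windows-1252 and uses the standard `xmlcharrefreplace` error handler, which emits decimal rather than hexadecimal numeric character references. The variable names are illustrative only.

```python
import html

text = "Smile: ☺"

# UTF-8 can encode the smiling face directly, as three bytes.
utf8_bytes = text.encode("utf-8")

# Windows-1252 has no byte for U+263A, so the error handler substitutes
# a decimal numeric character reference that the encoding can represent.
legacy_bytes = text.encode("cp1252", errors="xmlcharrefreplace")

# html.unescape() resolves numeric character references back to characters,
# roughly as an HTML parser would when reading the document.
recovered = html.unescape(legacy_bytes.decode("cp1252"))

print(utf8_bytes)      # b'Smile: \xe2\x98\xba'
print(legacy_bytes)    # b'Smile: &#9786;'
print(recovered == text)
```

The decimal reference `&#9786;` and the hexadecimal reference `&#x263A;` name the same code point, so either form recovers the original character.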
