Document Type Definition - XML DTDs and Schema Validation

XML DTDs and Schema Validation

The XML DTD syntax is one of several XML schema languages. However, many of the schema languages do not fully replace the XML DTD. Notably, the XML DTD allows defining entities and notations that have no direct equivalents in DTD-less XML (because internal entities and parsable external entities are not part of XML schema languages, and because other unparsed external entities and notations have no simple equivalent mappings in most XML schema languages).

Most XML schema languages are only replacements for element declarations and attribute list declarations, in such a way that it becomes possible to parse XML documents with non-validating XML parsers (if the only purpose of the external DTD subset was to define the schema). In addition, documents for these XML schema languages have to be parsed separately, so validating the schema of XML documents in pure standalone mode is not really possible with these languages: the document type declaration remains necessary for at least identifying (with a XML Catalog) the schema used in the parsed XML document and that will be validated in another language.

A common misconception holds that a non-validating XML parser does not have to read document type declarations, when in fact, the document type declarations must still be scanned for correct syntax as well as validity of declarations, and the parser must still parse all entity declarations in the internal subset, and substitute the replacement texts of internal entities occurring anywhere in the document type declaration or in the document body.

A non-validating parser may, however, elect not to read parsable external entities (including the external subset), and does not have to honor the content model restrictions defined in element declarations and in attribute list declarations.

If the XML document depends on parsable external entities (including the specified external subset, or parsable external entities declared in the internal subset), it should assert standalone="no" in its XML declaration. Identification of the validating DTD may be performed by the use of XML Catalogs, in order to retrieve its specified external subset.

In the example below, the XML document is declared with standalone="no" because it has an external subset in its document type declaration:

If the XML document type declaration includes any SYSTEM identifier for the external subset, it can't be safely processed as standalone: the URI should be retrieved, otherwise there may be unknown named character entities whose definition may be needed to correctly parse the effective XML syntax in the internal subset or in the document body (the XML syntax parsing is normally performed after the substitution of all named entities, excluding the five entities that are predefined in XML and that are implicitly substituted after parsing the XML document into lexical tokens). If it just includes any PUBLIC identifier, it may be processed as standalone, if the XML processor knows this PUBLIC identifier in its local catalog from where it can retrieve an associated DTD entity.

Read more about this topic:  Document Type Definition