XML - Well-formedness and Error-handling

The XML specification defines an XML document as a text that is well-formed, i.e., it satisfies a list of syntax rules provided in the specification. The list is fairly lengthy; some key points are:

  • It contains only properly encoded legal Unicode characters.
  • None of the special syntax characters such as "<" and "&" appear except when performing their markup-delineation roles.
  • The start-tags, end-tags, and empty-element tags that delimit the elements are correctly nested, with none missing and none overlapping.
  • Element names are case-sensitive; the start-tag and end-tag of an element must match exactly. Tag names cannot contain any of the characters !"#$%&'*+,/;<=>?@^`{|}~, nor a space character, and cannot start with -, ., or a numeric digit.
  • There is a single "root" element that contains all the other elements.
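
Several of the rules above can be observed with an off-the-shelf parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module as the XML processor; the helper name is_well_formed and the sample strings are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser rejected the input: a well-formedness rule was broken.
        return False
    return True


# Hypothetical sample inputs, each exercising one rule from the list above.
samples = {
    "nested correctly": "<a><b>text</b></a>",
    "escaped ampersand": "<a>fish &amp; chips</a>",
    "bare ampersand": "<a>fish & chips</a>",
    "overlapping tags": "<a><b></a></b>",
    "case mismatch": "<Note></note>",
    "name starts with digit": "<1a/>",
    "two root elements": "<a/><b/>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first two samples should be accepted; each of the others breaks one of the listed rules and is rejected by the parser.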

The definition of an XML document excludes texts that violate the well-formedness rules; such texts are simply not XML. An XML processor that encounters a violation is required to report the error and to cease normal processing. This policy, occasionally referred to as draconian, stands in notable contrast to the behavior of programs that process HTML, which are designed to produce a reasonable result even in the presence of severe markup errors. XML's policy in this area has been criticized as a violation of Postel's law ("Be conservative in what you send; be liberal in what you accept").
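
The contrast between the two policies can be sketched in a few lines. This example assumes Python's standard library, with xml.etree.ElementTree standing in for a conforming XML processor and html.parser.HTMLParser standing in for a forgiving HTML processor; the names parse_strict, TagCollector, and lenient_start_tags are hypothetical helpers, not part of either module.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def parse_strict(text):
    """Return (root, None) on success, or (None, (message, line)) on failure."""
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        # The error is reported and no partial tree is returned.
        line, _column = err.position
        return None, (str(err), line)


class TagCollector(HTMLParser):
    """Records the start tags seen while reading HTML, however malformed."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def lenient_start_tags(text):
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.start_tags


# The end tag on line 3 is misspelled, so the document is not well-formed.
broken_xml = "<root>\n  <item>one</item>\n  <item>two</itm>\n</root>"
root, error = parse_strict(broken_xml)
print(root, error)

# The <b> element is never closed, yet the HTML parser carries on.
print(lenient_start_tags("<p>one<b>two</p>"))
```

The XML parse yields no tree at all, only an error message and the line where it occurred, while the HTML parser reads through the unclosed element without raising an exception.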

A valid XML document is defined in the XML specification as a well-formed XML document that also conforms to the rules of a Document Type Definition (DTD). By extension, the term is also used to refer to documents that conform to rules in other schema languages, such as XML Schema (XSD). Validity should not be confused with well-formedness, which requires only that the document have correct XML syntax as defined in the W3C specification.
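
The difference between the two terms can be illustrated with a document that is well-formed but not valid. The sketch below assumes Python's standard xml.etree.ElementTree module, which is a non-validating parser: it checks syntax but does not enforce DTD rules. The note document and its DTD are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the DTD requires <note> to contain <to> followed
# by <body>, but the <to> element is missing, so the document is not valid.
document = """<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note><body>See you at noon.</body></note>"""

# A non-validating parser checks only well-formedness, so parsing succeeds.
root = ET.fromstring(document)
print(root.tag)
print(root.find("to"))
```

The parser accepts the text because its syntax is correct. Detecting the missing element would require a validating processor that checks the content against the DTD.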
