1. Introduction
Some characters are treated specially when processing XML documents. These are the characters used for XML markup; when they appear as part of a document's content rather than as markup, they need to be escaped using the corresponding predefined entity. These entities and characters are:

    Entity    Character    Description
    &quot;    "            Double quote
    &apos;    '            Single quote
    &lt;      <            Left angle bracket
    &gt;      >            Right angle bracket
    &amp;     &            Ampersand
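As a rough illustration of the table above, the sketch below escapes and unescapes all five characters using Python's standard `xml.sax.saxutils` module. The helper names `escape_all` and `unescape_all` are made up for this example; `escape()` itself only handles `&`, `<` and `>` unless the quote characters are passed in explicitly.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > by default; the two quote characters are
# supplied through the optional entities mapping.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}


def escape_all(text):
    """Replace all five special characters with their predefined entities."""
    return escape(text, QUOTES)


def unescape_all(text):
    """Reverse escape_all()."""
    return unescape(text, UNQUOTES)


raw = """<a href="x">Tom & Jerry's</a>"""
encoded = escape_all(raw)
print(encoded)
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&apos;s&lt;/a&gt;
print(unescape_all(encoded) == raw)
```

Escaping every character everywhere is more than XML strictly requires, as the following sections show, but it is always safe.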
2. Character Data
All text that is not markup constitutes the character data of the document. Within character data, "&" and "<" must not appear literally except as part of markup; they must be written as "&amp;" and "&lt;". However, ">", the double quote (") and the single quote (') can appear directly within character data without being encoded.

    <?xml version="1.0"?>
    <valid>"'></valid>

    <?xml version="1.0"?>
    <invalid>&</invalid>

    <?xml version="1.0"?>
    <invalid><</invalid>
However, ">" cannot appear as part of the sequence "]]>" unless that sequence marks the end of a CDATA section.

    <?xml version="1.0"?>
    <invalid>]]></invalid>

The following is valid because ">" has been encoded as "&gt;":

    <?xml version="1.0"?>
    <valid>]]&gt;</valid>
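One way to check the character data examples above is to hand them to a parser. The sketch below assumes Python's standard `xml.etree.ElementTree` module; `is_well_formed` is a helper name introduced here for illustration only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<valid>\"'></valid>",     # quotes and > are fine in character data
    "<invalid>&</invalid>",    # bare ampersand
    "<invalid><</invalid>",    # bare left angle bracket
    "<invalid>]]></invalid>",  # "]]>" outside a CDATA section
    "<valid>]]&gt;</valid>",   # the same text with > encoded
]
for sample in samples:
    print(sample, is_well_formed(sample))
```

After parsing, the encoded form is turned back into the original text, so `ET.fromstring("<valid>]]&gt;</valid>").text` yields the three characters `]]>`.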
3. Attributes
Within attribute values, ">" is valid; "<" and "&" must still be encoded as "&lt;" and "&amp;".

    <valid attrib=">"></valid>

However, a double quote (") is not valid within a double-quoted attribute value; it must be encoded as "&quot;".

    <valid attrib="&quot;"></valid>

Similarly, a single quote (') is not valid within a single-quoted attribute value. It must be encoded as "&apos;":

    <valid attrib='&apos;'></valid>
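When attribute values are generated by a program, the quoting rules above can be delegated to a library. The following sketch assumes Python's standard `xml.sax.saxutils.quoteattr`; the `make_element` helper is hypothetical. Note that `quoteattr` prefers to switch to the other quote style where it can, and falls back to "&quot;" only when the value contains both kinds of quote.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def make_element(value):
    """Build a <valid> element whose attrib value is safely quoted."""
    # quoteattr() returns the value together with its surrounding quotes.
    return "<valid attrib=%s></valid>" % quoteattr(value)


for value in [">", '"', "'", "\"both' kinds", "a < b & c"]:
    xml_text = make_element(value)
    # Parse the result back and confirm the attribute survived unchanged.
    round_trip = ET.fromstring(xml_text).get("attrib")
    print(xml_text, round_trip == value)
```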
4. Comments
Comments can appear anywhere in a document outside of other markup. Within comments, none of the five special characters needs to be escaped; entity references are not recognized there and are treated as literal text.

    <valid><!-- '"<>& --></valid>

However, the string "--" (two consecutive hyphens) must not appear within comments.

    <invalid><!-- Hello -- there --></invalid>

A consequence of this rule is that a comment must not end with "--->" (three hyphens followed by a right angle bracket). The following is invalid:

    <invalid><!-- Hello there ---></invalid>
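The two comment rules above can be expressed as a small check on the comment text. This is a sketch only: `is_safe_comment_text` and `parses` are names invented for this example, and the parser (Python's standard `xml.etree.ElementTree`) is used merely to cross-check the rule.

```python
import xml.etree.ElementTree as ET


def is_safe_comment_text(text):
    """True if text can be placed between <!-- and --> unchanged."""
    # No double hyphen inside, and no trailing hyphen that would
    # combine with the closing "-->" to form "--->".
    return "--" not in text and not text.endswith("-")


def parses(xml_text):
    """True if the parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


for text in [" '\"<>& ", " Hello -- there ", " Hello there -"]:
    doc = "<root><!--%s--></root>" % text
    print(repr(text), is_safe_comment_text(text), parses(doc))
```

A check like this is useful because some serializers write comment text out as-is without validating it.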
5. Processing Instructions
Processing instructions (PIs) are used to pass instructions to the applications that process an XML document. None of the five special characters needs to be encoded within a processing instruction.

    <valid><?execute <>"'& ?></valid>

A further restriction in the case of processing instructions is that the target name must not be the string "xml" in any combination of upper and lower case; this name is reserved by the XML specification itself.

    <invalid><?xml hello ?></invalid>
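To see how a parser treats the two processing instructions above, the sketch below assumes Python's standard `xml.dom.minidom`, which exposes the target and data of each processing instruction. The variable names are illustrative.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# The special characters reach the application untouched as PI data.
doc = minidom.parseString("<valid><?execute <>\"'& ?></valid>")
pi = doc.documentElement.firstChild
print(pi.target)
print(repr(pi.data))

# A PI whose target is the reserved name "xml" is rejected when it
# appears anywhere other than as the XML declaration.
try:
    minidom.parseString("<invalid><?xml hello ?></invalid>")
    reserved_rejected = False
except ExpatError:
    reserved_rejected = True
print(reserved_rejected)
```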
6. CDATA
CDATA sections are used to escape blocks of text containing characters which would otherwise be recognized as markup. These sections begin with the string "<![CDATA[" and end with the string "]]>". Within a CDATA section, none of the five special characters needs to be encoded.

    <valid><![CDATA[<greeting>Hello world: <>&'" </greeting>]]></valid>

However, within CDATA sections, the string "]]>" must not appear except to end the section:

    <invalid><![CDATA[This string("]]>") must not appear here]]></invalid>

The same text can be written using two CDATA sections, with the "]]>" sequence split between them so that the first section ends after "]]" and the second begins with ">":

    <valid><![CDATA[This string("]]]]><![CDATA[>") must not appear here]]></valid>
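The splitting technique above can be automated. The following is a minimal sketch, with `to_cdata` being a hypothetical helper name; it assumes the text contains only characters that are legal in XML, and uses Python's standard `xml.etree.ElementTree` to confirm that the original text comes back after parsing.

```python
import xml.etree.ElementTree as ET


def to_cdata(text):
    """Wrap text in CDATA, splitting every "]]>" across two sections."""
    # "]]>" becomes "]]" + end of section + start of new section + ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


text = 'This string("]]>") must not appear here'
xml_text = "<valid>" + to_cdata(text) + "</valid>"
print(xml_text)

# The parser joins adjacent CDATA sections back into one text value.
print(ET.fromstring(xml_text).text == text)
```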
7. Conclusion
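This article described the five predefined XML entities and the contexts (character data, attributes, comments, processing instructions and CDATA sections) in which the corresponding characters must, or need not, be escaped.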