What Characters Need to be Escaped in XML Documents?

1. Introduction

Some characters are treated specially when processing XML documents. These are the characters which are used to markup XML syntax; when they appear as a part of a document rather than for syntax markup, they need to be appropriately escaped. These characters are:

"	"		Double quote
'	'		Single quote
<	&lt;		Left angle bracket
>	&gt;		Right angle bracket
&	&amp;		Ampersand

2. Character Data

All text that is not markup constitutes character data of the document. Within character data, “&” and “<” must not appear except when used in markup. However “>”, “”” and “‘” can appear directly within character data without having to be encoded.

<?xml version="1.0"?>
<valid>"'></valid>
<?xml version="1.0"?>
<invalid>&</invalid>
<?xml version="1.0"?>
<invalid><</invalid>

However, “>” cannot appear in the form “]]>” unless it is a part of CDATA ending sections.

<?xml version="1.0"?>
<invalid>]]></invalid>

The following is valid as “>” has been encoded properly:

<?xml version="1.0"?>
<valid>]]&gt;</valid>

2. Attributes

Within attributes, “>” is valid.

<valid attrib=">"></valid>

However, ‘”‘ is not valid within double quotes; it must be encoded using “&quot;”.

<valid attrib="&quot;"></valid>

Similarly “‘” is not valid within single quotes. It must be encoded as “&apos;”:

<valid attrib='&apos;'></valid>

3. Comments

Comments can appear anywhere in a document outside of markup. Within comments, none of the 5 special characters must be escaped or encoded.

<valid><!-- '"<>& --></valid>

In addition, the string “–” must not appear within comments.

<invalid><!-- Hello -- there --></invalid>

A consequence of this rule is that a comment must not end with “—>” (three dashes followed by a right-angle-bracket). The following is invalid:

<invalid><!-- Hello there ---></invalid>

4. Processing Instructions

Processing Instructions (PI) are used to add instructions for XML processing applications. None of the 5 special characters must be encoded within PI statements.

<valid><?execute <>"'& ?></valid>

A further restriction in the case of processing instructions is that the instruction name must not be the string “xml”; this name is reserved for standardization of the XML specification itself.

<invalid><?xml hello ?></invalid>

CData

CDATA sections are used to escape blocks of text containing characters which would otherwise be recognized as markup. This section begin with the string “<![CDATA[” and end with the string “]]>”. Within a CData section, none of the 5 special characters must be encoded.

<valid><![CDATA[[<greeting>Hello world: <>&'" </greeting>]]></valid>

However, within CData sections, the string “]]>” must not appear except to end the section:

<invalid><![CDATA[[This string("]]>") must not appear here]]></invalid>

The same can be re-written with two CData sections as follows:

<valid><![CDATA[[This string("]]]]><![CDATA[[>") must not appear here]]></valid>

Conclusion

This article demonstrated what the predefined XML entities are and the various circumstances in which they can be used.

Leave a Reply

Your email address will not be published. Required fields are marked *