Load XML into Mysql Using Java

Load XML into MySQL by using Java DOM Parser.

“Never memorize something that you can look up.”
― Albert Einstein

1. Introduction

XML provides the ability to represent hierarchical structure with its parent-child relationships. This enables applications to store structured data in XML for export. Importing this XML data into a database is a bit involved as we shall see in this article. You need to write code to manage the database connection. In addition you need parse the XML and isolate the data that needs to be imported.

Continue reading “Load XML into Mysql Using Java”

How to Modify XML File in Java

1. Introduction

Let us learn how to modify an XML file to remove unwanted information.

One method to remove XML nodes is to use the XML DOM Api to search the XML structure and remove unwanted nodes. While this sounds easy, using the DOM Api is quite hard especially for anything more than trivial searches as this article demonstrates.

An easier method to navigate and remove unwanted Nodes is to use XPath. Even complex search and removal is quite easy as we shall see.

See this article for details on parsing an XML file to obtain the XML Document.

2. Using removeChild() to remove Nodes

Once a particular node is identified for removal, it can be removed quite easily by invoking removeChild() on the parent Node.

static private void removeNode(Node node)
  Node parent = node.getParentNode();
  if ( parent != null ) parent.removeChild(node);

2.1 Saving the Modified XML Document

After the required modifications are done, the XML Document can be saved by using a Transformer.

Initialize the Transformer as shown:

tform = TransformerFactory.newInstance().newTransformer();
tform.setOutputProperty(OutputKeys.INDENT, "yes");
tform.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");

Save the modified XML document quite easily using the transformer instance.

tform.transform(new DOMSource(document), new StreamResult(System.out));

3. Searching the XML Document

The XML data set we are using is the TSA airport and checkpoint data available here. We would like to search this data set for the airport in Mobile, AL (identified as <shortcode>MOB</shortcode> in the data set). The following code checks each node whether it matches the query.

static private boolean testNode(Node node)
    NodeList nlist = node.getChildNodes();
    for (int i = 0 ; i < nlist.getLength() ; i++) {
	Node n = nlist.item(i);
	String name = n.getLocalName();
	if ( name != null && name.equals("shortcode") ) {
	    return n.getTextContent().equals("MOB");
    return false;

Collect the nodes to be removed by searching from the document root.

List<Node> nodes = new ArrayList<>();
NodeList nlist = document.getFirstChild().getChildNodes();
for (int i = 0 ; i < nlist.getLength() ; i++) {
    Node node = nlist.item(i);
    if ( testNode(node) ) nodes.add(node);

As you can see from the implementation of testNode(), complex XML search is hard using just the DOM API.

4. Using XPath to Find and Remove Nodes

XPath can be used to easily query for nodes within an XML document.

An initial setup process is required for using XPath to search.

XPathFactory xfact = XPathFactory.newInstance();
XPath xpath = xfact.newXPath();

Here is a method to query for nodes and remove them from the document.

static private void queryRemoveNodes(String xpathStr)
Object res = xpath.evaluate(xpathStr, document, PathConstants.NODESET);
NodeList nlist = (NodeList)res;
for (int i = 0 ; i < nlist.getLength() ; i++) {
    Node node = nlist.item(i);
    Node parent = node.getParentNode();
    if ( parent != null ) parent.removeChild(node);

The previous example to remove the airport for Mobile, AL is written as:

queryRemoveNode("/airports/airport[shortcode = 'MOB']");

The removed node is:

  <name>Mobile Regional</name>

Furthermore, to remove just the <checkpoints> element from the above node, use the following:

queryRemoveNodes("/airports/airport[shortcode = "MOB"]/checkpoints");

Easily remove a bunch of nodes matching an expression.

queryRemoveNodes("/airports/airport[latitude < 20]");


There are two ways of removing nodes from an XML document. The direct method is to search for nodes using the DOM Api and remove them. An easier way is to use XPath to query and remove the nodes matching even complex queries.

How to Extract Data from XML in Java

1. Introduction

In a previous article, we looked into parsing an XML file and converting it to DOM (Document Object Model). The XML DOM object itself is not very useful in an application unless it can be used to extract required data. In this article, let us see how to extract data from XML in Java.

We demonstrate two approaches to extracting data from the XML document. One is a straightforward navigation of the DOM structure to extract fragments of data. Another way is to use XPath to describe and extract the exact information needed with an expression.

2. Accessing the XML Root Element

The most commonly used class in the DOM API is the Node class. All other types of XML artifacts are represented as a Node. These include elements, attributes, text within elements, CDATA, etc.

The most common type of Node we will be concerned with is the element. An element node has attributes, zero or more child elements, text nodes, etc.

A Document is a special type of Node which is obtained as a result of parsing the XML. Use the getFirstChild() method of a Document to get the XML root element.

Node rootElement = document.getFirstChild();

3. Accessing XML Element Children

Access the list of children of an element with the getChildNodes() method. A list of child nodes including elements, text nodes, CDATA, comments, etc are returned. It can be processed like this:

NodeList nlist = node.getChildNodes();
for (int i = 0 ; i < nlist.getLength() ; i++) {
    Node child = nlist.item(i);
    // process the child node here

A child element and its text contents can be checked as follows: A child element <shortcode>PBI</shortcode> is selected for processing here.

String name = child.getLocalName();
if ( name != null && name.equals("shortcode") ) {
    if ( child.getTextContent().equals("PBI") ) {
        // process element here

4. Generating XML Output

Print the whole XML fragment from a node once it is selected. This includes all the child nodes, text, attributes, etc.

Create a Transformer object from the factory object:

Transformer tform = TransformerFactory.newInstance().newTransformer();

Pretty-printing the XML helps in visualizing the structure. You can enable pretty-printing as shown. Here an indentation of 2 spaces is being specified.

tform.setOutputProperty(OutputKeys.INDENT, "yes");
tform.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");

And generate the XML output from a Node object for printing:

tform.transform(new DOMSource(node), new StreamResult(System.out));

5. A More Complex Example

Let us look at a more complex example of XML data extraction with some real-world data. The XML data set we are using is the publicly available TSA airport and checkpoint data available here (warning: large file download). This data includes airport information including GPS coordinates and checkpoints.

Let us search this XML data set for information within specified GPS coordinates: locate airports within latitudes range of (25, 30), longitude range of (-90, -80). We search for matching nodes from the root node of the XML.

List<Node> res = new ArrayList<>();
NodeList nlist = rootNode.getChildNodes();
for (int i = 0 ; i < nlist.getLength() ; i++) {
    Node node = nlist.item(i);
    NodeList children = node.getChildNodes();
    boolean foundLat = false, foundLong = false;
    for (int j = 0 ; j < children.getLength() ; j++) {
	Node child = children.item(j);
	String name = child.getLocalName();
	if ( name == null ) continue;
	if ( name.equals("latitude") ) {
	    float lat = Float.parseFloat(child.getTextContent());
	    if ( lat > 25 && lat < 30 ) foundLat = true;
	} else if ( name.equals("longitude") ) {
	    float lng = Float.parseFloat(child.getTextContent());
	    if ( lng > -90 && lng < -80 ) foundLong = true;
    if ( foundLat && foundLong ) res.add(node);

The code above loops through all elements under the root node and selects those children which match the specified conditions: latitude between (25, 30) and longitude between (-90, -80).

As you can see, the code is quite complex and prone to errors. And this is just for finding nodes for some rather simple conditions.

6. Using XPath to Extract Information

Java provides an XPath API which can be used in conjunction with the XML DOM to extract information from XML in an easy manner. XPath in initialized with the application as follows:

XPathFactory xfact = XPathFactory.newInstance();
XPath xpath = xfact.newXPath();

To extract possibly multiple nodes which match an XPath expression, the following method can be used.

Object res = xpath.evaluate(xpathStr, document, XPathConstants.NODESET);

If you know that a single node will match the expression, you can use this method instead.

Object res = xpath.evaluate(xpathStr, document, XPathConstants.NODE);

Maybe you are trying to extract application configuration information from XML? In that case, you might prefer fetching String values in a single call.

String value = xpath.evaluate(xpathStr, document);

7. Some Examples

To compare with the earlier examples, let us find the airport node where <shortcode> equals “PBI“:

String xpathStr = "/airports/airport[shortcode = 'PBI']";
Object res = xpath.evaluate(xpathStr, document, XPathConstants.NODESET);

Results in the output shown below (partially):

  <city>West Palm Beach</city>

And here is the second example: find airports with latitude between (25, 30) and longitude between (-90, -80).

String xpathStr = "/airports/airport[latitude > 25 and latitude < 30 and longitude > -90 and longitude < -80]";
Object res = xpath.evaluate(xpathStr, document, XPathConstants.NODESET);


This article demonstrated a couple of ways of extracting data from XML documents. A direct way is to navigate the DOM structure and perform the extraction. This is error prone and sensitive to changes in XML structure. An easier way is to use XPath expression search to extract required information.

What Characters Need to be Escaped in XML Documents?

1. Introduction

Some characters are treated specially when processing XML documents. These are the characters which are used to markup XML syntax; when they appear as a part of a document rather than for syntax markup, they need to be appropriately escaped. These characters are:

"	&quot;		Double quote
'	&apos;		Single quote
<	&lt;		Left angle bracket
>	&gt;		Right angle bracket
&	&amp;		Ampersand

2. Character Data

All text that is not markup constitutes character data of the document. Within character data, “&” and “<” must not appear except when used in markup. However “>”, “”” and “‘” can appear directly within character data without having to be encoded.

<?xml version="1.0"?>
<?xml version="1.0"?>
<?xml version="1.0"?>

However, “>” cannot appear in the form “]]>” unless it is a part of CDATA ending sections.

<?xml version="1.0"?>

The following is valid as “>” has been encoded properly:

<?xml version="1.0"?>

2. Attributes

Within attributes, “>” is valid.

<valid attrib=">"></valid>

However, ‘”‘ is not valid within double quotes; it must be encoded using “&quot;”.

<valid attrib="&quot;"></valid>

Similarly “‘” is not valid within single quotes. It must be encoded as “&apos;”:

<valid attrib='&apos;'></valid>


Comments can appear anywhere in a document outside of markup. Within comments, none of the 5 special characters must be escaped or encoded.

<valid><!-- '"<>& --></valid>

In addition, the string “–” must not appear within comments.

<invalid><!-- Hello -- there --></invalid>

A consequence of this rule is that a comment must not end with “—>” (three dashes followed by a right-angle-bracket). The following is invalid:

<invalid><!-- Hello there ---></invalid>

4. Processing Instructions

Processing Instructions (PI) are used to add instructions for XML processing applications. None of the 5 special characters must be encoded within PI statements.

<valid><?execute <>"'& ?></valid>

A further restriction in the case of processing instructions is that the instruction name must not be the string “xml”; this name is reserved for standardization of the XML specification itself.

<invalid><?xml hello ?></invalid>


CDATA sections are used to escape blocks of text containing characters which would otherwise be recognized as markup. This section begin with the string “<![CDATA[” and end with the string “]]>”. Within a CData section, none of the 5 special characters must be encoded.

<valid><![CDATA[[<greeting>Hello world: <>&'" </greeting>]]></valid>

However, within CData sections, the string “]]>” must not appear except to end the section:

<invalid><![CDATA[[This string("]]>") must not appear here]]></invalid>

The same can be re-written with two CData sections as follows:

<valid><![CDATA[[This string("]]]]><![CDATA[[>") must not appear here]]></valid>


This article demonstrated what the predefined XML entities are and the various circumstances in which they can be used.

Parsing XML in Python

1. Introduction

XML can be parsed in python using the xml.etree.ElementTree library. This article shows you how to parse and extract elements, attributes and text from XML using this library.

While this library is easy to use, it loads the whole XML document into memory and hence may not be suitable for processing large XML files or where run-time memory is an issue.

2. Sample XML

Here is a short snippet of the sample XML we will be working with.

<?xml version="1.0"?>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <description>An in-depth look at creating applications 
    with XML.</description>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>

3. Getting Started

3.1 Parsing an XML File

It is very simple to parse an XML using ElementTree. The following returns an ElementTree object which contains all the XML artifacts.

import xml.etree.ElementTree as ET

tree = ET.parse('catalog.xml')

Once you have the ElementTree object, you can get the root Element object using the tree.getroot() method:

root = tree.getroot()

3.2 Parsing an XML String

We use the ElementTree.fromstring() method to parse an XML string. The method returns root Element directly: a subtle difference compared with the ElementTree.parse() method which returns an ElementTree object.

print ET.fromstring('<a><b>1</b>2</a>')
# prints "<Element 'a' at ...>"

4. Working with Elements

From the Element object, you can get the XML tag name using the tag property.

print root.tag    # outputs "catalog"

Get a list of child Elements from any Element object using element.findall(‘*’). You can find specific children by name by passing the name of the tag e.g. element.findall(‘book’).

For example, the following code recursively processes all the elements in the XML and prints the name of the tag.

def show(elem):
    print elem.tag
    for child in elem.findall('*'):


The above code can be modified to show nicely indented output of the tag names:

def show(elem, indent = 0);
    print ' ' * indent + elem.tag
    for child in elem.findall('*'):
         show(elem, indent + 1)


To find a single element by name, use elem.find(tagName):

print root.find('book').tag    # prints "book"

5. Working with Attributes

XML attributes can be extracted from an Element object using the element.items() method which returns a sequence of name, value pairs. (The name-value pairs are returned in random order, not in the order they appear in the XML.)

for attrName, attrValue in elem.items():
    print attrName + '=' + attrValue

To retrieve a single attribute value by name, use the elem.get(attrName) method:

print root.find('book').get('id')    # prints "bk101"

Get a list of all attribute names defined on the element using elem.keys(). Returns an empty list if no attributes are defined.

print root.find('book').keys()
# prints ['id']

To get all the attributes as a python dictionary, use the elem.attrib property:

print root.find('book').attrib
# prints {'id': 'bk101'}

6. Retrieving Element Text Content

You can retrieve an element’s text content using the elem.text property as follows:

print ET.fromstring('<a>Hello<b>1</b>2</a>').text
# prints "Hello"

print ET.fromstring('<a>Hello<b>1</b>2</a>').find('b').text
#prints "1"

Text appearing after the element’s end tag is retrieved using elem.tail property:

print ET.fromstring('<a>Hello<b>1</b>2</a>').find('b').tail
# prints "2"

7. Using Path to Extract Content

Some of the Element object methods support extracting content by using a syntax similar to XPath:

Retrieve a descendant element:

print root.find('book/author').text
# prints "Gambardella, Matthew"

To obtain the text of the first matching element, use the elem.findtext() method as follows:

print root.findtext('book/author')
#prints "Gambardella, Matthew"

Retrieve and process a list of matching elements using elem.findall():

for e in root.findall('book/author'):
    print e.text

# prints the following
Gambardella, Matthew
Ralls, Kim
Corets, Eva
Corets, Eva
Corets, Eva
Randall, Cynthia
Thurman, Paula
Knorr, Stefan
Kress, Peter
O'Brien, Tim
O'Brien, Tim
Galos, Mike

Find a specific element by position. (Position indexes start with 1).

print root.find('book[2]/author').text
# prints "Ralls, Kim"

Here is an example to find and concatenate all text content using a reduce operation:

reduce(lambda x, y: x + '|' + y.text, root.findall("book/author"), '')

# prints "|Gambardella, Matthew|Ralls, Kim|Corets, Eva|Corets, Eva|Corets, Eva|Randall, Cynthia|Thurman, Paula|Knorr, Stefan|Kress, Peter|O'Brien, Tim|O'Brien, Tim|Galos, Mike"


This article demonstrated some aspects of parsing XML with python. We showed how to parse an XML file or an XML string and extract elements, attributes and text content.

See Also

What does <![CDATA[]]> in XML mean?


CDATA stands for Character Data. A CDATA section in XML is used to escape text containing characters which would otherwise be recognized as markup. It can appear anywhere character data can occur.


A CDATA section is marked up starting with “<![CDATA[” and ending with “]]>“. Any character data (other than “]]>“) can appear within the section without needing to be escaped. For example, angle brackets (<>) and ampersands (&) which indicate XML markup need not be escaped within a CDATA section.

For example, the following is a CDATA section. The angle brackets surrounding “greeting” and “/greeting” need not be escaped. When processing this XML, the parser receives the text “<greeting>Hello, world!</greeting>” as character data and not as markup.

<![CDATA[<greeting>Hello, world!</greeting>]]>

In addition, parameter entity references are recognized within CDATA sections. For example, assume the following parameter entity is defined:

<!ENTITY AnEntity "Sample entity data here">

Within the following CDATA section, the entity reference “%AnEntity” is recognized and the value is replaced within character data passed to the XML processor.

<![CDATA[My value is %AnEntity]]>


A CDATA section may not be nested inside another because “]]>” may not appear directly except to end the CDATA section. The following is invalid:

<![CDATA[We need the ending "]]>" here.]]>

Instead the above can be written in two sections as follows:

<![CDATA[We need the ending "]]]]><![CDATA[>" here.]]>

Attribute Value

Within a Document-Type Definition (DTD), an attribute value may be declared to be of type CDATA as follows:

          src CDATA #REQUIRED>

This declaration states that an img element must have a src attribute whose value type is CDATA.

Where is it used?

CDATA sections are used when larger amounts of verbatim text need to appear within XML documents and processed verbatim. Smaller quantities of such text can be properly encoded to escape the XML characters, but for larger text, it helps to preserve the meaning of the text without having to do so.