How to Extract Data from XML in Java

1. Introduction

In a previous article, we looked into parsing an XML file and converting it to DOM (Document Object Model). The XML DOM object itself is not very useful in an application unless it can be used to extract required data. In this article, let us see how to extract data from XML in Java.

We demonstrate two approaches to extracting data from an XML document. The first is straightforward navigation of the DOM structure to extract fragments of data. The second uses XPath to describe the exact information needed in a single expression.

2. Accessing the XML Root Element

The most commonly used type in the DOM API is the Node interface. Every kind of XML artifact is represented as a Node, including elements, attributes, text within elements, CDATA sections, and comments.

The most common type of Node we will be concerned with is the element. An element node has attributes, zero or more child elements, text nodes, etc.

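A Document is a special type of Node obtained by parsing the XML. Use the getDocumentElement() method of a Document to get the XML root element. Avoid getFirstChild() for this purpose: it returns the first child node of any type, which is a comment, processing instruction, or doctype node when one of those precedes the root element.

Element rootElement = document.getDocumentElement();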
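To see why the choice of method matters, here is a minimal sketch that parses a small XML string in which a comment precedes the root element. The sample XML, the class name RootElementDemo, and the parse() helper are assumptions made for illustration; they are not part of the original article.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class RootElementDemo {
    // Hypothetical sample: a comment precedes the root element.
    static final String XML =
        "<?xml version=\"1.0\"?><!-- sample data --><airports><airport/></airports>";

    // Parses an XML string into a DOM Document; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document document = parse(XML);
        Node first = document.getFirstChild();      // first child of any node type
        Node root = document.getDocumentElement();  // the root element itself
        System.out.println("first child:  " + first.getNodeName());
        System.out.println("root element: " + root.getNodeName());
    }
}
```

With this input the first child of the document is the comment node, while getDocumentElement() returns the airports element.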

3. Accessing XML Element Children

Access the list of children of an element with the getChildNodes() method. It returns a list of child nodes, including elements, text nodes, CDATA sections, and comments. The list can be processed like this:

NodeList nlist = node.getChildNodes();
for (int i = 0 ; i < nlist.getLength() ; i++) {
    Node child = nlist.item(i);
    // process the child node here
}

A child element and its text content can be checked as follows. Here, the child element <shortcode>PBI</shortcode> is selected for processing.

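// getLocalName() returns null for text and comment nodes, and also for elements
// when the parser is not namespace-aware (the DocumentBuilderFactory default);
// in that case use getNodeName() instead.
String name = child.getLocalName();
if ( "shortcode".equals(name) ) {
    if ( "PBI".equals(child.getTextContent()) ) {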
        // process element here
    }
}
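Because getChildNodes() also returns whitespace-only text nodes between elements, it can help to filter on the node type first. The following is a minimal sketch of that idea, assuming a small hypothetical XML sample; the class ChildElementsDemo and the helpers childElements() and parseRoot() are names introduced here for illustration. It uses getNodeName() so that it works with the default, non-namespace-aware parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChildElementsDemo {
    // Returns only the element children of a node, skipping text, comments, etc.
    static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<>();
        NodeList nlist = parent.getChildNodes();
        for (int i = 0; i < nlist.getLength(); i++) {
            Node child = nlist.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) child);
            }
        }
        return result;
    }

    // Parses an XML string and returns its root element (hypothetical helper).
    static Element parseRoot(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<airport>\n  <shortcode>PBI</shortcode>\n"
                   + "  <city>West Palm Beach</city>\n</airport>";
        Element airport = parseRoot(xml);
        // The raw child list also counts the whitespace text nodes.
        System.out.println("all child nodes: " + airport.getChildNodes().getLength());
        for (Element child : childElements(airport)) {
            if ("shortcode".equals(child.getNodeName())
                    && "PBI".equals(child.getTextContent())) {
                System.out.println("matched: " + child.getTextContent());
            }
        }
    }
}
```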

4. Generating XML Output

Print the whole XML fragment from a node once it is selected. This includes all the child nodes, text, attributes, etc.

Create a Transformer object from the factory object:

Transformer tform = TransformerFactory.newInstance().newTransformer();

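Pretty-printing the XML helps in visualizing the structure. Enable it as shown below, which specifies an indentation of 2 spaces. Note that the indent-amount key is an Apache Xalan-specific property. The JDK's built-in transformer recognizes it, but other implementations may ignore it.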

tform.setOutputProperty(OutputKeys.INDENT, "yes");
tform.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");

Then generate the XML output from a Node object for printing:

tform.transform(new DOMSource(node), new StreamResult(System.out));
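When the XML needs to go somewhere other than the console, for example into a log message, the same Transformer can write to a StringWriter instead. This is a minimal sketch under that assumption; the class NodeToStringDemo and the helpers toXml() and parseRoot() are hypothetical names introduced for illustration, and the XML declaration is omitted because only a fragment is being printed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeToStringDemo {
    // Serializes a node, including its children and attributes, to an XML string.
    static String toXml(Node node) {
        try {
            Transformer tform = TransformerFactory.newInstance().newTransformer();
            tform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            tform.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            tform.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }

    // Parses an XML string and returns its root element (hypothetical helper).
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Node airport = parseRoot("<airport><shortcode>PBI</shortcode></airport>");
        System.out.println(toXml(airport));
    }
}
```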

5. A More Complex Example

Let us look at a more complex example of XML data extraction with some real-world data. The XML data set we are using is the publicly available TSA airport and checkpoint data (warning: large file download). This data includes airport information such as GPS coordinates and checkpoints.

Let us search this XML data set for airports within specified GPS coordinates: latitude in the range (25, 30) and longitude in the range (-90, -80). We search for matching nodes starting from the root node of the XML.

List<Node> res = new ArrayList<>();
NodeList nlist = rootNode.getChildNodes();
for (int i = 0 ; i < nlist.getLength() ; i++) {
    Node node = nlist.item(i);
    NodeList children = node.getChildNodes();
    boolean foundLat = false, foundLong = false;
    for (int j = 0 ; j < children.getLength() ; j++) {
        Node child = children.item(j);
        String name = child.getLocalName();
        if ( name == null ) continue;
        if ( name.equals("latitude") ) {
            float lat = Float.parseFloat(child.getTextContent());
            if ( lat > 25 && lat < 30 ) foundLat = true;
        } else if ( name.equals("longitude") ) {
            float lng = Float.parseFloat(child.getTextContent());
            if ( lng > -90 && lng < -80 ) foundLong = true;
        }
    }
    if ( foundLat && foundLong ) res.add(node);
}

The code above loops through all elements under the root node and selects those children which match the specified conditions: latitude between (25, 30) and longitude between (-90, -80).

As you can see, the code is quite complex and prone to errors. And this is just for finding nodes for some rather simple conditions.

6. Using XPath to Extract Information

Java provides an XPath API which can be used in conjunction with the XML DOM to extract information from XML easily. XPath is initialized in the application as follows:

XPathFactory xfact = XPathFactory.newInstance();
XPath xpath = xfact.newXPath();

To extract possibly multiple nodes which match an XPath expression, the following method can be used.

Object res = xpath.evaluate(xpathStr, document, XPathConstants.NODESET);

If you know that a single node will match the expression, you can use this method instead.

Object res = xpath.evaluate(xpathStr, document, XPathConstants.NODE);

Maybe you are trying to extract application configuration information from XML? In that case, you might prefer fetching String values in a single call.

String value = xpath.evaluate(xpathStr, document);
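The three calls above can be tried end to end on a small document. The following is a minimal sketch, assuming a hypothetical two-airport sample (the second airport, XYZ in Sampleville, is invented); the class XPathBasicsDemo and its helper methods are names introduced for illustration. It also shows that the result of a NODE query can serve as the context for a further, relative expression.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathBasicsDemo {
    // Hypothetical sample data; only PBI and its city come from the article.
    static final String XML =
        "<airports>"
        + "<airport><shortcode>PBI</shortcode><city>West Palm Beach</city></airport>"
        + "<airport><shortcode>XYZ</shortcode><city>Sampleville</city></airport>"
        + "</airports>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // NODESET: every node matching the expression, returned as a NodeList.
    static int countAirports(Document document) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/airports/airport", document, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath expression", e);
        }
    }

    // NODE: a single match, then a relative expression evaluated against that node.
    static String cityOfPbi(Document document) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node airport = (Node) xpath.evaluate(
                    "/airports/airport[shortcode = 'PBI']", document, XPathConstants.NODE);
            if (airport == null) {
                return "";  // no airport matched
            }
            return xpath.evaluate("city", airport);  // String result in one call
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath expression", e);
        }
    }

    public static void main(String[] args) {
        Document document = parse(XML);
        System.out.println("airports: " + countAirports(document));
        System.out.println("city of PBI: " + cityOfPbi(document));
    }
}
```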

7. Some Examples

To compare with the earlier examples, let us find the airport node where <shortcode> equals "PBI":

String xpathStr = "/airports/airport[shortcode = 'PBI']";
Object res = xpath.evaluate(xpathStr, document, XPathConstants.NODESET);
show((NodeList)res);

This produces the output shown below (partially):

...
  <shortcode>PBI</shortcode>
  <city>West Palm Beach</city>
  <state>FL</state>
  <latitude>26.6831606</latitude>
  <longitude>-80.0955892</longitude>
  <utc>-5</utc>
  <dst>True</dst>
...

And here is the second example: find airports with latitude between (25, 30) and longitude between (-90, -80).

String xpathStr = "/airports/airport[latitude > 25 and latitude < 30 and longitude > -90 and longitude < -80]";
Object res = xpath.evaluate(xpathStr, document, XPathConstants.NODESET);
show((NodeList)res);
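The show() helper used in these examples is not defined in the article. One possible implementation, sketched here as an assumption, prints each node of a NodeList with a Transformer. The sketch also runs the range query against a small hypothetical document: the PBI coordinates come from the output shown earlier, while the second airport (XYZ) and its coordinates are invented so that one entry falls outside the range. The class name XPathRangeDemo and the helper findInRange() are likewise illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathRangeDemo {
    // Hypothetical sample: PBI lies inside the search range, XYZ outside it.
    static final String XML =
        "<airports>"
        + "<airport><shortcode>PBI</shortcode>"
        + "<latitude>26.6831606</latitude><longitude>-80.0955892</longitude></airport>"
        + "<airport><shortcode>XYZ</shortcode>"
        + "<latitude>40.5</latitude><longitude>-73.5</longitude></airport>"
        + "</airports>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Evaluates the article's range expression and returns the matching airports.
    static NodeList findInRange(Document document) {
        String xpathStr = "/airports/airport[latitude > 25 and latitude < 30"
                        + " and longitude > -90 and longitude < -80]";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (NodeList) xpath.evaluate(xpathStr, document, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath expression", e);
        }
    }

    // One possible show(): prints every node in the list as indented XML.
    static void show(NodeList nodes) {
        try {
            Transformer tform = TransformerFactory.newInstance().newTransformer();
            tform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            tform.setOutputProperty(OutputKeys.INDENT, "yes");
            for (int i = 0; i < nodes.getLength(); i++) {
                tform.transform(new DOMSource(nodes.item(i)), new StreamResult(System.out));
                System.out.println();
            }
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not print nodes", e);
        }
    }

    public static void main(String[] args) {
        show(findInRange(parse(XML)));
    }
}
```

XPath converts the text of latitude and longitude to numbers before comparing, so no explicit parsing is needed as it was in the DOM loop.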

8. Summary

This article demonstrated two ways of extracting data from XML documents. The direct way is to navigate the DOM structure and perform the extraction by hand, which is error-prone and sensitive to changes in the XML structure. An easier way is to use an XPath expression to extract the required information.