Using the SAX API to Parse XML in Java

Learn how to parse XML using the SAX API

“A word to the wise ain’t necessary, it’s the stupid ones who need advice.”
― Bill Cosby

1. Introduction

Java provides three APIs to parse XML: DOM, SAX and StAX. Each API fulfills different requirements, so it is important to know all three. The SAX API is particularly useful when you have large XML documents that you cannot load using the DOM API. It is also useful when you have your own data structures and need to perform processing while parsing the XML. Let us get to know the SAX API.

2. Creating the SAX Parser

The first task is to create a SAX Parser to be used for parsing the XML document. As with the DOM parser, we create a SAX parser from a SAXParserFactory.

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();

To parse an XML document we need a handler object which is an instance of a class derived from DefaultHandler. This class is where you implement the processing needed to handle various events arising from parsing the XML. We cover the implementation of this class in more detail below. For now, assume we have a class called DebugHandler which extends the DefaultHandler. We parse the XML document using the code below.

String xmlFile = ...;
DebugHandler handler = new DebugHandler();
parser.parse(new File(xmlFile), handler);
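
To try the two snippets above end to end, here is a minimal self-contained sketch. It assumes the XML is held in an in-memory string rather than a file, and it wraps checked exceptions for brevity. The class ElementNameCollector and its collect() method are made up for illustration; they are not part of the SAX API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the qualified names of all elements in document order.
class ElementNameCollector extends DefaultHandler {
    private final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        names.add(qName);
    }

    // Parses the given XML text and returns the element names that were seen.
    static List<String> collect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            ElementNameCollector handler = new ElementNameCollector();
            // An InputSource wrapping a Reader stands in for the File used above.
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Assumption for this sketch: any failure is treated as a programming error.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<catalog><book><author>A</author></book></catalog>"));
    }
}
```

Passing a File, as in the snippet above, works the same way; only the first argument to parse() changes.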

3. Implementing the DefaultHandler

We now implement various methods in DefaultHandler to handle the XML events. Our implementation just echoes the various events to the log so we can track how the XML is being processed.

import java.util.logging.Logger;
import org.xml.sax.helpers.DefaultHandler;

public class DebugHandler extends DefaultHandler
{
    static private Logger log = Logger.getLogger("sample");
}

3.1 startDocument() and endDocument()

startDocument() is invoked once, when the parser is about to begin processing the XML document. Likewise, endDocument() is invoked once, after the whole document has been processed.

public class DebugHandler extends DefaultHandler
{
    static private Logger log = Logger.getLogger("sample");

    public void startDocument() {
        log.info("in startDocument()");
    }

    public void endDocument() {
        log.info("in endDocument()");
    }
}

3.2 startElement() and endElement()

startElement() is triggered at the beginning of each element, and endElement() at its end. The callback receives the namespace URI, the local name and the qualified name, along with the element's attributes. The namespace URI and the local name are empty when namespace processing is not being performed.

public void startElement(String uri,String localName,String qName,Attributes attributes) {
    log.info("in startElement(" + uri + "," + localName + "," +qName+")");
}

public void endElement(String uri,String localName,String qName) {
    log.info("in endElement(" + uri + "," + localName + "," +qName+")");
}

The output is shown below.

Dec 08, 2017 10:58:39 AM sample.DebugHandler startDocument
INFO: in startDocument()
Dec 08, 2017 10:58:39 AM sample.DebugHandler startElement
INFO: in startElement(,,catalog)
Dec 08, 2017 10:58:39 AM sample.DebugHandler startElement
INFO: in startElement(,,book)
Dec 08, 2017 10:58:39 AM sample.DebugHandler startElement
INFO: in startElement(,,author)
Dec 08, 2017 10:58:39 AM sample.DebugHandler endElement
INFO: in endElement(,,author)
...
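
The empty URI and local name in this output reflect the factory default, which is not namespace aware. As a rough sketch of the difference, the following enables namespace processing with SAXParserFactory.setNamespaceAware(true) and records the URI and local name of each element. The class name NamespaceDemo and the namespace URI in the sample document are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: records "{namespace URI}local name" for each start and end event.
class NamespaceDemo extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        events.add("start {" + uri + "}" + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end {" + uri + "}" + localName);
    }

    static List<String> parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Without this call, uri and localName arrive as empty strings.
            factory.setNamespaceAware(true);
            NamespaceDemo handler = new NamespaceDemo();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The namespace URI below is a made-up example value.
        String xml = "<c:catalog xmlns:c=\"urn:example:catalog\"><c:book/></c:catalog>";
        parse(xml).forEach(System.out::println);
    }
}
```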

3.3 Listing Attributes

The startElement() method also receives the attributes defined on the element. They can be listed as follows.

public void startElement(String uri,String localName,String qName,Attributes attrs) {
    log.info("in startElement(" + uri + "," + localName + "," +qName+")");
    if ( attrs.getLength() == 0 ) return;
    StringBuilder sbuf = new StringBuilder("Attributes:");
    for (int i = 0, n = attrs.getLength() ; i < n ; i++) {
        sbuf.append('\n').append("  ").append(i+1).append('.')
            .append(attrs.getURI(i)).append(',')
            .append(attrs.getLocalName(i)).append(',')
            .append(attrs.getQName(i)).append('=')
            .append(attrs.getValue(i)).append('[')
            .append(attrs.getType(i)).append(']');
    }
    log.info(sbuf.toString());
}

The attributes are shown for an element as follows:

INFO: Attributes:
  1.,id,id=bk101[CDATA]
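
When only one attribute matters, it can be fetched by its qualified name instead of by index. The sketch below assumes a catalog whose book elements carry an id attribute, as in the output above. The class name BookIdCollector is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: gathers the id attribute of every <book> element.
class BookIdCollector extends DefaultHandler {
    private final List<String> ids = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("book".equals(qName)) {
            // getValue(String) looks the attribute up by qualified name; null means absent.
            String id = attrs.getValue("id");
            if (id != null) {
                ids.add(id);
            }
        }
    }

    static List<String> collect(String xml) {
        try {
            BookIdCollector handler = new BookIdCollector();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.ids;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<catalog><book id=\"bk101\"/><book/></catalog>"));
    }
}
```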

3.4 Using the Document Locator

The SAX API provides a method called setDocumentLocator() which is used to set an object that can be used to obtain the location of the reported events. This method is invoked right at the beginning (even before startDocument()) so you can save this object and use it when the document events are reported.

Here our class has been enhanced to save the locator and use it when reporting events.

public class DebugHandler extends DefaultHandler
{
    static private Logger log = Logger.getLogger("sample");
    private Locator locator = null;

    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    private String getLocation() {
        StringBuilder sbuf = new StringBuilder();
        sbuf.append('[').append(locator.getPublicId())
            .append(',').append(locator.getSystemId())
            .append(',').append(locator.getLineNumber())
            .append(',').append(locator.getColumnNumber())
            .append(']');
        return sbuf.toString();
    }

    public void startDocument() {
        log.info("in startDocument() " + getLocation());
    }
...
}

Here is an example event report from the startDocument() event:

...
Dec 08, 2017 11:44:26 AM sample.DebugHandler startDocument
INFO: in startDocument() [null,file:.../data/books.xml,1,1]
...

As you can see from the startElement() report below, the location reported is that of the character just after the start tag (line 3, column 10).

...
INFO: in startElement(,,catalog) [null,file:.../data/books.xml,3,10]
Dec 08, 2017 11:51:30 AM sample.DebugHandler startElement
...

The locator does NOT report where individual attributes appear. It provides only the location of the startElement() event itself.

And that is not the only disadvantage with the locator implementation. For each event, we would have liked the locator to report the byte offset (or the character offset) of the event from the beginning of the document. However, there is no such support within SAX for this report.
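
Within those limits, the locator is still handy for tagging each element with the line on which its start tag ends. The following is a small sketch of that idea, assuming the parser supplies a locator (SAX encourages this but does not require it). The class name ElementLineRecorder is made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records "name@line" for every start tag.
class ElementLineRecorder extends DefaultHandler {
    private final List<String> lines = new ArrayList<>();
    private Locator locator;

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called before startDocument(); keep the reference for later events.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Fall back to -1 if the parser did not supply a locator.
        int line = (locator != null) ? locator.getLineNumber() : -1;
        lines.add(qName + "@" + line);
    }

    static List<String> record(String xml) {
        try {
            ElementLineRecorder handler = new ElementLineRecorder();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.lines;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(record("<catalog>\n  <book>\n    <author>A</author>\n  </book>\n</catalog>"));
    }
}
```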

3.5 Handling Character Data

Character data is reported by the SAX parser using the methods characters() and ignorableWhitespace(). As the name indicates, ignorableWhitespace() reports white space within the XML that can be safely ignored without compromising the integrity of the XML. This is white space in element content, which the parser can recognize when a DTD describes the content model.

public void characters(char[] ch,int start,int length) {
    String str = new String(ch, start, length);
    log.info("in characters(\"" + str + "\"," + start + "," + length + ")");
}

public void ignorableWhitespace(char[] ch,int start,int length) {
    String str = new String(ch, start, length);
    log.info("in ignorableWhitespace(\"" + str + "\"," + start + "," + length + ")");
}

And here is a sample report from the above two methods.

...
Dec 08, 2017 3:35:16 PM sample.DebugHandler ignorableWhitespace
INFO: in ignorableWhitespace("
      ",60,7)
Dec 08, 2017 3:35:16 PM sample.DebugHandler startElement
INFO: in startElement(,,author) [null,file:/home/sridhar/public_html/blogsrc/novixys/75.xml-sax/data/books.xml,5,15]
Dec 08, 2017 3:35:16 PM sample.DebugHandler characters
INFO: in characters("Gambardella, Matthew",75,20)
...
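
One point worth keeping in mind: a SAX parser is allowed to deliver the text of a single element in several characters() calls, so the text is usually accumulated in a buffer and read in endElement(). The sketch below illustrates that pattern for author elements. The class name AuthorCollector is hypothetical, and the sample input is made up apart from the author name shown in the output above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text content of every <author> element.
class AuthorCollector extends DefaultHandler {
    private final List<String> authors = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inAuthor;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("author".equals(qName)) {
            inAuthor = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per element, so append rather than assign.
        if (inAuthor) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("author".equals(qName)) {
            authors.add(text.toString().trim());
            inAuthor = false;
        }
    }

    static List<String> collect(String xml) {
        try {
            AuthorCollector handler = new AuthorCollector();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.authors;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<catalog><book><author>Gambardella, Matthew</author></book></catalog>"));
    }
}
```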

3.6 Error Handling

SAX provides three methods for complete error handling: warning() for mere warnings, error() for recoverable errors and fatalError() for unrecoverable errors. A SAXParseException object is passed to each of these methods. Handle them as you find appropriate.

...
public void warning(SAXParseException e) {
    log.info("in warning(\"" + e.getMessage() + "\")");
}

public void error(SAXParseException e) {
    log.info("in error(\"" + e.getMessage() + "\")");
}

public void fatalError(SAXParseException e) {
    log.info("in fatalError(\"" + e.getMessage() + "\")");
}
...

In the following example, we have an error in the XML which is reported as a fatalError().

<?xml version="1.0"?>
<!DOCTYPE catalog SYSTEM "catalog.dtd">
<catalog>
   <book id="bk101"
      <author>Gambardella, Matthew</author>
...

The error is reported as follows:

Dec 08, 2017 4:18:04 PM sample.DebugHandler fatalError
INFO: in fatalError("Element type "book" must be followed by either attribute specifications, ">" or "/>".")
Exception in thread "main" org.xml.sax.SAXParseException; systemId: file:.../data/error1.xml; lineNumber: 5; columnNumber: 7; Element type "book" must be followed by either attribute specifications, ">" or "/>".
...
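
The trace above ends with an uncaught SAXParseException because parsing stops after a fatal error. A caller would typically catch that exception and use its location information. Here is a minimal sketch of that approach; the class WellFormedCheck and the convention of returning -1 for a well-formed document are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports where parsing first failed, if it failed at all.
class WellFormedCheck {

    // Returns the line number of the first fatal error, or -1 if the XML parsed cleanly.
    static int firstErrorLine(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // A plain DefaultHandler ignores all events; its fatalError() rethrows the exception.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return -1;
        } catch (SAXParseException e) {
            // The exception carries the location of the problem.
            return e.getLineNumber();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String broken = "<catalog>\n  <book id=\"bk101\"\n    <author>A</author>\n</catalog>";
        System.out.println("first error on line " + firstErrorLine(broken));
    }
}
```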

4. Conclusion

This article presented an example of using the SAX API to parse XML. SAX uses an event-based model to report XML events while parsing the file. Since the whole XML document is not loaded into memory, such an application can be less memory-intensive than one using a DOM-based approach.
