Novixys Software Dev Blog | Page 14 of 15 |

How do I Create a Java String from the Contents of a File?

1. Introduction

Here we present a few ways to read the whole contents of a file into a String in Java.

2. Using java.nio.file in Java 7+

It is quite simple to load the whole file into a String using the java.nio.file package:

String str = new String(Files.readAllBytes(Paths.get(pathname)),
                        StandardCharsets.UTF_8);

Here is how the code works. Read all the bytes into a byte buffer using the Files and Paths available in the java.nio.file package.

byte[] buf = Files.readAllBytes(Paths.get(pathname));

Convert the byte buffer into a String by specifying the character set.

String str = new String(buf, StandardCharsets.UTF_8);
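Putting the two steps together, a minimal self-contained sketch might look like the following. The class name ReadWholeFile and the method name readFile are illustrative choices, not names from the original code. The demo writes a small temporary file so that it can run anywhere.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ReadWholeFile {
    // Reads the entire file at pathname and decodes its bytes as UTF-8.
    static String readFile(String pathname) throws IOException {
        byte[] buf = Files.readAllBytes(Paths.get(pathname));
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Create a small temporary file to read back (demo data only).
        Path tmp = Files.createTempFile("sample", ".txt");
        try {
            Files.write(tmp, "h\u00e9llo\nworld\n".getBytes(StandardCharsets.UTF_8));
            System.out.print(readFile(tmp.toString()));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because the whole file is held in a byte array and then copied into a String, this approach is best suited to files that fit comfortably in memory.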

3. Scan for end-of-input

Another way to read the whole file into a String is to use the Scanner class. Create a scanner with the file as input, set the appropriate delimiter and read the next token.

Note: the actual delimiter used in the code below is the beginning-of-input marker which will not match anywhere other than the beginning of input.

Scanner scanner = null;
try {
    scanner = new Scanner(new File(pathname), "UTF-8");
    return scanner.useDelimiter("\\A").next();
} finally {
    if ( scanner != null ) scanner.close();
}
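One detail worth guarding against: with an empty file the scanner has no token at all, so next() throws NoSuchElementException. The sketch below is a variant of the snippet above that uses try-with-resources and returns an empty string in that case. The class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.Scanner;

class ScannerRead {
    // Reads the whole file as one token; returns "" for an empty file.
    static String readFile(String pathname) throws IOException {
        try (Scanner scanner = new Scanner(new File(pathname), "UTF-8")) {
            scanner.useDelimiter("\\A");
            // An empty file yields no token, so check before calling next().
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.print(readFile(args[0]));
        } else {
            System.err.println("usage: ScannerRead <file>");
        }
    }
}
```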

4. Memory Mapped File Reading

This method maps the file contents directly into memory using the MappedByteBuffer class. Memory mapping the contents directly might lead one to expect enhanced performance. However this advantage is only available if the buffer is used directly. In our case, since we are creating a String from the contents of the file, the speed advantage of the memory mapped buffers is probably not visible.

static private String readFile3(String pathname)
    throws java.io.IOException
{
    File f = new File(pathname);
    RandomAccessFile file = new RandomAccessFile(pathname, "r");
    MappedByteBuffer buffer = file.getChannel().map(MapMode.READ_ONLY,
						    0,
						    f.length());
    file.close();
    return new StringBuilder(StandardCharsets.UTF_8.decode(buffer))
	.toString();
}

ByteBuffer provides a method asCharBuffer() which returns a “view” of the byte buffer as a character buffer. However, this method offers no way to specify the encoding used to convert bytes to characters; it simply treats each pair of bytes as one 16-bit char. The correct way to convert a ByteBuffer to a CharBuffer is to use Charset.decode(ByteBuffer) with the appropriate Charset instance.
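To make the difference concrete, here is a small sketch that decodes the same bytes both ways. The class name DecodeBuffer and the helper decodeUtf8 are made up for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

class DecodeBuffer {
    // Decodes the remaining bytes of the buffer as UTF-8 text.
    static String decodeUtf8(ByteBuffer buffer) {
        CharBuffer chars = StandardCharsets.UTF_8.decode(buffer);
        return chars.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "hi".getBytes(StandardCharsets.UTF_8);

        // Charset.decode honours the encoding of the bytes: prints "hi".
        System.out.println(decodeUtf8(ByteBuffer.wrap(bytes)));

        // asCharBuffer() pairs up bytes as 16-bit chars, so the two
        // UTF-8 bytes of "hi" collapse into a single, unrelated char.
        CharBuffer view = ByteBuffer.wrap(bytes).asCharBuffer();
        System.out.println(view.length()); // prints 1
    }
}
```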

5. Simple Way Using java.io

Of course, there is always the “old” way (pre-Java 1.7) of reading a whole file into a String: reading the characters in a loop and appending to a buffer.

static private String readFile3(String pathname)
    throws java.io.IOException
{
    FileReader in = null;
    try {
	in = new FileReader(pathname);
	char[] buf = new char[2048];
	int len;
	StringBuilder sbuf = new StringBuilder();
	while ((len = in.read(buf, 0, buf.length)) != -1) {
	    sbuf.append(buf, 0, len);
	}
	return sbuf.toString();
    } finally {
	if ( in != null ) in.close();
    }
}

6. Benchmarking Various Approaches

Since we have several methods of reading a whole file into a string, it is interesting to see how the methods stack up against one another in performance. To this end, we implemented a simple benchmarking method using System.currentTimeMillis(). The following is the average time for each run over 1000 runs of the method.

simple       235 ms for 1000 iters: 0.235000 ms/op
nio          213 ms for 1000 iters: 0.213000 ms/op
scanner      629 ms for 1000 iters: 0.629000 ms/op
mmap         285 ms for 1000 iters: 0.285000 ms/op

For the set of conditions under which the application ran, we can conclude that the NIO method is the fastest followed by the Simple method. Slowest is the Scanner method which is somewhat expected since a regular expression search is involved. A note of warning: do not use these benchmark numbers to pick the method to use. Rather use the method closest to your paradigm of problem solving.
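The benchmarking code itself is not shown in this article. A timing loop along the following lines, built on System.currentTimeMillis() as described, is one way such a measurement could be made. The SimpleBench class, the time helper and the stand-in workload are assumptions for illustration, not the harness that produced the numbers above.

```java
import java.util.concurrent.Callable;

class SimpleBench {
    // Runs the task iters times and returns the elapsed wall-clock time in ms.
    static long time(Callable<?> task, int iters) throws Exception {
        long start = System.currentTimeMillis();
        for (int i = 0; i < iters; i++) {
            task.call();
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
        int iters = 1000;
        // Stand-in workload; substitute one of the file-reading methods here.
        long elapsed = time(() -> new StringBuilder("abc").reverse().toString(), iters);
        System.out.printf("%-12s %d ms for %d iters: %f ms/op%n",
                          "demo", elapsed, iters, (double) elapsed / iters);
    }
}
```

A loop like this ignores JIT warm-up and file-system caching, which is one more reason to treat the resulting numbers as rough indications only.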

Conclusion

You are now aware of various methods of reading the whole contents of a file into a String. Pick whatever suits you best and use it!

How to Parse XML in Java using a DOM Parser

1. Introduction

Let us learn how to parse XML in Java using a DOM parser.

DOM stands for Document Object Model. Parsing an XML file using a DOM Parser refers to obtaining a DOM Object from the XML data. The DOM object can then be queried for various XML artifacts like elements, attributes, text nodes, etc.

2. XML Parsing in Java

Java provides three different methods for parsing XML.

DOM Parsing: Parsing XML for DOM refers to obtaining a tree of XML nodes which can then be queried for required information. Here the entire XML tree is loaded into memory and used for queries and updates.
SAX Parsing: Simple API for XML (SAX) is an event-driven style of parsing wherein the parser fires events on a registered event handler when it encounters XML nodes and attributes. The event handler can then respond to the events and extract whatever information is needed.
StAX: Streaming API for XML is a pull-type parser for XML wherein the application “pulls” the information from the XML as needed and acts on this information. This is in contrast to SAX parsing, which can be viewed as a push-type parser.

3. Creating the XML DOM Parser

An XML parser needs to be created to parse XML. It is done as follows.

Create a DocumentBuilderFactory and configure it to your requirements as shown. If you need namespace processing, turn on namespace awareness with setNamespaceAware(true).

To perform DTD validation on the XML document, turn on validation using setValidating(true).

Note that this does not refer to validating the XML with the W3C XML Schema or RELAX NG. See below for more.

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
factory.setXIncludeAware(true);

3.1. XML Schema Validation

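To validate an XML document against a schema defined in W3C XML Schema or RELAX NG, you need to create a Schema object from a schema file and associate it with the DocumentBuilderFactory object. The SchemaFactory must be told which schema language to use; the example below uses W3C XML Schema, and schemaFilePath stands for the path to your schema file.

SchemaFactory sfactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = sfactory.newSchema(new File(schemaFilePath));
/* DocumentBuilderFactory */ factory.setSchema(schema);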

Once the factory is configured to your satisfaction, you can create the DOM Parser.

DocumentBuilder parser = factory.newDocumentBuilder();

4. Parsing XML

Parse an XML file to create the DOM object using the DocumentBuilder.

Document document = parser.parse(new File(xmlFilePath));

Need to parse XML in a String? Construct a StringReader and use an InputSource.

String str = ...;
StringReader reader = new StringReader(str);
InputSource in = new InputSource(reader);
Document document = parser.parse(in);

How about parsing XML from the jar resources folder? Use getResourceAsStream() to get an InputStream and pass it to the parse() method.

String resPath = "/xml/data.xml";
InputStream in = sample.class.getResourceAsStream(resPath);
if ( in == null )
    throw new Exception("resource not found: " + resPath);
Document document = parser.parse(in);
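The factory, parser and parse steps can be tied together in one small program. The following sketch parses XML held in a String and then queries the resulting DOM for elements by tag name. The class name DomParseDemo, the titles helper and the sample XML are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomParseDemo {
    // Parses the XML text and returns the text of every <title> element.
    static String[] titles(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder parser = factory.newDocumentBuilder();
        Document document = parser.parse(new InputSource(new StringReader(xml)));

        // Query the in-memory tree for all elements named "title".
        NodeList nodes = document.getElementsByTagName("title");
        String[] result = new String[nodes.getLength()];
        for (int i = 0; i < nodes.getLength(); i++) {
            result[i] = nodes.item(i).getTextContent();
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<catalog><book><title>Midnight Rain</title></book></catalog>";
        for (String t : titles(xml)) {
            System.out.println(t);
        }
    }
}
```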

Conclusion

This article explained how to parse XML using a DOM Parser in Java. A DOM Parser can be used when the XML file is small enough to be loaded completely into memory. If the XML file is too large, other Java APIs are available for parsing the XML.

How to Read a File from Resources Folder in Java

1. Introduction

When you build a Java project and pack it into a jar (or a war), the files under the resources folder are included in the jar. These files may include configuration files, scripts and other resources needed at run time. When the software is executed, it may need to load the contents of these files for some kind of processing (properties, SQL statements and so on). In this article, we show you how to load these resources when the program is running.

2. Packaging Resources

In a Maven project, resource files are kept under src/main/resources. Maven packs all the files and folders under main/resources into the jar file at the root. You can access these files and folders from your Java code as shown below.

3. Loading the Resources

The following code snippet shows how to load resources packed thus into the jar or war file:

String respath = "/poems/Frost.txt";
InputStream in = sample2.class.getResourceAsStream(respath);
if ( in == null )
    throw new Exception("resource not found: " + respath);

Using the method Class.getResourceAsStream(String), you can get an InputStream to read the resource. The method returns null if the resource cannot be found or loaded.

To read binary resources, you can use the InputStream instance directly. For reading a text resource, you can convert it to a Reader instance, possibly specifying the character encoding:

InputStreamReader inr = new InputStreamReader(in, "UTF-8");
int len;
char cbuf[] = new char[2048];
while ((len = inr.read(cbuf, 0, cbuf.length)) != -1) {
    // do something with cbuf
}
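The lookup and the read loop can be combined into a single helper that returns the resource as a String. This is a sketch only: the class name ResourceText and the read method are hypothetical, and the resource is assumed to be UTF-8 text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ResourceText {
    // Reads a text resource, resolved via the given class, into a String.
    static String read(Class<?> owner, String respath) throws IOException {
        InputStream in = owner.getResourceAsStream(respath);
        if (in == null) {
            throw new IOException("resource not found: " + respath);
        }
        // Closing the reader also closes the underlying stream.
        try (Reader inr = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder sbuf = new StringBuilder();
            char[] cbuf = new char[2048];
            int len;
            while ((len = inr.read(cbuf, 0, cbuf.length)) != -1) {
                sbuf.append(cbuf, 0, len);
            }
            return sbuf.toString();
        }
    }

    public static void main(String[] args) {
        try {
            // "/poems/Frost.txt" is the resource path used in this article.
            System.out.println(read(ResourceText.class, "/poems/Frost.txt"));
        } catch (IOException e) {
            System.err.println(e.getMessage());
        }
    }
}
```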

4. Using Absolute Path of Resource

To load a resource whose full path from the root of the jar file is known, use the full path starting with a “/”.

InputStream in = sample1.class.getResourceAsStream("/poems/Frost.txt");

5. Loading from Relative Paths

When you use a relative path (not starting with a “/”) to load a resource, it is loaded relative to the package of the class on which getResourceAsStream() is invoked. For example, to load “app.properties” located alongside the invoking class, do not start the path with a “/”.

InputStream in = sample1.class.getResourceAsStream("app.properties");

Conclusion

In this article, we demonstrated how to use the Class.getResourceAsStream() method to load resources from the jar file or war file. An absolute resource path is resolved from the root of the jar file while a relative path is resolved with respect to the loading class.

How to Convert UTF-16 Text File to UTF-8 in Java?

1. Introduction

In this article, we show how to convert a text file from UTF-16 encoding to UTF-8. Such a conversion might be required because certain tools can only read UTF-8 text. Furthermore, the conversion procedure demonstrated here can be applied to convert a text file from any supported encoding to another.

UTF-8 is a character encoding that can represent all characters (or code points) defined by Unicode. It is designed to be backward compatible with legacy encodings such as ASCII.

UTF-16 is another character encoding that encodes characters in one or two 16-bit code units whereas UTF-8 encodes characters in a variable number of 8-bit code units.
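The difference in code unit size is easy to observe by encoding a few characters both ways and counting bytes. In this sketch the class name CodeUnits and its helpers are illustrative. UTF-16BE is used so that no byte-order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

class CodeUnits {
    // Number of bytes needed to encode s in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to encode s in UTF-16 (big-endian, no BOM).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // "A", e-acute, the euro sign, and an emoji outside the BMP.
        String[] samples = { "A", "\u00e9", "\u20ac", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.println(utf8Length(s) + " bytes in UTF-8, "
                               + utf16Length(s) + " bytes in UTF-16");
        }
    }
}
```

The four samples take 1, 2, 3 and 4 bytes in UTF-8, against 2, 2, 2 and 4 bytes in UTF-16.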

2. Supported Character Sets

You can find the character sets supported by the JVM using the class java.nio.charset.Charset as follows:

for (Map.Entry e : Charset.availableCharsets().entrySet()) {
     System.out.println(e.getKey());
}

// prints the following
Big5
Big5-HKSCS
CESU-8
IBM-Thai
...
US-ASCII
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UTF-8
...

3. Conversion Using java.io Classes

Java provides the java.io.InputStreamReader class as a bridge from byte streams to character streams. Open the file using this class to be able to read character buffers in the specified encoding:

Reader in = new InputStreamReader(new FileInputStream(infile), "UTF-16");

Analogously, the class java.io.OutputStreamWriter acts as a bridge from character streams to byte streams. Create a Writer with this class to be able to write bytes to the file:

Writer out = new OutputStreamWriter(new FileOutputStream(outfile), "UTF-8");

With the Reader and Writer in place, it is trivial to copy data from the input file to the output file:

char cbuf[] = new char[2048];
int len;
while ((len = in.read(cbuf, 0, cbuf.length)) != -1) {
    out.write(cbuf, 0, len);
}

And that’s it! You have successfully read and converted data from UTF-16 to UTF-8. You can use this code to perform the conversion between any two character sets supported by your JVM.
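The snippets above leave out closing the streams. A complete version, sketched here with try-with-resources so that both files are closed (and the writer flushed) even on error, could look like this. The class name Utf16ToUtf8 and the convert method are illustrative names.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf16ToUtf8 {
    // Copies infile (UTF-16) to outfile, re-encoding the text as UTF-8.
    static void convert(File infile, File outfile) throws IOException {
        try (Reader in = new InputStreamReader(
                     new FileInputStream(infile), StandardCharsets.UTF_16);
             Writer out = new OutputStreamWriter(
                     new FileOutputStream(outfile), StandardCharsets.UTF_8)) {
            char[] cbuf = new char[2048];
            int len;
            while ((len = in.read(cbuf, 0, cbuf.length)) != -1) {
                out.write(cbuf, 0, len);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Utf16ToUtf8 <infile> <outfile>");
            return;
        }
        convert(new File(args[0]), new File(args[1]));
    }
}
```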

4. Using String for Converting Bytes

Sometimes, you may have a byte array which you need converted and output in a specific encoding. You can use the String class for these cases as shown below. First convert the byte array into a String:

String str = new String(bytes, 0, len, "UTF-16");

Next, obtain the bytes in the required encoding by using the String.getBytes(String) method:

byte[] outbytes = str.getBytes("UTF-8");

Write the byte array to an OutputStream:

OutputStream out = new FileOutputStream(outfile);
out.write(outbytes);
out.close();

Note that while you could use the String class as shown to convert bytes, you should prefer using Reader/Writer combination when possible to avoid problems with multi-byte characters. Specifically, the byte array you have read may contain an incomplete multi-byte character at the beginning or the end. This may lead to character encoding errors.
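The multi-byte problem can be reproduced in a few lines. The sketch below splits the two UTF-8 bytes of a single character across two chunks and decodes each chunk on its own; the names SplitCharDemo and decodeInTwoChunks are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class SplitCharDemo {
    // Decodes the two parts of the array separately and joins the results.
    static String decodeInTwoChunks(byte[] bytes, int splitAt) {
        String first = new String(bytes, 0, splitAt, StandardCharsets.UTF_8);
        String second = new String(bytes, splitAt, bytes.length - splitAt,
                                   StandardCharsets.UTF_8);
        return first + second;
    }

    public static void main(String[] args) {
        // e-acute is encoded in UTF-8 as two bytes: C3 A9.
        byte[] bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);

        String whole = new String(bytes, StandardCharsets.UTF_8);
        String chunked = decodeInTwoChunks(bytes, 1);

        System.out.println(whole.equals("\u00e9"));     // true: decoded intact
        System.out.println(chunked.contains("\uFFFD")); // true: replacement chars
    }
}
```

A Reader avoids this because its decoder carries the incomplete byte sequence over to the next read.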

Conclusion

When you need to convert text from one character encoding to another in Java, you have several options:

Using InputStreamReader and OutputStreamWriter bridge classes for conversion.
Using the String class directly with specified encoding.
A more advanced option is to use the CharsetEncoder and CharsetDecoder classes (not presented in this article).

Difference Between HashMap and Hashtable in Java

1. Introduction

Java provides several ways of storing key-value maps (also known as dictionaries). The most common ones are java.util.HashMap and java.util.Hashtable. Let us explore the difference between these two classes.

2. Synchronization

Synchronization is a mechanism in Java for preventing multiple threads from interfering with each other and eliminating memory consistency errors.

When one variable (a resource) is visible to multiple threads at the same time, consistency issues arise when one thread attempts to modify the value while another thread is accessing it. To prevent these issues, some form of synchronization must be used.

While synchronization helps in eliminating consistency errors, it adds an overhead when used. In a single-threaded program (or when you can guarantee that a single thread will access the resource), you can use HashMap to eliminate this overhead. Create a HashMap as follows:

HashMap<String,Object> map = new HashMap<>();
map.put("currentTime", new Date());

However, when accessing or modifying a dictionary shared between multiple threads, you need a synchronized map such as Hashtable. The following shows how to create a Hashtable:

Hashtable<String,Integer> tbl = new Hashtable<>();
tbl.put("count", 32);

3. Using `Null` Keys or Values

When your dictionary needs to contain null keys or values, you cannot use Hashtable since this is not allowed. You must use HashMap in this case.

If you need multiple threads reading or writing the HashMap, you can wrap the HashMap using Collections.synchronizedMap() as follows:

Map<String,Object> map = Collections.synchronizedMap(new HashMap<>());
map.put(...);

A HashMap can contain one null key and any number of nulls for values.
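The difference in null handling can be checked directly. In this sketch, the helper acceptsNullKey (a hypothetical name) tries to insert a null key and reports whether the map accepted it.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

class NullKeyDemo {
    // Returns true if the map accepts a null key,
    // false if it rejects it with a NullPointerException.
    static boolean acceptsNullKey(Map<String, Object> map) {
        try {
            map.put(null, "value");
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("HashMap:   " + acceptsNullKey(new HashMap<>()));   // true
        System.out.println("Hashtable: " + acceptsNullKey(new Hashtable<>())); // false
    }
}
```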

4. Predictable Iteration Order

A subclass of HashMap is LinkedHashMap which maintains a doubly-linked list of the entries in the Map. This allows traversal of the Map entries in a predictable order (in the order that the entries were inserted into the Map). If you need such a predictable ordering of the entries, then you can easily replace the HashMap with a LinkedHashMap as follows:

HashMap<String,Object> map = new LinkedHashMap<>();
map.put(...);

When using a Hashtable, such a predictable iteration order is not possible. If this is required, use a LinkedHashMap with a Collections.synchronizedMap() wrapper as above.
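A sketch of that combination follows. The class name OrderedSyncMap and the keysInOrder helper are illustrative. Note that iterating over a synchronized wrapper still requires holding the map's lock manually, as the Collections.synchronizedMap() documentation points out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OrderedSyncMap {
    // Inserts the keys into a synchronized LinkedHashMap and
    // returns them in iteration order (which is insertion order).
    static List<String> keysInOrder(String... keys) {
        Map<String, Object> map =
            Collections.synchronizedMap(new LinkedHashMap<>());
        for (String k : keys) {
            map.put(k, k.length());
        }
        // Iteration is not atomic, so guard it with the map's own lock.
        synchronized (map) {
            return new ArrayList<>(map.keySet());
        }
    }

    public static void main(String[] args) {
        System.out.println(keysInOrder("banana", "apple", "cherry"));
        // prints [banana, apple, cherry]
    }
}
```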

5. Iterating using Enumeration

While both HashMap and Hashtable support iteration over the entries using the entrySet(), Hashtable also provides an Enumeration of the values using the Hashtable.elements() method. In addition, the Hashtable.keys() method returns an Enumeration over the keys of the Hashtable.

Hashtable<String,Object> tbl = ...;
for(Enumeration<String> keys = tbl.keys() ; keys.hasMoreElements() ; ) {
  System.out.println(keys.nextElement());
}

Conclusion

Here is how you can decide when to use HashMap or Hashtable:

For using as a shared resource between multiple threads in a single program, a Hashtable is preferred.
When the dictionary needs to contain null keys or values, a HashMap must be used.
A HashMap can be used in a multi-threaded environment by wrapping it with Collections.synchronizedMap().

Converting String to Int in Java

1. Introduction

There are several ways of converting a String to an integer in Java. We present a few of them here with a discussion of the pros and cons of each.

2. Integer.parseInt()

Integer.parseInt(String) can parse a string and return a plain int value. The string value is parsed assuming a radix of 10. The string can contain “+” and “-” characters at the start to indicate a positive or negative number.

int value = Integer.parseInt("25"); // returns 25

int value = Integer.parseInt("-43"); // returns -43

int value = Integer.parseInt("+9061"); // returns 9061

Illegal characters within the string (including the period “.”) result in a NumberFormatException. Additionally, a string containing a number larger than Integer.MAX_VALUE (2³¹ - 1) also results in a NumberFormatException.

// throws NumberFormatException -- contains "."
int value = Integer.parseInt("25.0");

// throws NumberFormatException -- contains text
int value = Integer.parseInt("93hello");

// throws NumberFormatException -- too large
int value = Integer.parseInt("2367423890");
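Since NumberFormatException is unchecked, it is easy to forget to handle it when parsing untrusted input. One possible pattern, sketched here with a hypothetical tryParseInt helper, is to convert the exception into an empty OptionalInt (available from Java 8).

```java
import java.util.OptionalInt;

class SafeParse {
    // Returns the parsed value, or an empty OptionalInt if s is not a valid int.
    static OptionalInt tryParseInt(String s) {
        try {
            return OptionalInt.of(Integer.parseInt(s));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParseInt("25").isPresent());         // true
        System.out.println(tryParseInt("25.0").isPresent());       // false
        System.out.println(tryParseInt("2367423890").isPresent()); // false
    }
}
```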

To explicitly specify the radix, use Integer.parseInt(String,int) and pass the radix as the second argument.

// returns 443
int value = Integer.parseInt("673", 8);

// throws NumberFormatException -- contains character "9" not valid for radix 8
int value = Integer.parseInt("9061", 8);

// returns 70966758 -- "h" and "i" are valid characters for radix 20.
int value = Integer.parseInt("123aghi", 20);

3. Integer.valueOf()

The static method Integer.valueOf() works similarly to Integer.parseInt() with the difference that the method returns an Integer object rather than an int value.

// returns 733
Integer value = Integer.valueOf("733");

Use this method when you need an Integer object rather than a bare int. This method invokes Integer.parseInt(String) and creates an Integer from the result.

4. Integer.decode()

For parsing an integer starting with one of these prefixes: “0” for octal, or “0x”, “0X” and “#” for hex, you can use the method Integer.decode(String). An optional “+” or “-” sign can precede the number. All of the following formats are supported by this method:

Sign(opt) DecimalNumeral
Sign(opt) 0x HexDigits
Sign(opt) 0X HexDigits
Sign(opt) # HexDigits
Sign(opt) 0 OctalDigits

As with Integer.valueOf(), this method returns an Integer object rather than a plain int. Some examples follow:

// returns 53
Integer value = Integer.decode("0x35");

// returns 1194684
Integer value = Integer.decode("#123abc");

// throws NumberFormatException -- value too large
Integer value = Integer.decode("#123abcdef");

// returns -231
Integer value = Integer.decode("-0347");

5. Convert Large Values into Long

When the value being parsed does not fit in an int (that is, it exceeds Integer.MAX_VALUE, 2³¹-1), a NumberFormatException is thrown. In these cases, you can use the analogous methods of the Long class: Long.parseLong(String), Long.valueOf(String) and Long.decode(String). These work similarly to their Integer counterparts but return a Long object (or a long in the case of Long.parseLong(String)). The upper limit of a long is Long.MAX_VALUE (defined to be 2⁶³-1).

// returns 378943640350
long value = Long.parseLong("378943640350");

// returns 3935157603823
Long value = Long.decode("0x39439abcdef");

6. Use BigInteger for Larger Numbers

Java provides another numeric type: java.math.BigInteger with an arbitrary precision. A disadvantage of using BigInteger is that common operations like adding, subtracting, etc require method invocation and cannot be used with operators like “+“, “-“, etc.

To convert a String to a BigInteger, just use the constructor as follows:

BigInteger bi = new BigInteger("3489534895489358943");

To specify a radix different than 10, use a different constructor:

BigInteger bi = new BigInteger("324789045345498589", 12);
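To illustrate the point about method invocation, here is a short sketch of arithmetic on BigInteger values. The class name BigIntegerDemo and the sumOfSquares helper are invented for this example.

```java
import java.math.BigInteger;

class BigIntegerDemo {
    // Computes a*a + b*b using method calls in place of the * and + operators.
    static BigInteger sumOfSquares(String a, String b) {
        BigInteger x = new BigInteger(a);
        BigInteger y = new BigInteger(b);
        return x.multiply(x).add(y.multiply(y));
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares("3", "4")); // prints 25
        // Far beyond the range of a long:
        System.out.println(sumOfSquares("100000000000000000000", "0"));
    }
}
```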

7. Parse for Numbers Within Text

To parse for numbers interspersed with arbitrary text, you can use the java.util.Scanner class as follows:

String str = "hello123: we have a 1000 worlds out there.";
Scanner scanner = new Scanner(str).useDelimiter("\\D+");
while (scanner.hasNextInt())
  System.out.printf("(%1$d) ", scanner.nextInt());
// prints "(123) (1000)"

This method offers a powerful way of parsing for numbers, although it comes with the expense of using a regular expression scanner.

Summary

To summarize, there are various methods of converting a string to an int in java.

Integer.parseInt() is the simplest and returns an int.
Integer.valueOf() is similar but returns an Integer object.
Integer.decode() can parse numbers starting with “0x” and “0” as hex and octal respectively.
For larger numbers, use the corresponding methods in the Long class.
And for arbitrary precision integers, use the BigInteger class.
Finally, to parse arbitrary text for numbers, we can use the java.util.Scanner class with a regular expression.

InputStream to String Conversion in Java

1. Introduction

There are several ways of converting an InputStream to a String in java. Maybe you want to read the data and write it to a log file or do further processing. Here we look at several ways of accomplishing this task.

2. With InputStreamReader

Here is a simple implementation which uses the InputStreamReader to convert from bytes to characters. The code uses the platform default charset to decode the bytes. It reads input in chunks and appends the converted string to a StringBuilder.

private static String inputStreamToString(InputStream in)
    throws java.io.IOException
{
    BufferedReader br = null;
    try {
	InputStreamReader isr = new InputStreamReader(in);
	br = new BufferedReader(isr);
	char cbuf[] = new char[2048];
	int len;
	StringBuilder sbuf = new StringBuilder();
	while ((len = br.read(cbuf, 0, cbuf.length)) != -1)
	    sbuf.append(cbuf, 0, len);
	return sbuf.toString();
    } finally {
	if ( br != null ) br.close();
    }
}

3. Character Set Conversion

When converting input from a character set that is different from the platform default, you must specify the character set as follows:

private static String inputStreamToString(InputStream in,String charsetName)
    throws java.io.IOException
{
    BufferedReader br = null;
    try {
	InputStreamReader isr = new InputStreamReader(in, charsetName);
	br = new BufferedReader(isr);
	char cbuf[] = new char[2048];
	int len;
	StringBuilder sbuf = new StringBuilder();
	while ((len = br.read(cbuf, 0, cbuf.length)) != -1)
	    sbuf.append(cbuf, 0, len);
	return sbuf.toString();
    } finally {
	if ( br != null ) br.close();
    }
}

4. Using try-with-resources

When using JDK 1.7 or later, you can use the try-with-resources block to eliminate some boilerplate code for exception handling:

private static String inputStreamToString(InputStream in,String charsetName)
    throws java.io.IOException
{
    try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charsetName))) {
	char cbuf[] = new char[2048];
	int len;
	StringBuilder sbuf = new StringBuilder();
	while ((len = br.read(cbuf, 0, cbuf.length)) != -1)
		sbuf.append(cbuf, 0, len);
	return sbuf.toString();
    }
}

The try-with-resources block is used to automatically close resources when the block exits (whether normally or due to an exception).

try (BufferedReader br =
         new BufferedReader(new InputStreamReader(in, charsetName))) {
    // use the resource br here
}

5. With ByteArrayOutputStream

Another option for converting InputStream to String uses the ByteArrayOutputStream. Here you can accumulate the bytes read from the InputStream and perform the final conversion to the desired character set.

private static String inputStreamToString(InputStream in,String charsetName)
    throws java.io.IOException
{
    try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
	byte buf[] = new byte[2048];
	int len;
	while ((len = in.read(buf)) != -1) out.write(buf, 0, len);
	return out.toString(charsetName);
    }
}
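For completeness, here is the same method wrapped in a small runnable program. The class name StreamToString is illustrative, and a ByteArrayInputStream stands in for whatever InputStream you actually have.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class StreamToString {
    // Accumulates all bytes from the stream, then decodes them in one step.
    static String inputStreamToString(InputStream in, String charsetName)
            throws IOException {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[2048];
            int len;
            while ((len = in.read(buf)) != -1) {
                out.write(buf, 0, len);
            }
            return out.toString(charsetName);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo input: the UTF-8 bytes of a short string.
        byte[] data = "caf\u00e9 au lait".getBytes("UTF-8");
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(inputStreamToString(in, "UTF-8"));
    }
}
```

Because decoding happens only once, after all bytes are collected, a multi-byte character that straddles two reads is still decoded correctly.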

6. Using Apache Commons IO

Converting an InputStream to String can be achieved in a single line by using Apache Commons IO:

private static String inputStreamToString(InputStream in,String charsetName)
    throws java.io.IOException
{
    return IOUtils.toString(in, charsetName);
}

If you are using Maven as your build system, you need the following dependency:

<dependencies>
  <dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.4</version>
  </dependency>
</dependencies>

Conclusion

In this article, you learned several ways of converting an InputStream to String. Depending on your circumstances, you can pick the most appropriate one for your needs.

How do I check if a checkbox is checked in jQuery?

1. Introduction

A frequent concern when using jQuery for web development is: how to check if a checkbox is checked? There are multiple ways of accomplishing this task which we illustrate with examples in this article.

2. Using the DOM property “checked”

A checkbox is of the type HTMLInputElement and has a property called “checked” whose value is “true” if it is checked.

In the following example, this refers to the checkbox DOM element whose property “checked” is being checked. The event handler is invoked after the element is clicked and the function outputs the current state of the checkbox:

$('#check').change(function(ev) {
   if ( this.checked ) console.log('checked');
   else console.log('not checked');
});

Since no jQuery specific features are being used in this example, you can also write this as follows (without using jQuery):

document.getElementById('check').addEventListener('change', function() {
   if ( this.checked ) console.log('checked');
   else console.log('not checked');
});

3. Using jQuery is(':checked')

jQuery provides an .is(':checked') method which can be used as follows:

$('#check').change(function(ev) {
    if ( $(this).is(':checked') ) console.log('checked');
    else console.log('not checked');
});

If multiple checkboxes match the jQuery selector, this method can be used to check if at least one is checked. Consider this HTML which sets up a bunch of checkboxes with the same class.

<label class="checkbox-inline">
<input id="checkboxRed" type="checkbox"
   class="colorCheck"
   value="red"> Red
</label>
<label class="checkbox-inline">
<input id="checkboxGreen" type="checkbox"
   class="colorCheck"
   value="green"> Green
</label>
<label class="checkbox-inline">
<input id="checkboxBlue" type="checkbox"
   class="colorCheck"
   value="blue"> Blue
</label>

The following code prints true if at least one is checked:

$('.colorCheck').change(function(ev) {
    console.log('any checked? ' + $('.colorCheck').is(':checked'));
});

4. Using jQuery .prop()

Another method of checking the state using jQuery is to use the .prop() method:

$('#check').change(function(ev) {
    if ( $(this).prop('checked') ) console.log('4. checked');
    else console.log('4. not checked');
});

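Note that when the selector matches several checkboxes, .prop() returns the property of the first matched element only. To test whether at least one of the checkboxes is checked, use .is(':checked') instead:

$('.colorCheck').change(function(ev) {
    console.log('any checked? ' + $('.colorCheck').is(':checked'));
});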

5. Using jQuery “checked” Selector

In the case when you have multiple related checkboxes, you can get a list of the checked ones by doing:

var boxes = $('.colorCheck:checked');
boxes.each(function(i, chkbox) {
  // process each checked box here
});

// How many are checked
console.log('Checked #' + boxes.length);

Conclusion

In this article, we presented various ways of checking the status of a checkbox: whether using jQuery or without. Depending on circumstances, you may prefer one method over another.

What Characters Need to be Escaped in XML Documents?

1. Introduction

Some characters are treated specially when processing XML documents. These are the characters which are used to markup XML syntax; when they appear as a part of a document rather than for syntax markup, they need to be appropriately escaped. These characters are:

"	&quot;		Double quote
'	&apos;		Single quote
<	&lt;		Left angle bracket
>	&gt;		Right angle bracket
&	&amp;		Ampersand
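When generating XML from code, these replacements are typically applied by a small helper. The following is a minimal sketch in Java (the language used in most articles on this page). The class XmlEscape and its escape method are hypothetical names, and the helper conservatively escapes all five characters even in places where XML would allow some of them unescaped.

```java
class XmlEscape {
    // Replaces the five special characters with their predefined entities.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b && \"c\""));
        // prints a &lt; b &amp;&amp; &quot;c&quot;
    }
}
```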

2. Character Data

All text that is not markup constitutes character data of the document. Within character data, “&” and “<” must not appear except when used in markup. However “>”, “"” and “'” can appear directly within character data without having to be encoded.

<?xml version="1.0"?>
<valid>"'></valid>

<?xml version="1.0"?>
<invalid>&</invalid>

<?xml version="1.0"?>
<invalid><</invalid>

However, “>” cannot appear in the form “]]>” unless it marks the end of a CDATA section.

<?xml version="1.0"?>
<invalid>]]></invalid>

The following is valid as “>” has been encoded properly:

<?xml version="1.0"?>
<valid>]]&gt;</valid>

3. Attributes

Within attributes, “>” is valid. (As in character data, “<” and “&” must still be escaped.)

<valid attrib=">"></valid>

However, “"” is not valid within double quotes; it must be encoded using “&quot;”.

<valid attrib="&quot;"></valid>

Similarly “'” is not valid within single quotes. It must be encoded as “&apos;”:

<valid attrib='&apos;'></valid>

4. Comments

Comments can appear anywhere in a document outside of other markup. Within comments, none of the 5 special characters needs to be escaped or encoded.

<valid><!-- '"<>& --></valid>

However, the string “--” (two hyphens) must not appear within comments.

<invalid><!-- Hello -- there --></invalid>

A consequence of this rule is that a comment must not end with “--->” (three dashes followed by a right-angle-bracket). The following is invalid:

<invalid><!-- Hello there ---></invalid>

5. Processing Instructions

Processing Instructions (PI) are used to add instructions for XML processing applications. None of the 5 special characters needs to be encoded within PI statements.

<valid><?execute <>"'& ?></valid>

A further restriction in the case of processing instructions is that the instruction name must not be the string “xml”; this name is reserved for standardization of the XML specification itself.

<invalid><?xml hello ?></invalid>

6. CData

CDATA sections are used to escape blocks of text containing characters which would otherwise be recognized as markup. These sections begin with the string “<![CDATA[” and end with the string “]]>”. Within a CData section, none of the 5 special characters needs to be encoded.

<valid><![CDATA[<greeting>Hello world: <>&'" </greeting>]]></valid>

However, within CData sections, the string “]]>” must not appear except to end the section:

<invalid><![CDATA[This string("]]>") must not appear here]]></invalid>

The same can be re-written with two CData sections as follows:

<valid><![CDATA[This string("]]]]><![CDATA[>") must not appear here]]></valid>

Conclusion

This article demonstrated what the predefined XML entities are and the various circumstances in which they can be used.

Parsing XML in Python

1. Introduction

XML can be parsed in python using the xml.etree.ElementTree library. This article shows you how to parse and extract elements, attributes and text from XML using this library.

While this library is easy to use, it loads the whole XML document into memory and hence may not be suitable for processing large XML files or where run-time memory is an issue.

2. Sample XML

Here is a short snippet of the sample XML we will be working with.

<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>An in-depth look at creating applications 
    with XML.</description>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
...

3. Getting Started

3.1 Parsing an XML File

It is very simple to parse an XML file using ElementTree. The following returns an ElementTree object which represents the whole XML document.

import xml.etree.ElementTree as ET

tree = ET.parse('catalog.xml')

Once you have the ElementTree object, you can get the root Element object using the tree.getroot() method:

root = tree.getroot()
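To try the two steps above without having catalog.xml on disk, here is a self-contained sketch. It assumes an abbreviated copy of the sample (only fields shown in the snippet above) and writes it to a temporary file first; the variable names are illustrative.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Abbreviated stand-in for catalog.xml (assumption: two books, two fields each).
SAMPLE = """<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
  </book>
</catalog>
"""

# Write the sample to a temporary file so ET.parse() has a path to read.
with tempfile.NamedTemporaryFile('w', suffix='.xml', delete=False) as f:
    f.write(SAMPLE)
    path = f.name

try:
    tree = ET.parse(path)     # ElementTree object
    root = tree.getroot()     # root Element
finally:
    os.remove(path)

print(root.tag)     # tag name of the root element
print(len(root))    # number of direct children
```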

3.2 Parsing an XML String

We use the ElementTree.fromstring() function to parse an XML string. It returns the root Element directly: a subtle difference compared with ElementTree.parse(), which returns an ElementTree object.

print(ET.fromstring('<a><b>1</b>2</a>'))
# prints "<Element 'a' at ...>"
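The difference in return types can be checked directly. This is a quick sketch that parses the same text both ways; io.StringIO is used here only as a stand-in for a real file, since ET.parse() also accepts a file object.

```python
import io
import xml.etree.ElementTree as ET

xml_text = '<a><b>1</b>2</a>'

root_from_string = ET.fromstring(xml_text)         # an Element
tree_from_file = ET.parse(io.StringIO(xml_text))   # an ElementTree

print(isinstance(root_from_string, ET.Element))
print(isinstance(tree_from_file, ET.ElementTree))

# Both routes lead to the same root element once getroot() is called.
print(tree_from_file.getroot().tag == root_from_string.tag)
```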

4. Working with Elements

From the Element object, you can get the XML tag name using the tag property.

print(root.tag)    # outputs "catalog"

Get a list of child Elements from any Element object using element.findall('*'). You can find specific children by name by passing the tag name, e.g. element.findall('book').

For example, the following code recursively processes all the elements in the XML and prints the name of the tag.

def show(elem):
    print(elem.tag)
    for child in elem.findall('*'):
        show(child)

show(root)

The above code can be modified to show nicely indented output of the tag names:

def show(elem, indent=0):
    print(' ' * indent + elem.tag)
    for child in elem.findall('*'):
        # recurse into the child, one level deeper
        show(child, indent + 1)

show(root)

To find a single element by name, use elem.find(tagName):

print(root.find('book').tag)    # prints "book"
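One thing to watch for: find() returns None when nothing matches, so calling .tag on the result can fail. Below is a small sketch with a made-up helper, child_tag, and a cut-down catalog (both are assumptions for illustration).

```python
import xml.etree.ElementTree as ET

# Cut-down sample (assumption: only ids and titles are kept).
root = ET.fromstring(
    '<catalog>'
    '<book id="bk101"><title>XML Developer\'s Guide</title></book>'
    '<book id="bk102"><title>Midnight Rain</title></book>'
    '</catalog>'
)

def child_tag(parent, name):
    """Hypothetical helper: tag of the first child called name, else None."""
    child = parent.find(name)
    # Compare with None explicitly; do not rely on the element's truth value.
    return child.tag if child is not None else None

print(child_tag(root, 'book'))
print(child_tag(root, 'magazine'))
print(len(root.findall('book')))
```

The explicit "is not None" test matters because an Element with no children has historically evaluated as false.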

5. Working with Attributes

XML attributes can be extracted from an Element object using the element.items() method, which returns a sequence of (name, value) pairs. (The order of the pairs is not guaranteed, so do not rely on it matching the order in which the attributes appear in the XML.)

for attrName, attrValue in elem.items():
    print(attrName + '=' + attrValue)

To retrieve a single attribute value by name, use the elem.get(attrName) method:

print(root.find('book').get('id'))    # prints "bk101"

Get a list of all attribute names defined on the element using elem.keys(). It returns an empty list if no attributes are defined.

print(root.find('book').keys())
# prints ['id']

To get all the attributes as a Python dictionary, use the elem.attrib property:

print(root.find('book').attrib)
# prints {'id': 'bk101'}
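Putting the attribute methods together, here is a sketch that builds a lookup table from book id to title. The cut-down catalog and the name titles_by_id are assumptions for the example. It also shows that get() accepts a default for attributes that are missing.

```python
import xml.etree.ElementTree as ET

# Cut-down sample (assumption: only ids and titles are kept).
root = ET.fromstring(
    '<catalog>'
    '<book id="bk101"><title>XML Developer\'s Guide</title></book>'
    '<book id="bk102"><title>Midnight Rain</title></book>'
    '</catalog>'
)

# Map each book's id attribute to the text of its title child.
titles_by_id = {}
for book in root.findall('book'):
    titles_by_id[book.get('id')] = book.find('title').text

print(titles_by_id['bk102'])

# The root has no attributes: get() yields None, or the supplied default.
print(root.get('id'))
print(root.get('id', 'none'))
print(root.attrib)
```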

6. Retrieving Element Text Content

You can retrieve an element’s text content using the elem.text property as follows:

print(ET.fromstring('<a>Hello<b>1</b>2</a>').text)
# prints "Hello"

print(ET.fromstring('<a>Hello<b>1</b>2</a>').find('b').text)
# prints "1"

Text appearing after the element’s end tag is retrieved using the elem.tail property:

print(ET.fromstring('<a>Hello<b>1</b>2</a>').find('b').tail)
# prints "2"
# prints "2"
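To see how text and tail fit together, the sketch below prints each piece of the same small document and then collects all of the text in document order with itertext(). The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<a>Hello<b>1</b>2</a>')
b = a.find('b')

print(repr(a.text))   # text before the first child of <a>
print(repr(b.text))   # text inside <b>
print(repr(b.tail))   # text after </b>, still inside <a>
print(repr(a.tail))   # nothing follows </a>, so this is None

# itertext() walks text and tail values in document order.
all_text = ''.join(a.itertext())
print(all_text)
```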

7. Using Path to Extract Content

Some of the Element object methods support extracting content using a syntax similar to XPath (only a subset of XPath is supported).

Retrieve a descendant element:

print(root.find('book/author').text)
# prints "Gambardella, Matthew"

To obtain the text of the first matching element, use the elem.findtext() method as follows:

print(root.findtext('book/author'))
# prints "Gambardella, Matthew"

Retrieve and process a list of matching elements using elem.findall():

for e in root.findall('book/author'):
    print(e.text)

# prints the following
Gambardella, Matthew
Ralls, Kim
Corets, Eva
Corets, Eva
Corets, Eva
Randall, Cynthia
Thurman, Paula
Knorr, Stefan
Kress, Peter
O'Brien, Tim
O'Brien, Tim
Galos, Mike

Find a specific element by position. (Position indexes start with 1).

print(root.find('book[2]/author').text)
# prints "Ralls, Kim"

Here is an example to find and concatenate all text content using a reduce operation:

from functools import reduce    # needed on Python 3; harmless on Python 2.6+

print(reduce(lambda x, y: x + '|' + y.text, root.findall("book/author"), ''))

# prints "|Gambardella, Matthew|Ralls, Kim|Corets, Eva|Corets, Eva|Corets, Eva|Randall, Cynthia|Thurman, Paula|Knorr, Stefan|Kress, Peter|O'Brien, Tim|O'Brien, Tim|Galos, Mike"
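As an alternative to reduce, str.join() gives the same kind of concatenation without the leading separator. The sketch below uses a cut-down two-book catalog (an assumption for the example) and also shows two other path features: an attribute predicate, and findtext() with a default for paths that match nothing.

```python
import xml.etree.ElementTree as ET

# Cut-down sample (assumption: two books, author and title only).
root = ET.fromstring(
    '<catalog>'
    '<book id="bk101"><author>Gambardella, Matthew</author>'
    '<title>XML Developer\'s Guide</title></book>'
    '<book id="bk102"><author>Ralls, Kim</author>'
    '<title>Midnight Rain</title></book>'
    '</catalog>'
)

# Join all author names with a separator.
authors = '|'.join(e.text for e in root.findall('book/author'))
print(authors)

# Select a book by attribute value, then step down to its title.
title = root.find("book[@id='bk102']/title").text
print(title)

# findtext() returns the default when the path matches nothing.
publisher = root.findtext('book/publisher', default='unknown')
print(publisher)
```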

Conclusion

This article demonstrated some aspects of parsing XML with Python. We showed how to parse an XML file or an XML string and extract elements, attributes and text content.
