XML has arrived. Configuration files, application file formats, even database access layers make use of XML-based documents. Fortunately, several high-quality implementations of the standard APIs for handling XML are available. Unfortunately, these APIs are large and therefore present a formidable hurdle for the beginner.
In this article, I would like to offer an accessible introduction to the two most widely used APIs: SAX and DOM. For each API, I will show a sample application that reads an XML document and turns it into a set of Java objects representing the data in the document, a process known as XML "unmarshalling."
First, a word on style. For instructional purposes, I have kept the code as simple as possible. In order to focus on the basic usage of SAX and DOM, I have completely omitted error handling and support for XML namespaces, among other things. Furthermore, the code has not been tuned for flexibility or elegance; it may be dull, but hopefully it is also obvious.
For those completely new to XML, I would like to review the most important terms and concepts used with XML data.
Each XML document starts with a prologue, followed by the actual document content. The prologue begins with an XML declaration, such as:
<?xml version="1.0" standalone="yes" ?>
The declaration must be at the very beginning of the document -- not
even whitespace may precede it! It is followed by the
document type declaration, which in the
present case only names the root element (catalog), but
in a real-world application would typically also reference a constraint on the
document's structure, most commonly a Document Type Definition (DTD). (An XML
Schema document, by contrast, is usually referenced through attributes on the
root element rather than through the document type declaration.) In its
simplest form, the document type declaration looks like this:
<!DOCTYPE catalog>
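A declaration that also references an external DTD might look like this (the file name catalog.dtd is purely illustrative):
<!DOCTYPE catalog SYSTEM "catalog.dtd">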
This concludes the prologue. The following body of the XML document is made up of elements, which take the role of (and look like) familiar HTML tags. Every element has a name, and may have an arbitrary number of attributes:
<catalog version="1.0">...</catalog>
Here catalog is the name of the element, having one attribute
named version, with value 1.0. In contrast to
HTML, XML element names are case sensitive and must be closed with the
appropriate closing tag. Note that there must be no space between the opening angle bracket and the element name.
If the element contains neither text nor other elements, the closing tag
may be merged with the start tag (a so-called empty tag):
<catalog version="1.0" />
An element may contain text, other elements, or a combination
of both. Text may include entity references, similar to those in HTML. In short, an entity reference is a placeholder for another
piece of data. Entity references are often used to include special characters, such
as the angle brackets: < or >. They
consist of an ampersand, followed by the entity name and a semicolon:
&entityname;
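For example, XML predefines the entity amp for the ampersand itself and lt for the opening angle bracket, so an element such as
<title>Tom &amp; Jerry</title>
is reported by the parser as containing the text Tom & Jerry. (The element content here is purely illustrative.)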
XML elements have to be properly nested; in particular, the opening and closing tags of different elements must not overlap. In other words, an element's opening and end tags must reside in the same parent. This establishes a clear parent/child relationship among all elements of an XML document. Finally, the outermost element (the one following the prologue) is called the root element.
An element name may be qualified by an XML namespace prefix,
yielding a qualified name, or qName. The prefix, which is bound to a namespace identified by a Uniform Resource Identifier (URI), precedes the local name and is separated from it by a colon:
prefix:localname
A document following these rules is syntactically well-formed. This is to be distinguished from its validity, which refers to adherence to the constraint laid out in the DTD or XML Schema document. Note that for a document that does not specify a constraint (such as the example document below), the concept of validity makes no sense.
The document to read describes the catalog of a library. The catalog may contain an arbitrary number of books and magazines. Each book has a title and exactly one author. Each magazine has a name and may contain an arbitrary number of articles. Finally, each article has a headline and a starting page.
<?xml version="1.0"?>
<catalog library="somewhere">
  <book>
    <author>Author 1</author>
    <title>Title 1</title>
  </book>
  <book>
    <author>Author 2</author>
    <title>His One Book</title>
  </book>
  <magazine>
    <name>Mag Title 1</name>
    <article page="5">
      <headline>Some Headline</headline>
    </article>
    <article page="9">
      <headline>Another Headline</headline>
    </article>
  </magazine>
  <book>
    <author>Author 2</author>
    <title>His Other Book</title>
  </book>
  <magazine>
    <name>Mag Title 2</name>
    <article page="17">
      <headline>Second Headline</headline>
    </article>
  </magazine>
</catalog>
Note that the starting page is encoded as an attribute of the
article element. This is done primarily to demonstrate the
use of attributes, although it can be argued that this design decision is
actually semantically justified, since the starting page of an article
is information about the article, but not part of the article
itself.
In the example text, the following elements (called "complex elements" for the purpose of this article) may contain other elements: <catalog>, <book>, <magazine>, and <article>. The "simple" elements are those that contain only text: <author>, <title>, <name>, and <headline>. There are no elements that contain both text and child elements simultaneously.
The complex elements are represented in the application code by
classes, whereas the simple elements are java.lang.String
member variables of these classes. Since the sole purpose of these
classes is to bundle the data read from the document, their interface
has been kept minimal: they can be instantiated, their data members
can be set, and finally, they override the toString()
method, so as to allow access to the data inside.
class Catalog {
    private Vector books;
    private Vector magazines;
    public Catalog() {
        books = new Vector();
        magazines = new Vector();
    }
    public void addBook( Book rhs ) {
        books.addElement( rhs );
    }
    public void addMagazine( Magazine rhs ) {
        magazines.addElement( rhs );
    }
    public String toString() {
        String newline = System.getProperty( "line.separator" );
        StringBuffer buf = new StringBuffer();
        buf.append( "--- Books ---" ).append( newline );
        for( int i=0; i<books.size(); i++ ){
            buf.append( books.elementAt(i) ).append( newline );
        }
        buf.append( "--- Magazines ---" ).append( newline );
        for( int i=0; i<magazines.size(); i++ ){
            buf.append( magazines.elementAt(i) ).append( newline );
        }
        return buf.toString();
    }
}
// --------------------------------------------------------------
class Book {
    private String author;
    private String title;
    public Book() {}
    public void setAuthor( String rhs ) { author = rhs; }
    public void setTitle( String rhs ) { title = rhs; }
    public String toString() {
        return "Book: Author='" + author + "' Title='" + title + "'";
    }
}
// --------------------------------------------------------------
class Magazine {
    private String name;
    private Vector articles;
    public Magazine() {
        articles = new Vector();
    }
    public void setName( String rhs ) { name = rhs; }
    public void addArticle( Article a ) {
        articles.addElement( a );
    }
    public String toString() {
        StringBuffer buf = new StringBuffer( "Magazine: Name='" + name + "' ");
        for( int i=0; i<articles.size(); i++ ){
            buf.append( articles.elementAt(i).toString() );
        }
        return buf.toString();
    }
}
// --------------------------------------------------------------
class Article {
    private String headline;
    private String page;
    public Article() {}
    public void setHeadline( String rhs ) { headline = rhs; }
    public void setPage( String rhs ) { page = rhs; }
    public String toString() {
        return "Article: Headline='" + headline + "' on page='" + page + "' ";
    }
}
The classes have not been declared public and therefore have package visibility. The primary consequence of this is that all of them can be defined in the same source file. (To remove possible confusion: the variable name rhs used in the setter methods stands for right-hand side -- a very convenient naming convention for assignments!)
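As a quick illustration of how these data classes fit together (this snippet is not part of the article's code), a catalog could be assembled and printed by hand; the values are taken from the example document:
Catalog catalog = new Catalog();
Book book = new Book();
book.setAuthor( "Author 1" );
book.setTitle( "Title 1" );
catalog.addBook( book );
Magazine magazine = new Magazine();
magazine.setName( "Mag Title 1" );
Article article = new Article();
article.setHeadline( "Some Headline" );
article.setPage( "5" );
magazine.addArticle( article );
catalog.addMagazine( magazine );
System.out.println( catalog );   // uses Catalog.toString()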
SAX, the Simple API for XML, defines a traditional, event-driven parser. A SAX parser
reads the XML document incrementally, calling certain callback
functions in the application code whenever it recognizes a
token. Callback events are generated for the beginning and the end of
a document, the beginning and end of an element, and so on. They are defined
in the interface org.xml.sax.ContentHandler, which every
SAX-based document handler class must implement. It is the
responsibility of the application programmer to implement these
callback functions. Often, the application may not care about certain
events reported by the SAX parser. For these cases, there exists a
convenience class, org.xml.sax.helpers.DefaultHandler,
which provides empty implementations for all functions defined in
ContentHandler; custom classes simply extend
DefaultHandler and need only override those callbacks
in which they are specifically interested. This is done in the code
below.
At the heart of a program (or class) utilizing the SAX parser typically lies a stack. Whenever an element is started, a new data object of the appropriate type is pushed onto the stack. Later, when the element is closed, the topmost object on the stack has been finished and can be popped. Unless it has been the root element (in which case the stack will be empty after it has been popped), the most recently popped element will have been a child element of the object that now occupies the top position of the stack, and can be inserted into its parent object. This process corresponds to the shift-reduce cycle of bottom-up parsers. Note how the requirement that XML elements must not overlap is crucial for the proper functioning of this idiom.
Example 1. Unmarshalling with SAX.
class SaxCatalogUnmarshaller extends DefaultHandler {
    private Catalog catalog;
    private Stack stack;
    private boolean isStackReadyForText;
    private Locator locator;
    // -----
    public SaxCatalogUnmarshaller() {
        stack = new Stack();
        isStackReadyForText = false;
    }
    public Catalog getCatalog() { return catalog; }
    // ----- callbacks: -----
    public void setDocumentLocator( Locator rhs ) { locator = rhs; }
    // -----
    public void startElement( String uri, String localName, String qName,
                              Attributes attribs ) {
        isStackReadyForText = false;
        // if next element is complex, push a new instance on the stack
        // if element has attributes, set them in the new instance
        if( localName.equals( "catalog" ) ) {
            stack.push( new Catalog() );
        }else if( localName.equals( "book" ) ) {
            stack.push( new Book() );
        }else if( localName.equals( "magazine" ) ) {
            stack.push( new Magazine() );
        }else if( localName.equals( "article" ) ) {
            stack.push( new Article() );
            String tmp = resolveAttrib( uri, "page", attribs, "unknown" );
            ((Article)stack.peek()).setPage( tmp );
        }
        // if next element is simple, push StringBuffer
        // this makes the stack ready to accept character text
        else if( localName.equals( "title" ) || localName.equals( "author" ) ||
                 localName.equals( "name" ) || localName.equals( "headline" ) ) {
            stack.push( new StringBuffer() );
            isStackReadyForText = true;
        }
        // if none of the above, it is an unexpected element
        else{
            // do nothing
        }
    }
    // -----
    public void endElement( String uri, String localName, String qName ) {
        // recognized text is always content of an element
        // when the element closes, no more text should be expected
        isStackReadyForText = false;
        // pop stack and add to 'parent' element, which is next on the stack
        // important to pop stack first, then peek at top element!
        Object tmp = stack.pop();
        if( localName.equals( "catalog" ) ) {
            catalog = (Catalog)tmp;
        }else if( localName.equals( "book" ) ) {
            ((Catalog)stack.peek()).addBook( (Book)tmp );
        }else if( localName.equals( "magazine" ) ) {
            ((Catalog)stack.peek()).addMagazine( (Magazine)tmp );
        }else if( localName.equals( "article" ) ) {
            ((Magazine)stack.peek()).addArticle( (Article)tmp );
        }
        // for simple elements, pop StringBuffer and convert to String
        else if( localName.equals( "title" ) ) {
            ((Book)stack.peek()).setTitle( tmp.toString() );
        }else if( localName.equals( "author" ) ) {
            ((Book)stack.peek()).setAuthor( tmp.toString() );
        }else if( localName.equals( "name" ) ) {
            ((Magazine)stack.peek()).setName( tmp.toString() );
        }else if( localName.equals( "headline" ) ) {
            ((Article)stack.peek()).setHeadline( tmp.toString() );
        }
        // if none of the above, it is an unexpected element:
        // necessary to push popped element back!
        else{
            stack.push( tmp );
        }
    }
    // -----
    public void characters( char[] data, int start, int length ) {
        // if stack is not ready, data is not content of recognized element
        if( isStackReadyForText == true ) {
            ((StringBuffer)stack.peek()).append( data, start, length );
        }else{
            // read data which is not part of recognized element
        }
    }
    // -----
    private String resolveAttrib( String uri, String localName,
                                  Attributes attribs, String defaultValue ) {
        String tmp = attribs.getValue( uri, localName );
        return (tmp!=null)?(tmp):(defaultValue);
    }
}
Of the various callback methods declared in the
ContentHandler interface, only four are implemented
here. In unmarshalling a document, we are primarily interested in the
contents that are encoded in it. Therefore, the relevant events are the
beginning and end of an element, and the occurrence of raw character
data inside an element. We also implement the
setDocumentLocator() method. Although not used in the
application code, it can be very helpful in debugging. The
org.xml.sax.Locator interface acts like a cursor,
pointing to the position in the XML document where the last event
occurred. It provides useful methods such as
getLineNumber() and getColumnNumber().
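For example, inside one of the callbacks the stored locator could be used to report where a problem occurred; a minimal sketch (not part of the example program):
// hypothetical diagnostic output using the stored Locator
System.err.println( "Unexpected element '" + qName + "' at line "
                    + locator.getLineNumber()
                    + ", column " + locator.getColumnNumber() );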
When the startElement() function is called, the SAX parser
passes it a number of arguments. The first three are (in order): the namespace
URI, the local name, and the fully qualified name of the element. By default,
only the URI and the local name need to be supplied, while the qualified
name is optional. Since the catalog document does not introduce any XML
namespaces, we only use the local name in the present application.
The last argument holds the attributes of the present element (if any) in a specific container, which allows retrieval of the attributes by their names, as well as iteration over all attributes using an integer index.
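To illustrate the index-based style of access, a small helper (not part of the example program, which only retrieves attributes by name) could print every attribute of the current element:
// hypothetical helper: list all attributes of the current element by index
private void dumpAttributes( Attributes attribs ) {
    for( int i = 0; i < attribs.getLength(); i++ ) {
        System.out.println( attribs.getLocalName( i )
                            + "='" + attribs.getValue( i ) + "'" );
    }
}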
Elements are recognized by their local names. If the current element is
a complex element, an object of the appropriate type is instantiated and
pushed onto the stack. If the current element is simple, a new StringBuffer is pushed onto the stack instead, ready to accept character data.
Finally, the <article> element has an attribute, which is read from the attribs argument and inserted into the
newly created article object on top of the stack. The attribute
is extracted using the convenience function resolveAttrib(),
which returns the attribute value or a default text, if the attribute is
missing.
The endElement() function is called with essentially the same
arguments as the startElement() function; only the list of attributes
is missing. In any case, the topmost element on the stack is popped,
converted to the proper type, and inserted into its parent, which now occupies
the top of the stack. Only the root element, which has no parent, is treated
differently.
Finally, the callback function named characters() is called
when the parser encounters raw text. It is passed a char array,
containing the actual data, as well as a position at which to start reading and
the length of data to be read from the array. Of course, it is illegal
to access the data array outside of those boundaries.
The implementation of the callback method inserts the data into the
StringBuffer on the stack.
The way the characters() function is called by the underlying
SAX parser often leads to some initial confusion, for two reasons.
Firstly, there is no guarantee that a stretch of contiguous data results
in only a single call to characters() -- it would be perfectly
legal for the parser to invoke the callback function for each individual
character of text! Although this is certainly an extreme scenario, it is
quite common for text with embedded entity references to result in several calls
to characters(): one for the text before the reference, a separate
call for the entity itself, and finally, one for the remaining text.
This is the reason that a StringBuffer is pushed on the stack
if a simple element is encountered when reading the example document.
(In fact, using a StringBuffer with the characters()
callback function is a common idiom when using the SAX API.)
The second reason that characters() can lead to confusion
results from the fact that it is called for all text characters
encountered by the parser, including whitespace, even the whitespace
between element tags (such as newlines and tabs). This is surprising,
since ContentHandler defines a special callback method
ignorableWhitespace(), taking the same arguments as
characters(). However, without a DTD or XML Schema, this
method is never called, since there is no way for the parser to
distinguish whether some whitespace is ignorable or not.
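If the document did reference a DTD, the handler could also override ignorableWhitespace(); a minimal sketch, assuming the parser is then able to classify the whitespace:
// with a DTD present, ignorable inter-element whitespace is reported here
public void ignorableWhitespace( char[] data, int start, int length ) {
    // nothing to do for this application
}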
In the present example program, the boolean flag isStackReadyForText
serves to distinguish character data that belongs to a simple element from such incidental whitespace: the stack only becomes ready to accept
text after a simple element has started and before it has ended.
The Document Object Model (DOM) describes an XML document as a tree-like
structure, with every XML element being a node in the tree. A DOM-based
parser reads the entire document, and (at least in principle)
forms the corresponding document tree in memory. The DOM tree is formed
from classes that all implement the org.w3c.dom.Node interface.
This interface provides functions to walk or modify the tree (such as
getChildNodes(), or appendChild() and
removeChild()), and, of course, methods to query each node
for its name and value.
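To illustrate these methods independently of the unmarshalling code below, the following sketch (not part of the example program) walks an arbitrary DOM tree and prints the name of every element, indented by its depth:
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomTreePrinter {
    // recursively visit every node, printing element names indented by depth
    static void walk( Node node, int depth ) {
        if( node.getNodeType() == Node.ELEMENT_NODE ) {
            for( int i = 0; i < depth; i++ ) {
                System.out.print( "  " );
            }
            System.out.println( node.getNodeName() );
        }
        NodeList children = node.getChildNodes();
        for( int i = 0; i < children.getLength(); i++ ) {
            walk( children.item( i ), depth + 1 );
        }
    }
}
Called on the Document node of the example catalog, this would print catalog followed by the indented book, magazine, and article elements.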
The present unmarshalling code does not need to modify the DOM tree.
The tree traversal itself is essentially recursive: the root node is
unmarshalled, then each of its child nodes (which are either of type
book or magazine), and, in the case of the
magazine, its children (article). Whenever a child node
has been unmarshalled, the resulting object representation of that
node is inserted into the parent object.
Example 2. Unmarshalling with DOM.
class DomCatalogUnmarshaller {
    public DomCatalogUnmarshaller() { }
    // -----
    public Catalog unmarshallCatalog( Node rootNode ) {
        Catalog c = new Catalog();
        Node n;
        NodeList nodes = rootNode.getChildNodes();
        for( int i=0 ; i<nodes.getLength(); i++ ){
            n = nodes.item( i );
            if( n.getNodeType() == Node.ELEMENT_NODE ){
                if( n.getNodeName().equals( "book" ) ) {
                    c.addBook( unmarshallBook( n ) );
                }else if( n.getNodeName().equals( "magazine" ) ){
                    c.addMagazine( unmarshallMagazine( n ) );
                }else{
                    // unexpected element in Catalog
                }
            }else{
                // unexpected node-type in Catalog
            }
        }
        return c;
    }
    // -----
    private Book unmarshallBook( Node bookNode ) {
        Book b = new Book();
        Node n;
        NodeList nodes = bookNode.getChildNodes();
        for( int i=0 ; i<nodes.getLength(); i++ ){
            n = nodes.item( i );
            if( n.getNodeType() == Node.ELEMENT_NODE ){
                if( n.getNodeName().equals( "author" ) ){
                    b.setAuthor( unmarshallText( n ) );
                }else if( n.getNodeName().equals( "title" ) ){
                    b.setTitle( unmarshallText( n ) );
                }else{
                    // unexpected element in Book
                }
            }else{
                // unexpected node-type in Book
            }
        }
        return b;
    }
    // -----
    private Magazine unmarshallMagazine( Node magazineNode ) {
        Magazine m = new Magazine();
        Node n;
        NodeList nodes = magazineNode.getChildNodes();
        for( int i=0 ; i<nodes.getLength(); i++ ){
            n = nodes.item( i );
            if( n.getNodeType() == Node.ELEMENT_NODE ){
                if( n.getNodeName().equals( "name" ) ) {
                    m.setName( unmarshallText( n ) );
                }else if( n.getNodeName().equals( "article" ) ) {
                    m.addArticle( unmarshallArticle( n ) );
                }else{
                    // unexpected element in Magazine
                }
            }else{
                // unexpected node-type in Magazine
            }
        }
        return m;
    }
    // -----
    private Article unmarshallArticle( Node articleNode ) {
        Article a = new Article();
        if( articleNode.hasAttributes() == true ) {
            a.setPage( unmarshallAttribute( articleNode, "page", "unknown" ) );
        }
        Node n;
        NodeList nodes = articleNode.getChildNodes();
        for( int i=0 ; i<nodes.getLength(); i++ ){
            n = nodes.item( i );
            if( n.getNodeType() == Node.ELEMENT_NODE ){
                if( n.getNodeName().equals( "headline" ) ) {
                    a.setHeadline( unmarshallText( n ) );
                }else{
                    // unexpected element in Article
                }
            }else{
                // unexpected node-type in Article
            }
        }
        return a;
    }
    // -----
    private String unmarshallText( Node textNode ) {
        StringBuffer buf = new StringBuffer();
        Node n;
        NodeList nodes = textNode.getChildNodes();
        for( int i=0; i<nodes.getLength(); i++ ){
            n = nodes.item( i );
            if( n.getNodeType() == Node.TEXT_NODE ) {
                buf.append( n.getNodeValue() );
            }else{
                // expected a text-only node!
            }
        }
        return buf.toString();
    }
    // -----
    private String unmarshallAttribute( Node node,
                                        String name, String defaultValue ){
        Node n = node.getAttributes().getNamedItem( name );
        return (n!=null)?(n.getNodeValue()):(defaultValue);
    }
}
There are subtypes of the Node interface representing
elements, text, comments, entities, and many others. The tree model, by
which each part of the document is represented as a Node,
is followed very consistently. Character data, for instance, is
considered a child of its enclosing Element and is
represented by its own Text instance, which has to be
queried using getNodeValue() to find the actual string.
The Node supertype offers getNodeName(),
getNodeValue(), and getAttributes() to
provide access to information about a Node instance
without having to downcast it.
Not all three of these methods make
sense for every node type, however. For instance, only an
Element can have attributes; for all other Node
subtypes the corresponding function returns null.
For Element nodes, getNodeName() returns the
tag name, but getNodeValue() returns null. In contrast,
for a Text node, getNodeValue() returns
the character data, while getNodeName() returns the fixed
string "#text". The W3C DOM specification
contains a table detailing the behavior of all three functions for every
possible node type.
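For the three node types relevant here, the behavior can be summarized as follows (per the DOM Level 2 Core specification):
Element: getNodeName() returns the tag name, getNodeValue() returns null, getAttributes() returns the element's attributes.
Text: getNodeName() returns "#text", getNodeValue() returns the character data, getAttributes() returns null.
Attr: getNodeName() returns the attribute name, getNodeValue() returns the attribute value, getAttributes() returns null.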
In the present program, we are only interested in three kinds of
nodes: those representing elements, text, and attributes. All of the
unmarshalling functions are very similar to each other. They accept
the topmost node of the subtree they are to unmarshall as an argument.
Then they create an object representing the current node and
iterate over its child nodes, unmarshalling each in turn. If a
child node describes a complex element, the node is passed on to the
appropriate unmarshalling function, depending on the element name. A
child node of type TEXT_NODE describes a simple element,
and the node value is simply the character data.
Nodes describing attributes are a bit different, since attributes are
not really part of the document's tree structure: attributes are not
proper children of the elements in which they are contained. They can
therefore not be reached by tree-walking operations; instead, the
Node interface provides a getAttributes() function,
which returns a collection of key/value-pairs, containing the attributes.
Again, we provide a convenience function that returns a default value
in case no attribute can be found for the given name.
Finally, we need a driver class, containing static void main().
The main() function reads the API to use (SAX or DOM) and
the name of the XML file from the command line. It creates an
org.xml.sax.InputSource from the file. This class
is acceptable to both SAX and DOM as an encapsulation of an XML
document. Then it creates instances of the appropriate parser
and unmarshaller classes and passes the input source to them. Finally,
it prints the contents of the created objects to standard output.
Example 3. Driver class.
public class Driver {
    public static void main( String[] args ) {
        Catalog catalog = null;
        try {
            File file = new File( args[1] );
            InputSource src = new InputSource( new FileInputStream( file ) );
            if( args[0].equals( "SAX" ) ) {
                System.out.println( "--- SAX ---" );
                SaxCatalogUnmarshaller saxUms = new SaxCatalogUnmarshaller();
                XMLReader rdr = XMLReaderFactory.
                    createXMLReader( "org.apache.xerces.parsers.SAXParser" );
                rdr.setContentHandler( saxUms );
                rdr.parse( src );
                catalog = saxUms.getCatalog();
            }else if( args[0].equals( "DOM" ) ) {
                System.out.println( "--- DOM ---" );
                DomCatalogUnmarshaller domUms = new DomCatalogUnmarshaller();
                org.apache.xerces.parsers.DOMParser prsr =
                    new org.apache.xerces.parsers.DOMParser();
                prsr.parse( src );
                Document doc = prsr.getDocument();
                catalog = domUms.unmarshallCatalog( doc.getDocumentElement() );
            }else{
                System.out.println( "Usage: SAX|DOM filename" );
                System.exit(0);
            }
            System.out.println( catalog.toString() );
        }catch( Exception exc ) {
            System.out.println( "Usage: SAX|DOM filename" );
            System.err.println( "Exception: " + exc );
        }
    }
}
SAX and DOM are interface specifications. Implementations of these
interfaces are available from various sources (both commercial and
free), and it is part of the driver's responsibility to load the
specific parser class. The code above uses the Apache Xerces
implementations of the SAX and DOM specifications; these are freely available,
open source, high-quality implementations. Be sure that the corresponding
classes are included in your CLASSPATH.
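For example, on a Unix-like system, and assuming the Xerces jars are named xercesImpl.jar and xml-apis.jar (the exact names depend on the distribution), the compiled program could be run roughly as java -classpath .:xercesImpl.jar:xml-apis.jar Driver SAX catalog.xml, where catalog.xml contains the example document.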
The SAX specification includes a factory class that can be used to
select which SAX parser implementation will be used. After obtaining an
XMLReader instance from the factory, we register our SAX unmarshaller
with it as the application-specific content handler. Finally, we can retrieve
the unmarshalled objects from the unmarshaller instance.
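As an aside, XMLReaderFactory also provides a no-argument createXMLReader() method, which determines the driver class from the org.xml.sax.driver system property; a minimal sketch of this alternative (not used in the example program):
// select the SAX driver via a system property instead of a hard-coded argument
System.setProperty( "org.xml.sax.driver", "org.apache.xerces.parsers.SAXParser" );
XMLReader rdr = XMLReaderFactory.createXMLReader();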
As opposed to SAX, the DOM specification covers only the tree representation
of the XML document. Instantiating and using the parser is not actually
covered by DOM itself, and the specific implementation must be named
directly in the application code. After the input document has been
parsed, the resulting DOM tree can be retrieved from the parser using
the getDocument() method, which returns a Document
instance. The Document interface extends the Node
interface and represents the root node of the document. It is then used
with the appropriate unmarshaller class, similar to the SAX case.
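For completeness: the JAXP API (javax.xml.parsers) offers a vendor-neutral way to obtain the DOM tree, so that no parser implementation needs to be named in the application code. Inside the try block of the driver, the DOM branch could be written roughly like this (a sketch, not what the example program does):
// vendor-neutral alternative using JAXP; both calls throw checked exceptions,
// which are caught by the surrounding try/catch block
javax.xml.parsers.DocumentBuilder builder =
    javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse( src );
catalog = domUms.unmarshallCatalog( doc.getDocumentElement() );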
It bears repeating that the code above is for instructional purposes only. It ignores many XML structures (such as namespaces, entities, and, of course, constraints), as well as more advanced features of the parser classes (such as additional SAX callback handlers, or more powerful ways to walk and modify a DOM tree). But the most immediate omission concerns the handling of unexpected elements and similar errors. The locations in the code where these conditions should be handled are clearly marked. It can be enlightening to insert some logging code and then observe the behavior of the program after some "errors" (such as unexpected elements) have been introduced into the XML document. Finally, the document structure has been hard-coded into the program. A real-world application would need greater flexibility, or at least better diagnostics.
I hope to have demonstrated how to use either API to parse a simple XML document and turn its data into a set of Java objects. The example application is simple, but it should be enough to get you started. The references contain additional resources.
Philipp K. Janert is a software project consultant, server programmer, and architect.