Simple XML Parsing with SAX and DOM
Unmarshalling with SAX
SAX, the Simple API for XML, is a traditional, event-driven parsing API.
A SAX parser reads the XML document incrementally, calling certain callback
functions in the application code whenever it recognizes a
token. Callback events are generated for the beginning and the end of
a document, the beginning and end of an element, etc. They are defined
in the interface org.xml.sax.ContentHandler, which every
SAX-based document handler class must implement. It is the
responsibility of the application programmer to implement these
callback functions. Often, the application may not care about certain
events reported by the SAX parser. For these cases, there exists a
convenience class, org.xml.sax.helpers.DefaultHandler,
which provides empty implementations for all functions defined in
ContentHandler; custom classes simply extend
DefaultHandler and need only override those callbacks
in which they are specifically interested. This is done in Example 1
below.
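Before the full unmarshaller, the mechanism can be seen in isolation. The following is a minimal sketch, not part of the catalog application: the class name ElementCounter and its countElements() helper are made up for illustration. It overrides a single callback and inherits the empty implementations of all others from DefaultHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: only startElement() is overridden; every other
// callback keeps the empty implementation inherited from DefaultHandler.
class ElementCounter extends DefaultHandler {
    private int count = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attribs) {
        count++;
    }

    public int getCount() { return count; }

    // Parses the given XML text and returns the number of elements seen.
    static int countElements(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            ElementCounter handler = new ElementCounter();
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getCount();
        } catch (Exception e) {
            // Demo only: wrap parser and I/O failures in an unchecked exception.
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // One catalog element plus two book elements: prints 3
        System.out.println(countElements("<catalog><book/><book/></catalog>"));
    }
}
```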
At the heart of a program (or class) that uses the SAX parser typically lies a stack. Whenever an element starts, a new data object of the appropriate type is pushed onto the stack. Later, when the element closes, the topmost object on the stack is complete and can be popped. Unless it was the root element (in which case the stack is empty after the pop), the popped object is a child of the object that now occupies the top of the stack, and can be inserted into that parent. This process corresponds to the shift-reduce cycle of bottom-up parsers. Note that the requirement that XML elements must not overlap is crucial for this idiom to work.
Example 1. Unmarshalling with SAX.
import java.util.Stack;

import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class SaxCatalogUnmarshaller extends DefaultHandler {
    private Catalog catalog;
    private Stack<Object> stack;
    private boolean isStackReadyForText;
    private Locator locator;

    // -----

    public SaxCatalogUnmarshaller() {
        stack = new Stack<Object>();
        isStackReadyForText = false;
    }

    public Catalog getCatalog() { return catalog; }

    // ----- callbacks: -----

    public void setDocumentLocator( Locator rhs ) { locator = rhs; }

    // -----

    public void startElement( String uri, String localName, String qName,
                              Attributes attribs ) {
        isStackReadyForText = false;

        // if next element is complex, push a new instance on the stack
        // if element has attributes, set them in the new instance
        if( localName.equals( "catalog" ) ) {
            stack.push( new Catalog() );
        }else if( localName.equals( "book" ) ) {
            stack.push( new Book() );
        }else if( localName.equals( "magazine" ) ) {
            stack.push( new Magazine() );
        }else if( localName.equals( "article" ) ) {
            stack.push( new Article() );
            String tmp = resolveAttrib( uri, "page", attribs, "unknown" );
            ((Article)stack.peek()).setPage( tmp );
        }
        // if next element is simple, push StringBuffer
        // this makes the stack ready to accept character text
        else if( localName.equals( "title" ) || localName.equals( "author" ) ||
                 localName.equals( "name" ) || localName.equals( "headline" ) ) {
            stack.push( new StringBuffer() );
            isStackReadyForText = true;
        }
        // if none of the above, it is an unexpected element
        else{
            // do nothing
        }
    }

    // -----

    public void endElement( String uri, String localName, String qName ) {
        // recognized text is always content of an element
        // when the element closes, no more text should be expected
        isStackReadyForText = false;

        // pop stack and add to 'parent' element, which is next on the stack
        // important to pop stack first, then peek at top element!
        Object tmp = stack.pop();

        if( localName.equals( "catalog" ) ) {
            catalog = (Catalog)tmp;
        }else if( localName.equals( "book" ) ) {
            ((Catalog)stack.peek()).addBook( (Book)tmp );
        }else if( localName.equals( "magazine" ) ) {
            ((Catalog)stack.peek()).addMagazine( (Magazine)tmp );
        }else if( localName.equals( "article" ) ) {
            ((Magazine)stack.peek()).addArticle( (Article)tmp );
        }
        // for simple elements, pop StringBuffer and convert to String
        else if( localName.equals( "title" ) ) {
            ((Book)stack.peek()).setTitle( tmp.toString() );
        }else if( localName.equals( "author" ) ) {
            ((Book)stack.peek()).setAuthor( tmp.toString() );
        }else if( localName.equals( "name" ) ) {
            ((Magazine)stack.peek()).setName( tmp.toString() );
        }else if( localName.equals( "headline" ) ) {
            ((Article)stack.peek()).setHeadline( tmp.toString() );
        }
        // if none of the above, it is an unexpected element:
        // necessary to push popped element back!
        else{
            stack.push( tmp );
        }
    }

    // -----

    public void characters( char[] data, int start, int length ) {
        // if stack is not ready, data is not content of recognized element
        if( isStackReadyForText ) {
            ((StringBuffer)stack.peek()).append( data, start, length );
        }else{
            // read data which is not part of recognized element
        }
    }

    // -----

    // returns the named attribute's value, or defaultValue if it is missing
    private String resolveAttrib( String uri, String localName,
                                  Attributes attribs, String defaultValue ) {
        String tmp = attribs.getValue( uri, localName );
        return (tmp!=null)?(tmp):(defaultValue);
    }
}
Of the various callback methods declared in the
ContentHandler interface, only four are implemented
here. In unmarshalling a document, we are primarily interested in the
contents that are encoded in it. Therefore, the relevant events are the
beginning and end of an element, and the occurrence of raw character
data inside an element. We also implement the
setDocumentLocator() method. Although not used in the
application code, it can be very helpful in debugging. The
org.xml.sax.Locator interface acts like a cursor,
pointing to the position in the XML document where the last event
occurred. It provides useful methods such as
getLineNumber() and getColumnNumber().
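As an illustration of how the locator can help with debugging, here is a small sketch that records the line number at which each element starts. The class LocatingHandler and its trace() helper are hypothetical names, not part of the catalog example. The sketch checks for a missing locator, since SAX does not oblige a parser to supply one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: logs "<element name>@<line number>" for each start tag.
class LocatingHandler extends DefaultHandler {
    private Locator locator;
    private final List<String> log = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attribs) {
        // Guard against parsers that never call setDocumentLocator().
        int line = (locator != null) ? locator.getLineNumber() : -1;
        log.add(qName + "@" + line);
    }

    // Parses the XML text and returns one log entry per element.
    static List<String> trace(String xml) {
        try {
            LocatingHandler handler = new LocatingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.log;
        } catch (Exception e) {
            // Demo only: wrap parser and I/O failures in an unchecked exception.
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        for (String entry : trace("<catalog>\n<book/>\n</catalog>")) {
            System.out.println(entry);
        }
    }
}
```

The sketch uses the qualified name so that it works whether or not the parser is namespace-aware.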
Start of Element
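When the startElement() function is called, the SAX parser
passes it a number of arguments. The first three are (in order): the namespace
URI, the local name, and the qualified name of the element. With SAX's default
feature settings, only the URI and the local name need to be supplied, while
the qualified name is optional. This presupposes namespace processing: a parser
created through javax.xml.parsers.SAXParserFactory is not namespace-aware unless
setNamespaceAware(true) has been called on the factory, and may then report an
empty local name. Since the catalog document does not introduce any XML
namespaces, we only use the local name in the present application.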
The last argument holds the attributes of the present element (if any) in a specific container, which allows retrieval of the attributes by their names, as well as iteration over all attributes using an integer index.
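Both access styles can be sketched as follows. The class AttributeLister, its inspect() helper, and the lang attribute in the sample input are assumptions made for illustration. Only the page attribute and the "unknown" default come from the catalog example. Unprefixed attributes belong to no namespace, so the lookup by name passes an empty string as the URI.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler showing lookup by name and iteration by index.
class AttributeLister extends DefaultHandler {
    private final List<String> seen = new ArrayList<>();
    private String page = "unknown";

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attribs) {
        // Retrieval by name; getValue() returns null if the attribute is absent.
        String value = attribs.getValue("", "page");
        page = (value != null) ? value : "unknown";

        // Iteration by index; SAX does not guarantee any attribute order.
        for (int i = 0; i < attribs.getLength(); i++) {
            seen.add(attribs.getLocalName(i) + "=" + attribs.getValue(i));
        }
    }

    public String getPage() { return page; }
    public List<String> getSeen() { return seen; }

    // Parses the XML text and returns the handler for inspection.
    static AttributeLister inspect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for non-empty local names
            AttributeLister handler = new AttributeLister();
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            // Demo only: wrap parser and I/O failures in an unchecked exception.
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        AttributeLister h = inspect("<article page=\"5\" lang=\"en\"/>");
        System.out.println(h.getPage() + " " + h.getSeen());
    }
}
```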
Elements are recognized by their local names. If the current element is
a complex element, an object of the appropriate type is instantiated and
pushed onto the stack. If the current element is simple, a new StringBuffer is pushed onto the stack instead, ready to accept character data.
Finally, the <article> element has an attribute, which is read from the attribs argument and inserted into the
newly created article object on top of the stack. The attribute
is extracted using the convenience function resolveAttrib(),
which returns the attribute value or a default text, if the attribute is
missing.
End of Element
The endElement() function is called with essentially the same
arguments as the startElement() function; only the list of attributes
is missing. In every case, the topmost object on the stack is popped,
cast to the proper type, and inserted into its parent, which now occupies
the top of the stack. Only the root element, which has no parent, is treated
differently: it is stored as the result of the unmarshalling.
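Stripped of the catalog-specific types, the push-on-start, pop-on-end cycle might look like the sketch below. The TreeBuilder class and its generic Node type are hypothetical stand-ins for Catalog, Book, and the other data classes. The sketch also uses java.util.ArrayDeque where the example uses java.util.Stack.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: builds a generic tree using the stack idiom.
class TreeBuilder extends DefaultHandler {
    // Illustrative node type standing in for the application's data classes.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    private final Deque<Node> stack = new ArrayDeque<>();
    private Node root;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attribs) {
        // "Shift": a new, still incomplete object goes on top of the stack.
        stack.push(new Node(qName));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // "Reduce": pop first, then look at the new top, which is the parent.
        Node done = stack.pop();
        if (stack.isEmpty()) {
            root = done;                      // root element has no parent
        } else {
            stack.peek().children.add(done);  // attach child to its parent
        }
    }

    // Parses the XML text and returns the root of the resulting tree.
    static Node build(String xml) {
        try {
            TreeBuilder handler = new TreeBuilder();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.root;
        } catch (Exception e) {
            // Demo only: wrap parser and I/O failures in an unchecked exception.
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        Node root = build("<catalog><book/><magazine><article/></magazine></catalog>");
        System.out.println(root.name + " has " + root.children.size() + " children");
    }
}
```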
Raw Text
Finally, the callback function named characters() is called
when the parser encounters raw text. It is passed a char array,
containing the actual data, as well as a position at which to start reading and
the length of data to be read from the array. Of course, it is illegal
to access the data array outside of those boundaries.
The implementation of the callback method inserts the data into the
StringBuffer on the stack.
The way the characters() function is called by the underlying
SAX parser often leads to some initial confusion, for two reasons.
Firstly, there is no guarantee that a stretch of contiguous data results
in only a single call to characters() -- it would be perfectly
legal for the parser to invoke the callback function for each individual
character of text! Although this is certainly an extreme scenario, it is
quite common for text with embedded entity references to result in several calls
to characters(): one for the text before the reference, a separate
call for the entity itself, and finally, one for the remaining text.
This is the reason that a StringBuffer is pushed on the stack
if a simple element is encountered when reading the example document.
(In fact, using a StringBuffer with the characters()
callback function is a common idiom when using the SAX API.)
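The accumulation idiom can be tried out with a short sketch like the following. TextCollector and collect() are made-up names, and the sketch uses StringBuilder, the unsynchronized sibling of StringBuffer. The number of characters() calls it reports depends on the parser, so the sketch relies only on the accumulated text.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates all character data and counts the calls.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int calls = 0;

    @Override
    public void characters(char[] data, int start, int length) {
        calls++;
        // Append only the slice [start, start + length) of the array.
        text.append(data, start, length);
    }

    public String getText() { return text.toString(); }
    public int getCalls() { return calls; }

    // Parses the XML text and returns the handler for inspection.
    static TextCollector collect(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            // Demo only: wrap parser and I/O failures in an unchecked exception.
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        TextCollector c = collect("<title>Tom &amp; Jerry</title>");
        // The call count is parser-dependent; the joined text is not.
        System.out.println(c.getCalls() + " call(s): " + c.getText());
    }
}
```

Because every chunk is appended to the same buffer, the handler yields the complete text regardless of how the parser chooses to split it.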
The second reason that characters() can lead to confusion
results from the fact that it is called for all text characters
encountered by the parser, including whitespace, even the whitespace
between element tags (such as newlines and tabs). This is surprising,
since ContentHandler defines a special callback method
ignorableWhitespace(), taking the same arguments as
characters(). However, without a DTD or XML Schema, this
method is never called, since there is no way for the parser to
distinguish whether some whitespace is ignorable or not.
In the present example program, the boolean flag isStackReadyForText
serves to distinguish character data that belongs to a recognized simple
element from all other text, including such whitespace. The stack only becomes
ready to accept text when a simple element has started and before it has ended.
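The whitespace behavior is easy to observe with a small probe. In the sketch below, WhitespaceProbe and probe() are hypothetical names. For a document without a DTD, the probe counts how often characters() receives nothing but whitespace and how often ignorableWhitespace() is invoked.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts whitespace-only text events of both kinds.
class WhitespaceProbe extends DefaultHandler {
    private int whitespaceOnlyCalls = 0;
    private int ignorableCalls = 0;

    @Override
    public void characters(char[] data, int start, int length) {
        // trim() strips spaces, tabs, and newlines; empty means whitespace only.
        if (new String(data, start, length).trim().isEmpty()) {
            whitespaceOnlyCalls++;
        }
    }

    @Override
    public void ignorableWhitespace(char[] data, int start, int length) {
        ignorableCalls++;
    }

    public int getWhitespaceOnlyCalls() { return whitespaceOnlyCalls; }
    public int getIgnorableCalls() { return ignorableCalls; }

    // Parses the XML text and returns the handler for inspection.
    static WhitespaceProbe probe(String xml) {
        try {
            WhitespaceProbe handler = new WhitespaceProbe();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            // Demo only: wrap parser and I/O failures in an unchecked exception.
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<catalog>\n  <book>\n    <title>X</title>\n  </book>\n</catalog>";
        WhitespaceProbe p = probe(xml);
        System.out.println("whitespace-only characters() calls: "
                + p.getWhitespaceOnlyCalls());
        System.out.println("ignorableWhitespace() calls: " + p.getIgnorableCalls());
    }
}
```

A flag such as the one in the example, or a check like the trim() test above, is therefore needed to keep the indentation between tags out of the unmarshalled data.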