Monday, August 23, 2010

Parsing complex nested XML with SAXParser and collections

In my previous example, I explained how to parse simple XML with one or multiple elements or objects. This post goes beyond that. Have you ever encountered an XML file whose elements are nested, and whose nested elements contain the same kind of nested objects again, so that the tree keeps growing? I happened to encounter such an XML file, a few MB in size. Collections like java.util.Stack or java.util.Map are very useful in this case. While parsing, you have to add an object as a child of another object, but it becomes difficult to determine which parent object the child belongs to. The image below shows a basic tree structure of the XML file in this demo example.


The image shows an element 'business-object' with some attributes, and with fields and references as sub-elements. These references are again business objects, and the nesting continues. Let us study the snapshot above, design and develop the Java objects first, then write the Handler using a Stack, and finally parse the file. The XML file and all the Java source files are available at the bottom. I will only show brief snippets of the code while explaining.
Starting from business-object: it has fields and references (which are again business objects). So in my BusinessObject.java I would declare the variables as below
    private final int id;
    private final String type;
    // list of fields.
    private List<Field> fields = null;
    // list of references which are business objects again.
    private List<BusinessObject> references = null;

    // .........

    /**
     * Adds one more field to the business object
     */
    public void addField(Field field) {
        // lazy initialization.
        if (fields == null)
            fields = new ArrayList<Field>();
        fields.add(field);
    }

    public void addReference(BusinessObject reference) {
        // lazy initialization.
        if (references == null)
            references = new ArrayList<BusinessObject>();
        references.add(reference);
    }
The method addField() adds another field to the current list of fields. Similarly, addReference() adds another BusinessObject to the references. Note that these lists are lazily initialized, i.e. a List is only created when needed and otherwise remains null. When we have a big hierarchy of object relations, this saves unnecessary object creation. The flip side is that getFields() and getReferences() return null until something has been added, so callers must check for null. Now, what is this Field class? There is something special inside. If you look at the XML file business-object.xml, a business-object also has <ref> elements (not in the snapshot above). Such an element is again a field, but with a different meaning for the object. Hence, the snippet from Field.java is
    public Field(String column) {
        this(column, false);
    }

    public Field(String column, boolean refField) {
        this.column = column;
        this.isReferenceField = refField;
    }

    public boolean isReferenceField() {
        return isReferenceField;
    }
The boolean determines whether the field is a simple field or a reference field. Use whichever constructor is needed to create a field or a reference field. Now the Handler. We can use either a Stack or a Map (HashMap/LinkedHashMap). In the Map approach, as soon as you parse a business-object you store it in the map keyed by its id; to get the business object currently being parsed, you take that id and retrieve the object from the Map. I have used the Stack approach. Below is a snippet from Handler.java with only the variable declarations and the methods startElement() and endElement().
    private final String BUSINESS_OBJECT = "business-object",
            REFERENCES = "references", FIELD = "field", REF_FIELD = "ref";
    // XML tag attributes
    private final String ID_ATTR = "id", TYPE_ATTR = "type", 
            COL_ATTR = "col", TABLE_ATTR = "table";
    // parent business object
    private BusinessObject businessObject;
    // internal collection to capture temporary objects while parsing.
    private Stack<BusinessObject> references = new Stack<BusinessObject>();
    private Field lastField = null;

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {

        if (BUSINESS_OBJECT.equals(qName)) {
            // create new business object.
            BusinessObject currentBusinessObject = new BusinessObject(getInt(
                    attributes, ID_ATTR), getString(attributes, TYPE_ATTR));

            // TRICK: see if any business object is already in stack, if so add
            // this as its reference
            BusinessObject lastBusinessObject = getLastBusinessObject();
            if (lastBusinessObject != null)
                lastBusinessObject.addReference(currentBusinessObject);
            else {
                // this is the first business object
                businessObject = currentBusinessObject;
            }

            // new business object is being parsed, add current to stack
            references.push(currentBusinessObject);
        } else if (REFERENCES.equals(qName)) {
            // References are again business objects, do nothing as that will be
            // added to underlying references to parent business object.
        } else if (FIELD.equals(qName)) {
            // create new field
            Field f = new Field(getString(attributes, COL_ATTR));
            // TRICK: add this field to the most recent business object
            references.peek().addField(f);
            // TRICK: hold this field's ref. so that its value can be stored in
            // endElement().
            lastField = f;
        } else if (REF_FIELD.equals(qName)) {
            // create new ref field
            Field f = new Field(getString(attributes, TABLE_ATTR), true);
            f.setValue(getString(attributes, ID_ATTR));
            // TRICK: add this ref field to the most recent business object
            references.peek().addField(f);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (BUSINESS_OBJECT.equals(qName)) {
            // done processing this Business object, remove from stack.
            references.pop();
        } else if (FIELD.equals(qName)) {
            // capture value
            String val = getBufferValue();
            // set the value by getting the ref of last field.
            lastField.setValue(val);
        }
        // always clear the buffer
        buffer.setLength(0);
    }
We have a Stack of BusinessObject onto which we push each business-object in startElement(). When the first business-object opens, the Stack contains only that object, so it is the one on top and the one that peek() or pop() will return. If the parser enters a second business-object while the first is not yet closed, the object on top of the stack is the second one, and that is the current object. endElement() pops (removes) the most recently pushed object from the stack, so the first object is on top again. With this approach, whenever a <field> is encountered, we get the top object from the stack (via peek(), which returns the reference without removing it) and add the field to it. The same applies to a nested business-object: it is added as a reference to whatever object is on top of the stack, and only then pushed itself.
Another challenge is getting hold of the Field when its element closes: we only get the value of the field in endElement(), but the object was already created in startElement(). This is solved by holding a reference to the last Field and using it when needed. The snippet above uses several private utility methods, which can be copied from the full sources at the end of this post.
Now that we have the Handler, use it to parse the file. I use the same Parser.java as in my previous example for simple XML; nothing changes there. The demo class App.java, which parses the file and displays the results, is at the end along with the other sources.
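To watch the push/peek/pop sequence in isolation, here is a minimal sketch that only records which business-object is the parent of which. It is not part of the sources of this post: the class names StackParentDemo and ParentTrackingHandler and the method parentLinks() are made up for illustration, and it uses java.util.ArrayDeque in place of java.util.Stack, because its peek() returns null on an empty collection instead of throwing EmptyStackException.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the post's sources.
class StackParentDemo {

    /** Records "child<-parent" for every <business-object> it sees. */
    static class ParentTrackingHandler extends DefaultHandler {
        // ids of the business objects that are currently open, innermost on top
        private final Deque<String> open = new ArrayDeque<String>();
        final List<String> links = new ArrayList<String>();

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes attributes) {
            if ("business-object".equals(qName)) {
                String id = attributes.getValue("id");
                // peek() returns null on an empty ArrayDeque: no parent yet
                String parent = open.peek();
                links.add(id + "<-" + (parent == null ? "root" : parent));
                open.push(id);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("business-object".equals(qName)) {
                // this object is closed, its parent is on top again
                open.pop();
            }
        }
    }

    /** Parses the given XML text and returns the recorded parent links. */
    static List<String> parentLinks(String xml) {
        ParentTrackingHandler handler = new ParentTrackingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo XML", e);
        }
        return handler.links;
    }

    public static void main(String[] args) {
        String xml = "<business-object id='111'><references>"
                + "<business-object id='001'><references>"
                + "<business-object id='101'/></references></business-object>"
                + "<business-object id='002'/>"
                + "</references></business-object>";
        // prints [111<-root, 001<-111, 101<-001, 002<-111]
        System.out.println(parentLinks(xml));
    }
}
```

Note that 002 is linked to 111 and not to 001: by the time 002 opens, 001 has already been popped, which is exactly why the stack always knows the right parent.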
Do post your comments or suggestions.

business-object.xml
<?xml version="1.0" encoding="UTF-8" ?>
<document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <business-object type="organization" id="111">
        <field col="name">My Organization</field>
        <field col="type">Services</field>
        <references type="department" ref="organization">
            <business-object type="department" id="001">
                <field col="name">Development</field>
                <field col="location">INDIA</field>
                <field col="code">DEV-IN</field>
                <references type="person" ref="department">
                    <business-object type="person" id="101">
                        <field col="name">Mohammed Bin Mahmood</field>
                        <field col="code">MBM</field>
                        <field col="role">Dev</field>
                        <ref table="department" id="001" />
                    </business-object>
                </references>
            </business-object>
            <business-object type="department" id="002">
                <field col="name">Research</field>
                <field col="location">INDIA</field>
                <field col="code">RES-IN</field>
                <references type="person" ref="department">
                    <business-object type="person" id="201">
                        <field col="name">Mohammed</field>
                        <field col="code">MD</field>
                        <field col="role">Sci</field>
                        <ref table="department" id="002" />
                    </business-object>
                    <business-object type="person" id="202">
                        <field col="name">Mohammed B.</field>
                        <field col="code">MB</field>
                        <field col="role">Java</field>
                        <ref table="department" id="002" />
                    </business-object>
                </references>
            </business-object>
        </references>
    </business-object>
</document>
BusinessObject.java
package com.mbm.demo.xml2.main;

import java.util.ArrayList;
import java.util.List;

/**
 * @author Mohammed Bin Mahmood
 */
public class BusinessObject {

    private final int id;
    private final String type;
    // list of fields.
    private List<Field> fields = null;
    // list of references which are business objects again.
    private List<BusinessObject> references = null;

    public BusinessObject(int id, String type) {
        this.id = id;
        this.type = type;
    }

    public int getId() {
        return id;
    }

    public String getType() {
        return type;
    }

    public List<Field> getFields() {
        return fields;
    }

    /**
     * Adds one more field to the business object
     */
    public void addField(Field field) {
        // lazy initialization.
        if (fields == null)
            fields = new ArrayList<Field>();
        fields.add(field);
    }

    public List<BusinessObject> getReferences() {
        return references;
    }

    public void addReference(BusinessObject reference) {
        // lazy initialization.
        if (references == null)
            references = new ArrayList<BusinessObject>();
        references.add(reference);
    }

    /**
     * Determines whether this business object has any further references.
     */
    public boolean hasReferences() {
        return references != null;
    }

    @Override
    public String toString() {
        return id + "-" + type + " has "
                + (hasReferences() ? references.size() : "no") + " reference(s)";
    }
}
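Because the lists in BusinessObject.java are created lazily, getFields() and getReferences() return null until the first add call. If you would rather not null-check at every call site, one possible variation is to let the getter hand out an empty list instead. The sketch below shows the idea on a reduced, made-up class (LazyNode is not part of the sources of this post, and it stores plain strings instead of Field objects to stay self-contained).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for BusinessObject, reduced to one lazily created list.
class LazyNode {
    private List<String> fields = null; // created on first addField()

    void addField(String field) {
        // lazy initialization, as in BusinessObject.addField()
        if (fields == null) {
            fields = new ArrayList<String>();
        }
        fields.add(field);
    }

    /** Never returns null: an untouched node reports a shared empty list. */
    List<String> getFields() {
        if (fields == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(fields);
    }

    public static void main(String[] args) {
        LazyNode node = new LazyNode();
        for (String f : node.getFields()) { // safe: zero iterations
            System.out.println(f);
        }
        node.addField("name");
        System.out.println(node.getFields()); // prints [name]
    }
}
```

Collections.emptyList() returns a shared immutable instance, so the saving from lazy initialization is kept: no ArrayList is allocated for objects without fields.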
Field.java
package com.mbm.demo.xml2.main;

/**
 * @author Mohammed Bin Mahmood
 */
public class Field {
    private final String column;
    private String value;
    private boolean isReferenceField = false;

    public Field(String column) {
        this(column, false);
    }

    public Field(String column, boolean refField) {
        this.column = column;
        this.isReferenceField = refField;
    }

    public boolean isReferenceField() {
        return isReferenceField;
    }

    public String getColumn() {
        return column;
    }

    public String getValue() {
        return value;
    }

    public void setValue(String value) {
        this.value = value;
    }

    @Override
    public String toString() {
        return (isReferenceField ? "REF>" : "FIELD>") + column + "-" + value;
    }
}
Handler.java
package com.mbm.demo.xml2.parser;

import java.util.EmptyStackException;
import java.util.Stack;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import com.mbm.demo.xml2.main.BusinessObject;
import com.mbm.demo.xml2.main.Field;

/**
 * @author Mohammed Bin Mahmood
 */
public class Handler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder(128);
    // XML tag names
    private final String BUSINESS_OBJECT = "business-object",
            REFERENCES = "references", FIELD = "field", REF_FIELD = "ref";
    // XML tag attributes
    private final String ID_ATTR = "id", TYPE_ATTR = "type", 
            COL_ATTR = "col", TABLE_ATTR = "table";
    // parent business object
    private BusinessObject businessObject;
    // internal collection to capture temporary objects while parsing.
    private Stack<BusinessObject> references = new Stack<BusinessObject>();
    private Field lastField = null;

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // add characters to the buffer
        buffer.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {

        if (BUSINESS_OBJECT.equals(qName)) {
            // create new business object.
            BusinessObject currentBusinessObject = new BusinessObject(getInt(
                    attributes, ID_ATTR), getString(attributes, TYPE_ATTR));

            // TRICK: see if any business object is already in stack, if so add
            // this as its reference
            BusinessObject lastBusinessObject = getLastBusinessObject();
            if (lastBusinessObject != null)
                lastBusinessObject.addReference(currentBusinessObject);
            else {
                // this is the first business object
                businessObject = currentBusinessObject;
            }

            // new business object is being parsed, add current to stack
            references.push(currentBusinessObject);
        } else if (REFERENCES.equals(qName)) {
            // References are again business objects, do nothing as that will be
            // added to underlying references to parent business object.
        } else if (FIELD.equals(qName)) {
            // create new field
            Field f = new Field(getString(attributes, COL_ATTR));
            // TRICK: add this field to the most recent business object
            references.peek().addField(f);
            // TRICK: hold this field's ref. so that its value can be stored in
            // endElement().
            lastField = f;
        } else if (REF_FIELD.equals(qName)) {
            // create new ref field
            Field f = new Field(getString(attributes, TABLE_ATTR), true);
            f.setValue(getString(attributes, ID_ATTR));
            // TRICK: add this ref field to the most recent business object
            references.peek().addField(f);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (BUSINESS_OBJECT.equals(qName)) {
            // done processing this Business object, remove from stack.
            references.pop();
        } else if (FIELD.equals(qName)) {
            // capture value
            String val = getBufferValue();
            // set the value by getting the ref of last field.
            lastField.setValue(val);
        }
        // always clear the buffer
        buffer.setLength(0);
    }

    private BusinessObject getLastBusinessObject() {
        // just return the reference, do not remove.
        try {
            return references.peek();
        } catch (EmptyStackException e) {
            return null;
        }
    }

    /**
     * Returns the current value of the buffer, or null if it is empty or
     * whitespace. This method also resets the buffer.
     */
    private String getBufferValue() {
        if (buffer.length() == 0)
            return null;
        String value = buffer.toString().trim();
        buffer.setLength(0);
        return value.length() == 0 ? null : value;
    }

    /**
     * The business object parsed by this handler
     */
    public BusinessObject getBusinessObject() {
        return businessObject;
    }

    // --- UTILITIES ---

    private static int getInt(Attributes attributes, String name) {
        String v = getString(attributes, name);
        try {
            return Integer.parseInt(v);
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    private static String getString(Attributes attributes, String name) {
        String s = attributes.getValue(name);
        if (s == null)
            return null;
        // trim leading and trailing spaces.
        s = s.trim();
        // see if empty string.
        if (s.length() == 0)
            return null;
        return s;
    }
}
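One detail of Handler.java is worth keeping in mind: getInt() goes through Integer.parseInt(), so id="001" on a business-object becomes the int 1, while the id of a <ref> element is stored as the string "001". If you later want to match a ref field against a business object, comparing the two as text will fail. A small sketch of a numeric comparison follows; the class RefMatchDemo and the helper refPointsTo() are made up for illustration and are not part of the sources of this post.

```java
// Hypothetical helper, not part of the post's sources.
class RefMatchDemo {

    /** True when a ref value such as "001" names the object with this id. */
    static boolean refPointsTo(String refValue, int objectId) {
        if (refValue == null) {
            return false;
        }
        try {
            return Integer.parseInt(refValue.trim()) == objectId;
        } catch (NumberFormatException e) {
            return false; // non-numeric ref values never match
        }
    }

    public static void main(String[] args) {
        int departmentId = Integer.parseInt("001"); // what getInt() yields: 1
        // textual comparison loses the match because the zeros are gone
        System.out.println(String.valueOf(departmentId).equals("001")); // false
        // numeric comparison finds it
        System.out.println(refPointsTo("001", departmentId)); // true
    }
}
```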
App.java
package com.mbm.demo.xml2.main;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.xml.sax.SAXException;

import com.mbm.demo.xml2.parser.Handler;
import com.mbm.demo.xml2.parser.Parser;

/**
 * @author Mohammed Bin Mahmood
 */
public class App {

    public static void main(String[] args) throws IOException, SAXException {
        File xml = new File("business-object.xml");
        FileInputStream stream = new FileInputStream(xml);
        BusinessObject bo;
        try {
            Handler handler = new Handler();
            Parser parser = new Parser(handler);
            parser.parse(stream);
            bo = handler.getBusinessObject();
        } finally {
            stream.close();
        }

        // print bo;
        printBO(bo, "#");
    }

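    private static void printBO(BusinessObject bo, String indent) {
        // nothing was parsed, e.g. the file has no <business-object>
        if (bo == null)
            return;
        System.out.println(indent + bo);
        // print fields; the list is lazily created and may still be null.
        if (bo.getFields() != null) {
            for (Field f : bo.getFields()) {
                System.out.println(indent + indent + f);
            }
        }
        // print all references
        if (!bo.hasReferences())
            return;
        for (BusinessObject b : bo.getReferences()) {
            printBO(b, indent + indent);
        }
    }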
}
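App.java relies on Parser.java from the previous post, which is not reproduced here. If you do not have it at hand, a stand-in along the following lines should work. This is only my assumption of its shape, based on how App.java calls it (a constructor taking the handler and a parse(InputStream) method); the class names StandInParser and ParserDemo are made up, and the real Parser.java may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Assumed shape only: the real Parser.java lives in the earlier post.
class StandInParser {
    private final DefaultHandler handler;

    StandInParser(DefaultHandler handler) {
        this.handler = handler;
    }

    /** Feeds the stream through a default JAXP SAX parser into the handler. */
    void parse(InputStream stream) throws IOException, SAXException {
        try {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            saxParser.parse(stream, handler);
        } catch (ParserConfigurationException e) {
            throw new SAXException("could not create a SAX parser", e);
        }
    }
}

// Hypothetical demo of the stand-in with a trivial counting handler.
class ParserDemo {

    /** Counts start tags with the given name. */
    static int countElements(String xml, final String name) {
        final int[] count = { 0 };
        DefaultHandler counter = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes attributes) {
                if (name.equals(qName)) {
                    count[0]++;
                }
            }
        };
        try {
            new StandInParser(counter).parse(
                    new ByteArrayInputStream(xml.getBytes("UTF-8")));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo XML", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<document><field>a</field><field>b</field></document>";
        System.out.println(countElements(xml, "field")); // prints 2
    }
}
```

The factory is left at its defaults, so the parser is not namespace aware and the element name arrives in qName, which is what Handler.java compares against.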

3 comments:

  1. Thanks Much for this code, but I got confused little bit, Mohammed I have this Xml file and I need to read each class element class and print all classes that depend on(used by), so can you help me with that,

  2. These are the most complete references to reading nested XML files with attributes I have seen to date. Thank you for these posts.

  3. this is great example, you gave me a good idea to parse using stack - thanks
