* What is the disk DOM? *

This DOM implementation is just like your average DOM except that it can store the parsed node data to a file somewhere. Most DOM implementations store their parsed node data in RAM, meaning the available RAM in the computer limits the size of the XML file that they can create or parse.

This disk DOM stores its parsed node data to a disk file, meaning you can create or parse really large XML files and not eat up all your RAM. The trade-off for this is speed - a RAM based DOM implementation is much faster.

This DOM implementation implements all the classes and interfaces in the org.w3c.dom package, and the DocumentBuilder and DocumentBuilderFactory classes in javax.xml.parsers. It passes all the tests in the test suite at www.w3c.org, so in theory it should do most things a DOM implementation does.

I wrote this disk DOM because I wanted to be able to use XSLT to transform very large XML files, and I just couldn't find a DOM implementation that could handle XML files that big. Yes I could have written a one-off SAX based or proprietary solution, but I needed to be able to apply lots of different and complex transformations to many different shapes of XML files, and XSLT was just the thing for that job. All I needed was a DOM implementation that could hold really large XML files.

* How to use the disk DOM *

The diskdom.jar file has the "javax.xml.parsers.DocumentBuilderFactory" service registered within itself, so to use the disk DOM you need only include the jar file in your classpath when you run your program.

eg If we wanted to run a java program called "MyTest.class", and the "PersistentDOM.jar" file was in the C:\tmp directory then the command line to run the program would look like:

java -classpath c:\tmp\PersistentDOM.jar; MyTest

Alternatively, if for some reason you can't use the jar file (ie you have re-compiled the disk DOM source into class files and want to use those class files with your program) you can set the "javax.xml.parsers.DocumentBuilderFactory" system property to the name of the DocumentBuilderFactory class of this DOM implementation, in this case:

au.com.explodingsheep.persistentDOM.documentBuilder.MyDocumentBuilderFactory

eg If you have all the classes from the diskdom.jar file in the "c:\tmp\bin" directory and you want to run the program "MyTest.class", you would use the following command line:

java -classpath c:\tmp\bin; -Djavax.xml.parsers.DocumentBuilderFactory=au.com.explodingsheep.persistentDOM.documentBuilder.MyDocumentBuilderFactory MyTest
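
To check which DocumentBuilderFactory JAXP has actually selected, a small program like the one below may help. It is only a sketch that uses standard JAXP calls; the class name "FactoryCheck" is made up for this illustration. If the classpath or system property is set up as described above, the "Active" line should presumably show the MyDocumentBuilderFactory class name rather than the JDK's built-in factory.

```java
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper (not part of the disk DOM): reports which
// DocumentBuilderFactory implementation JAXP has selected.
class FactoryCheck
{
  // Class name of the factory that DocumentBuilderFactory.newInstance() returns.
  static String activeFactoryName()
  {
    return DocumentBuilderFactory.newInstance().getClass().getName();
  }

  // Value of the JAXP system property, or null if it has not been set.
  static String configuredFactoryName()
  {
    return System.getProperty( "javax.xml.parsers.DocumentBuilderFactory" );
  }

  public static void main( String args[] )
  {
    System.out.println( "Configured: " + configuredFactoryName() );
    System.out.println( "Active:     " + activeFactoryName() );
  }
}
```

Run it with the same -classpath and -D options you intend to use for your own program.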

* How to tell the disk DOM where to write the parsed data to *

The DiskDOM implementation of the DocumentBuilderFactory has an additional method where you can specify the name of the file where DocumentBuilder instances must write their parsed node data to. That method is called "setStorageFileName( String fileName )".

When you call DocumentBuilderFactory.newInstance() it will return an instance of a MyDocumentBuilderFactory. This class has the "setStorageFileName( String fileName )" method.

Calling this method tells the Factory that the DocumentBuilder instance it next creates with "newDocumentBuilder()" will write its parsed node data to the specified filename.

If you don't specify a filename, or use a filename of null then the parsed node data will be stored in RAM, making this DOM implementation work like any other DOM implementation.

Use the "newDocument" or "parse" methods of DocumentBuilder to create a Document as you normally would.

A disk DOM file can hold multiple Documents: just call "newDocument" or "parse" as many times as required. Each Document instance returned from "newDocument" and "parse" has an additional method called "setDocumentName( String name )" where you specify a name to identify the Document in the disk DOM file. If, in the DocumentBuilderFactory, you specify the filename of an already existing disk DOM file, you can re-load the Documents in it using the names you gave them.

Below is an example of parsing an XML file called "/tmp/TestData.xml" into a Document, naming that Document "TestDocument", and having the parsed node data be stored in the disk DOM file "/tmp/ParsedNodeData.dom".

MyDocumentBuilderFactory documentBuilderFactory = ( MyDocumentBuilderFactory ) DocumentBuilderFactory.newInstance(); 
documentBuilderFactory.setStorageFileName( "/tmp/ParsedNodeData.dom" ); 
MyDocumentBuilder documentBuilder = ( MyDocumentBuilder ) documentBuilderFactory.newDocumentBuilder(); 
try 
{ 
  InputStream inputStream = new FileInputStream( "/tmp/TestData.xml" ); 
  MyDocument document = ( MyDocument ) documentBuilder.parse( inputStream ); 
  document.setDocumentName( "TestDocument" ); 
} 
finally 
{ 
  documentBuilder.close(); 
}
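
The cast to MyDocumentBuilderFactory in the example above only works when the disk DOM is the active JAXP implementation. If the same program must also run without it, one possible approach is to look up "setStorageFileName" by reflection and fall back to RAM storage when it is missing. This is a sketch under that assumption, not something the disk DOM provides; the class name "OptionalDiskStorage" and the helper method are made up for this illustration.

```java
import java.lang.reflect.Method;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper (not part of the disk DOM): sets the storage filename
// only if the active factory offers a setStorageFileName( String ) method.
class OptionalDiskStorage
{
  // Returns true if the method exists and was called, false otherwise.
  static boolean trySetStorageFileName( DocumentBuilderFactory factory, String fileName )
  {
    try
    {
      Method method = factory.getClass().getMethod( "setStorageFileName", String.class );
      method.invoke( factory, fileName );
      return true;
    }
    catch ( ReflectiveOperationException e )
    {
      // Not the disk DOM factory (or the call failed): nodes stay in RAM.
      return false;
    }
  }

  public static void main( String args[] )
  {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    boolean onDisk = trySetStorageFileName( factory, "/tmp/ParsedNodeData.dom" );
    System.out.println( onDisk ? "Using disk storage" : "Using RAM storage" );
  }
}
```

The sketch does not cover closing the DocumentBuilder afterwards, which a disk-backed builder still needs.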

Once you are finished creating your Document instances in the parsed-data file you must close that file by calling the "close()" method of the MyDocumentBuilder class.

!!! Make sure you don't create multiple DocumentBuilder instances that all write to the same file, as bad things will happen. Each time you call "newDocumentBuilder" make sure you first specify a different storage filename, ie don't do this:

MyDocumentBuilderFactory dbf = ( MyDocumentBuilderFactory ) DocumentBuilderFactory.newInstance(); 
dbf.setStorageFileName( "/tmp/MyDOM/ParsedData.dom" ); 
MyDocumentBuilder db1 = ( MyDocumentBuilder ) dbf.newDocumentBuilder(); 
MyDocumentBuilder db2 = ( MyDocumentBuilder ) dbf.newDocumentBuilder(); 
// !!! Don't do this!

Instead change the filename for each DocumentBuilder you create:

MyDocumentBuilderFactory dbf = ( MyDocumentBuilderFactory ) DocumentBuilderFactory.newInstance(); 
dbf.setStorageFileName( "/tmp/MyDOM/ParsedData1.dom" ); 
MyDocumentBuilder db1 = ( MyDocumentBuilder ) dbf.newDocumentBuilder(); 
dbf.setStorageFileName( "/tmp/MyDOM/ParsedData2.dom" ); 
MyDocumentBuilder db2 = ( MyDocumentBuilder ) dbf.newDocumentBuilder();

This doesn't apply if you are writing the parsed node data to RAM, as the RAM "disk"s both grab whatever RAM they need without conflicting with each other, ie the following is OK because the parsed node data is being stored to RAM, not a disk file.

MyDocumentBuilderFactory dbf = ( MyDocumentBuilderFactory ) DocumentBuilderFactory.newInstance(); 
dbf.setStorageFileName( null ); 
MyDocumentBuilder db1 = ( MyDocumentBuilder ) dbf.newDocumentBuilder(); 
MyDocumentBuilder db2 = ( MyDocumentBuilder ) dbf.newDocumentBuilder();
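
In a larger program it can be hard to see by eye whether two DocumentBuilders share a storage file. One way to catch the mistake early is to record each filename before handing it to the factory. The sketch below is an illustration only: "StorageFileRegistry" and its methods are made-up names, not part of the disk DOM, and it assumes every part of your program goes through the same registry instance.

```java
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical guard (not part of the disk DOM): remembers which storage
// files are in use so the same file is never given to two DocumentBuilders.
class StorageFileRegistry
{
  private final Set<String> claimed = new HashSet<String>();

  // Turns relative and absolute spellings of a path into one comparable form.
  private static String normalise( String fileName )
  {
    return Paths.get( fileName ).toAbsolutePath().normalize().toString();
  }

  // Call before setStorageFileName( fileName ). A null filename means RAM
  // storage, which is always allowed.
  synchronized void claim( String fileName )
  {
    if ( fileName == null )
    {
      return;
    }
    if ( !claimed.add( normalise( fileName ) ) )
    {
      throw new IllegalStateException( "Storage file already in use: " + fileName );
    }
  }

  // Call after the DocumentBuilder using this file has been closed.
  synchronized void release( String fileName )
  {
    if ( fileName != null )
    {
      claimed.remove( normalise( fileName ) );
    }
  }

  public static void main( String args[] )
  {
    StorageFileRegistry registry = new StorageFileRegistry();
    registry.claim( "/tmp/MyDOM/ParsedData1.dom" );
    registry.claim( "/tmp/MyDOM/ParsedData2.dom" );
    try
    {
      registry.claim( "/tmp/MyDOM/ParsedData1.dom" );
    }
    catch ( IllegalStateException e )
    {
      System.out.println( e.getMessage() );
    }
  }
}
```

The registry does not notice symbolic links or other processes using the same file.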

To open a previously parsed Document use the "getDocument( String documentName )" method of the MyDocumentBuilder instance. The "documentName" parameter in this method refers to the name of the Document node.

eg to open the already existing disk DOM file called "/tmp/ParsedNodeData.dom" and load a Document called "TestDocument":

MyDocumentBuilderFactory documentBuilderFactory = ( MyDocumentBuilderFactory ) DocumentBuilderFactory.newInstance(); 
documentBuilderFactory.setStorageFileName( "/tmp/ParsedNodeData.dom" ); 
MyDocumentBuilder documentBuilder = ( MyDocumentBuilder ) documentBuilderFactory.newDocumentBuilder(); 
MyDocument document = ( MyDocument ) documentBuilder.getDocument( "TestDocument" );

* Classes and interfaces of note *

public interface MyDocumentBuilderFactory extends DocumentBuilderFactory 
{ 
  /** 
   * Sets whether the DocumentBuilder we create in "newDocumentBuilder" 
   * writes its data to disk. 
   * If we pass null here the DocumentBuilder will write its nodes to RAM. 
   */ 

  public void setStorageFileName( String storageFileName ); 
} 


public interface MyDocumentBuilder extends DocumentBuilder 
{ 
  /** 
   * Loads and returns the document with the passed Identifier. 
   */ 

  public abstract MyDocument getDocument( String documentName ) throws MyDocumentBuilderException;

  /** 
   * Returns the names of all the MyDocuments in this 
   * DocumentBuilder. 
   */ 

  public abstract String[] getDocumentNames() throws MyDocumentBuilderException; 
} 


public interface MyDocument extends Document 
{ 
  /** 
   * When a MyDocumentBuilder is closed, all the MyDocuments in it 
   * are also closed. 
   * When a MyDocumentBuilder is re-opened, and its disk DOM file 
   * refers to an already existing file, then the MyDocuments in 
   * that file can be retrieved. This document-name is how you 
   * retrieve each MyDocument. 
   */ 

  public void setDocumentName( String documentName ) throws NodeStoreException, DOMException, MyDOMImplementationException; 
}

* How to compile the source code *

If you have downloaded the source code it can be re-compiled using ant. There is an ant build.xml file in the root of the src directory of the source code.

* Note on using XSLT - problems with xalan, using saxon *

When using the Xalan XSLT processor I notice that lots of RAM is used, proportional to the size of the source DOM document. I suspect that Xalan is caching the DOM nodes into its internal DTM (Document Table Model) classes - thus nullifying the reason for using a disk based DOM. I could be wrong on this but it is the only thing I can think of. If I use the Saxon XSLT processor no additional RAM is used.
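
JAXP chooses the XSLT processor in much the same way as it chooses the DOM implementation: through the "javax.xml.transform.TransformerFactory" system property, or through whichever processor jar is found on the classpath. The sketch below uses only standard JAXP calls to show which processor is active and to run a trivial identity transform through it; the class name "TransformerCheck" is made up for this illustration. To switch to Saxon you would set the property to Saxon's factory class. For recent Saxon releases I believe this is net.sf.saxon.TransformerFactoryImpl, but that is an assumption you should check against your Saxon version.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper (not part of the disk DOM): reports the active XSLT
// processor and pushes a tiny DOM Document through it.
class TransformerCheck
{
  // Class name of the factory that TransformerFactory.newInstance() returns.
  static String activeTransformerFactoryName()
  {
    return TransformerFactory.newInstance().getClass().getName();
  }

  // Builds a one-element Document and copies it to a String with an
  // identity transform (a Transformer created without a stylesheet).
  static String identityTransform()
  {
    try
    {
      Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      document.appendChild( document.createElement( "TestElement" ) );
      Transformer transformer = TransformerFactory.newInstance().newTransformer();
      StringWriter writer = new StringWriter();
      transformer.transform( new DOMSource( document ), new StreamResult( writer ) );
      return writer.toString();
    }
    catch ( ParserConfigurationException | TransformerException e )
    {
      throw new IllegalStateException( e );
    }
  }

  public static void main( String args[] )
  {
    System.out.println( "XSLT processor: " + activeTransformerFactoryName() );
    System.out.println( identityTransform() );
  }
}
```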

* Performance *

Speed: When parsing an XML file using the DocumentBuilder.parse( InputStream ) method the disk DOM parses around 500 nodes per second on my Pentium 450 with a rather slow hard disk.

File size: If the DocumentBuilderFactory has been told to ignore whitespace then on average the disk DOM file produced is about 6 to 7 times larger than the source XML file. If whitespace is not being ignored then the disk DOM file is around 10 times larger than the source XML file. eg With whitespace being ignored, a sample source XML file of 1MB produces a disk DOM file of around 6.9MB.
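
The figures above can be turned into a rough estimate before parsing a large file. The sketch below is plain arithmetic on those figures; the class name "DiskDomEstimate" and the node count of 100000 are made up for the illustration, and actual numbers will depend on your hardware and on the shape of your XML.

```java
// Hypothetical helper (not part of the disk DOM): back-of-envelope estimates
// based on the multipliers and parse speed quoted in this document.
class DiskDomEstimate
{
  // Estimated disk DOM file size in bytes. The multiplier is about 6 to 7
  // when whitespace is ignored and about 10 when it is not.
  static long estimatedFileBytes( long xmlBytes, double multiplier )
  {
    return Math.round( xmlBytes * multiplier );
  }

  // Estimated parse time in seconds at the given parse speed.
  static double estimatedParseSeconds( long nodeCount, double nodesPerSecond )
  {
    return nodeCount / nodesPerSecond;
  }

  public static void main( String args[] )
  {
    long oneMegabyte = 1024L * 1024L;
    System.out.println( estimatedFileBytes( oneMegabyte, 6.9 ) + " bytes on disk" );
    System.out.println( estimatedParseSeconds( 100000, 500.0 ) + " seconds to parse" );
  }
}
```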

* Problems *

The naming convention for classes and interfaces is bad. I started naming my own interface extensions to the Node, Element, Text etc interfaces as MyNode, MyElement etc. I then made classes that implement the "My" interfaces as DefaultMyNode, DefaultMyElement etc. But then later on I have created some classes that start with "My" instead of "DefaultMy". I need to give every class an underlying interface "xxx", then have classes called "xxxImpl". Yes this is obvious, but I have been writing this part time so slipped a bit here and there.

The DOM2 specific functionality needs to be thoroughly tested. Although I have implemented the DOM level 2 methods I know many of the rules in the DOM2 specification are not being enforced. I will get onto this as a matter of urgency.

* Examples *

Parse the XML file "/tmp/TestData.xml" into a Document, give it a name of "TestDocument", and save all the parsed node data into the file "/tmp/ParsedNodeData.dom".

import java.io.*; 
import javax.xml.parsers.*; 
import org.w3c.dom.*; 
import au.com.explodingsheep.persistentDOM.*; 
import au.com.explodingsheep.persistentDOM.documentBuilder.*; 
public class ExampleParseXMLDocument 
{ 
  public static void main( String args[] ) 
  { 
    try 
    { 
      MyDocumentBuilderFactory documentBuilderFactory = ( MyDocumentBuilderFactory ) DocumentBuilderFactory.newInstance(); 
      documentBuilderFactory.setStorageFileName( "/tmp/ParsedNodeData.dom" ); 
      MyDocumentBuilder documentBuilder = ( MyDocumentBuilder ) documentBuilderFactory.newDocumentBuilder(); 
      try 
      { 
        InputStream inputStream = new FileInputStream( "/tmp/TestData.xml" ); 
        MyDocument document = ( MyDocument ) documentBuilder.parse( inputStream ); 
        document.setDocumentName( "TestDocument" ); 
      } 
      finally 
      { 
        documentBuilder.close(); 
      } 
    } 
    catch ( Exception e ) 
    { 
      e.printStackTrace(); 
    } 
  } 
}

Re-open the already parsed node data file "/tmp/ParsedNodeData.dom", then open the Document called "TestDocument" that is in the file.

MyDocumentBuilderFactory documentBuilderFactory = ( MyDocumentBuilderFactory ) DocumentBuilderFactory.newInstance(); 
documentBuilderFactory.setStorageFileName( "/tmp/ParsedNodeData.dom" ); 
MyDocumentBuilder documentBuilder = ( MyDocumentBuilder ) documentBuilderFactory.newDocumentBuilder(); 
MyDOMImplementation domImplementation = ( MyDOMImplementation ) documentBuilder.getDOMImplementation(); 
MyDocument document = ( MyDocument ) domImplementation.getDocument( "TestDocument" );

Parse two XML files, "/tmp/File1.xml" and "/tmp/File2.xml", into two Documents, storing the parsed node data of both in the file "/tmp/ParsedNodeData.dom". Give the two Documents names of "Doc1" and "Doc2".

MyDocumentBuilderFactory documentBuilderFactory = ( MyDocumentBuilderFactory ) DocumentBuilderFactory.newInstance(); 
documentBuilderFactory.setStorageFileName( "/tmp/ParsedNodeData.dom" ); 
MyDocumentBuilder documentBuilder = ( MyDocumentBuilder ) documentBuilderFactory.newDocumentBuilder(); 
try 
{ 
  InputStream inputStream1 = new FileInputStream( "/tmp/File1.xml" ); 
  MyDocument document1 = ( MyDocument ) documentBuilder.parse( inputStream1 ); 
  document1.setDocumentName( "Doc1" ); 
  InputStream inputStream2 = new FileInputStream( "/tmp/File2.xml" ); 
  MyDocument document2 = ( MyDocument ) documentBuilder.parse( inputStream2 ); 
  document2.setDocumentName( "Doc2" ); 
} 
finally 
{ 
  documentBuilder.close(); 
}

Create a new empty Document with a name of "Doc3" and add an element to it. The Document will be written to a disk DOM file called "/tmp/ParsedNodeData.dom".

MyDocumentBuilderFactory documentBuilderFactory = ( MyDocumentBuilderFactory ) DocumentBuilderFactory.newInstance(); 
documentBuilderFactory.setStorageFileName( "/tmp/ParsedNodeData.dom" ); 
MyDocumentBuilder documentBuilder = ( MyDocumentBuilder ) documentBuilderFactory.newDocumentBuilder(); 
try 
{ 
  MyDocument document = ( MyDocument ) documentBuilder.newDocument(); 
  document.setDocumentName( "Doc3" ); 
  Element element = document.createElement( "TestElement" ); 
  document.appendChild( element ); 
} 
finally 
{ 
  documentBuilder.close(); 
}

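Parse two XML files, "/tmp/File1.xml" and "/tmp/File2.xml", into two Documents, storing their parsed node data in two separate disk DOM files called "/tmp/ParsedData1.dom" and "/tmp/ParsedData2.dom". Give the two Documents names of "Doc1" and "Doc2".

MyDocumentBuilderFactory documentBuilderFactory = ( MyDocumentBuilderFactory ) DocumentBuilderFactory.newInstance(); 
documentBuilderFactory.setStorageFileName( "/tmp/ParsedData1.dom" ); 
MyDocumentBuilder documentBuilder1 = ( MyDocumentBuilder ) documentBuilderFactory.newDocumentBuilder(); 
try 
{ 
  InputStream inputStream1 = new FileInputStream( "/tmp/File1.xml" ); 
  MyDocument document1 = ( MyDocument ) documentBuilder1.parse( inputStream1 ); 
  document1.setDocumentName( "Doc1" ); 
} 
finally 
{ 
  documentBuilder1.close(); 
}
// Change the storage filename before creating the second DocumentBuilder.
documentBuilderFactory.setStorageFileName( "/tmp/ParsedData2.dom" ); 
MyDocumentBuilder documentBuilder2 = ( MyDocumentBuilder ) documentBuilderFactory.newDocumentBuilder(); 
try 
{ 
  InputStream inputStream2 = new FileInputStream( "/tmp/File2.xml" ); 
  MyDocument document2 = ( MyDocument ) documentBuilder2.parse( inputStream2 ); 
  document2.setDocumentName( "Doc2" ); 
} 
finally 
{ 
  documentBuilder2.close(); 
}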

Parse the XML file "/tmp/TestData.xml" into the disk DOM file "/tmp/MyDOM/ParsedData.dom". Run the XSLT script "/tmp/Transform.xsl" on the Document, and output the transformed data into the XML file "/tmp/TransformedData.xml".

import java.io.*; 
import javax.xml.parsers.*; 
import javax.xml.transform.*; 
import javax.xml.transform.dom.*; 
import javax.xml.transform.stream.*; 
import org.w3c.dom.*; 
import au.com.explodingsheep.persistentDOM.*; 
import au.com.explodingsheep.persistentDOM.documentBuilder.*; 
public class ExampleTransform 
{ 
  public static void main( String args[] ) 
  { 
    try 
    { 
      MyDocumentBuilderFactory dbf = ( MyDocumentBuilderFactory ) DocumentBuilderFactory.newInstance(); 
      dbf.setStorageFileName( "/tmp/MyDOM/ParsedData.dom" ); 
      dbf.setIgnoringElementContentWhitespace( true ); 
      MyDocumentBuilder db = ( MyDocumentBuilder ) dbf.newDocumentBuilder(); 
      InputStream is = new FileInputStream( "/tmp/TestData.xml" ); 
      try 
      { 
        Document srcDocument = db.parse( is ); 
        TransformerFactory tFactory = TransformerFactory.newInstance(); 
        Source xslSource = new StreamSource( new FileInputStream( "/tmp/Transform.xsl" ) ); 
        Transformer transformer2 = tFactory.newTransformer( xslSource ); 
        DOMSource domSource = new DOMSource( srcDocument ); 
        StreamResult streamResult = new StreamResult( new FileOutputStream( "/tmp/TransformedData.xml" ) ); 
        transformer2.transform( domSource, streamResult ); 
      } 
      finally 
      { 
        is.close(); 
        db.close(); 
      } 
    } 
    catch ( Exception e ) 
    { 
      e.printStackTrace(); 
    } 
  } 
}

* Things to do next *

DOM level 2 compliance. Although all the methods are implemented I have not done any major testing on them.
Write an indexing system for elements and attributes so the Element method "getElementsByTagName" is quicker.
Write an indexing system for node values. This is outside the DOM specification but would provide a fast way of retrieving or updating nodes based on their values.
Clean up the DefaultDTDParser class - it works but is not very nice.
Clean up the ExperimentalNodeStore class - it works but is not very nice.
The "insert" and "delete" methods in SimpleFileSystemIntIdentifierList are very slow. Although the method works well enough I am not that happy with the design, and it is optimized for appending new nodes and retrieval, not inserting and deleting entries.
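
As an illustration of what a tag-name index could look like, the sketch below walks a Document once and groups its Elements by tag name, so later lookups do not need to scan the tree again. It is not the planned disk DOM design: "TagNameIndex" is a made-up class, it keeps the index in RAM, and it does not update itself when nodes are inserted or deleted. A real index for the disk DOM would presumably have to live in the disk DOM file and cope with both.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical sketch (not part of the disk DOM): a one-off, in-RAM index
// from tag name to the Elements with that name, in document order.
class TagNameIndex
{
  private final Map<String, List<Element>> elementsByTagName = new HashMap<String, List<Element>>();

  TagNameIndex( Node root )
  {
    index( root );
  }

  // Depth-first walk that records every Element under its tag name.
  private void index( Node node )
  {
    if ( node.getNodeType() == Node.ELEMENT_NODE )
    {
      Element element = ( Element ) node;
      List<Element> list = elementsByTagName.get( element.getTagName() );
      if ( list == null )
      {
        list = new ArrayList<Element>();
        elementsByTagName.put( element.getTagName(), list );
      }
      list.add( element );
    }
    for ( Node child = node.getFirstChild(); child != null; child = child.getNextSibling() )
    {
      index( child );
    }
  }

  // Returns the indexed Elements with this tag name (empty list if none).
  List<Element> lookup( String tagName )
  {
    List<Element> list = elementsByTagName.get( tagName );
    return list == null ? new ArrayList<Element>() : list;
  }

  // Builds a small sample Document: <root><a/><b/><a/></root>
  static Document sampleDocument()
  {
    try
    {
      Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      Element root = document.createElement( "root" );
      document.appendChild( root );
      root.appendChild( document.createElement( "a" ) );
      root.appendChild( document.createElement( "b" ) );
      root.appendChild( document.createElement( "a" ) );
      return document;
    }
    catch ( ParserConfigurationException e )
    {
      throw new IllegalStateException( e );
    }
  }

  public static void main( String args[] )
  {
    TagNameIndex index = new TagNameIndex( sampleDocument() );
    System.out.println( "Elements named a: " + index.lookup( "a" ).size() );
  }
}
```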

* Questions, comments, problems *

If you have any comments, suggestions, criticisms, personal insults etc, e-mail me anytime at caddydc@yahoo.com.au. If you discover any bugs or have problems let me know and I will try extra hard to fix them as quickly as possible.
