Class StreamingXMLReader

java.lang.Object
com.thirdpartylabs.xmlscalpel.io.reader.StreamingXMLReader

public class StreamingXMLReader
extends java.lang.Object
Streaming XML file reader that uses the Woodstox stream reader to extract top-level XML nodes, along with metadata describing their location in the XML file, and send them to an XMLStreamProcessor.

Because the file is read as a stream rather than loaded whole, large files can be processed without significant memory overhead.

  • Constructor Details

    • StreamingXMLReader

      public StreamingXMLReader() throws javax.xml.transform.TransformerConfigurationException
      Throws:
      javax.xml.transform.TransformerConfigurationException
  • Method Details

    • readFile

      public void readFile(java.io.File file, XMLStreamProcessor processor, java.util.List<java.lang.String> targetPaths) throws java.io.FileNotFoundException, javax.xml.stream.XMLStreamException, javax.xml.transform.TransformerException
      Read an XML file using the Woodstox streaming API and supply the XMLStreamProcessor with Fragment objects. Specify a List of node paths to extract. Example:
       
       <Feed>
           <Category>
               <Name>Bolts</Name>
               <Product>Large</Product>
               <Product>Small</Product>
               <Services>
                   <Service>Tightening</Service>
                   <Service>Loosening</Service>
               </Services>
           </Category>
           <Category>
               <Name>Hammers</Name>
               <Product>Framing</Product>
               <Product>Dead Blow</Product>
               <Services>
                   <Service>Banging</Service>
               </Services>
           </Category>
       </Feed>
       
       

      You can extract all Product and Service elements in the same read operation by passing in these paths:
      /Feed/Category/Product
      /Feed/Category/Services/Service

      Namespace prefixes may be specified as they appear in the XML: /aw:PurchaseOrders/aw:PurchaseOrder/aw:Address
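
      The path matching described above can be sketched with the JDK's built-in StAX reader standing in for Woodstox. This is a minimal illustration, not the library's implementation: PathMatchSketch, qualifiedName and extract are hypothetical names, and the sketch assumes the targeted elements contain text only. It tracks the open elements on a stack, builds the absolute path with prefixes as they appear in the XML, and records each element whose path is in the target set.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of xmlscalpel.
class PathMatchSketch {

    // Builds the name as it appears in the XML: "prefix:local" or just "local".
    static String qualifiedName(QName name) {
        String prefix = name.getPrefix();
        return prefix.isEmpty() ? name.getLocalPart() : prefix + ":" + name.getLocalPart();
    }

    // Returns "path=text" for every element whose absolute path is in targetPaths.
    // Assumption: targeted elements hold text only; getElementText() rejects child elements.
    static List<String> extract(String xml, Set<String> targetPaths) {
        List<String> hits = new ArrayList<>();
        Deque<String> openElements = new ArrayDeque<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    openElements.addLast(qualifiedName(reader.getName()));
                    String path = "/" + String.join("/", openElements);
                    if (targetPaths.contains(path)) {
                        hits.add(path + "=" + reader.getElementText());
                        // getElementText() stops on the matching END_ELEMENT,
                        // so the element is popped here rather than in the branch below.
                        openElements.removeLast();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    openElements.removeLast();
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped to keep the sketch short; real code would likely propagate it.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return hits;
    }

    public static void main(String[] args) {
        String xml = "<Feed><Category><Name>Bolts</Name><Product>Large</Product>"
                + "<Services><Service>Tightening</Service></Services></Category></Feed>";
        Set<String> targets = new HashSet<>(Arrays.asList(
                "/Feed/Category/Product", "/Feed/Category/Services/Service"));
        extract(xml, targets).forEach(System.out::println);
    }
}
```

      The sketch returns plain strings; the real reader hands Fragment objects to the XMLStreamProcessor instead.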

      Paths are absolute with respect to the document root. They are normalized to always have a leading slash and never a trailing slash. Overlapping paths are not supported; if paths overlap, the least specific path is used.
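
      The normalization and overlap rules can be illustrated with a small sketch. PathNormalizerSketch, normalize and dropOverlaps are hypothetical names, not part of the library, and the sketch assumes that "overlapping" means one path is an ancestor of another, in which case only the ancestor (the least specific path) is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of xmlscalpel.
class PathNormalizerSketch {

    // Ensures a leading slash and removes any trailing slashes.
    static String normalize(String path) {
        String p = path.trim();
        while (p.length() > 1 && p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        return p.startsWith("/") ? p : "/" + p;
    }

    // Assumption: two paths overlap when one is an ancestor of the other.
    // Descendant paths are dropped so that the least specific path wins.
    static List<String> dropOverlaps(List<String> paths) {
        List<String> normalized = new ArrayList<>();
        for (String path : paths) {
            String n = normalize(path);
            if (!normalized.contains(n)) {
                normalized.add(n);
            }
        }
        List<String> kept = new ArrayList<>();
        for (String candidate : normalized) {
            boolean covered = false;
            for (String other : normalized) {
                if (!other.equals(candidate) && candidate.startsWith(other + "/")) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Feed/Category/Product/"));
        System.out.println(dropOverlaps(Arrays.asList(
                "/Feed/Category", "Feed/Category/Product/", "/Feed/Other")));
    }
}
```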

      Each Fragment object wraps the DOM node as a DocumentFragment together with an XMLByteLocation object that describes the node's location in the XML file. This allows the nodes to be retrieved efficiently later using the RandomAccessXMLReader.

      Parameters:
      file - The XML file to process
      processor - XMLStreamProcessor instance
      targetPaths - List of node paths to target for extraction
      Throws:
      java.io.FileNotFoundException
      javax.xml.stream.XMLStreamException
      javax.xml.transform.TransformerException
    • readFile

      public void readFile(java.io.File file, XMLStreamProcessor processor) throws java.io.FileNotFoundException, javax.xml.stream.XMLStreamException, javax.xml.transform.TransformerException
      Read an XML file using the Woodstox streaming API and supply the XMLStreamProcessor with Fragment objects.

      All top-level elements, and only those, are returned. For example, given an XML file with a structure like:

       
       <feed>
        <product></product>
        <product></product>
        <product></product>
       </feed>
       
       

      All product nodes will be returned.
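
      A minimal sketch of this selection rule, reading "top level" as the direct children of the document element, as the example suggests. It uses the JDK's StAX reader rather than Woodstox, and TopLevelSketch and topLevelElementNames are hypothetical names. A depth counter is enough: the document element sits at depth 1, so elements opened at depth 2 are the ones reported.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of xmlscalpel.
class TopLevelSketch {

    // Returns the local names of the direct children of the document element.
    static List<String> topLevelElementNames(String xml) {
        List<String> names = new ArrayList<>();
        int depth = 0;
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) {
                        // Depth 1 is the document element; depth 2 is a top-level node.
                        names.add(reader.getLocalName());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped to keep the sketch short; real code would likely propagate it.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<feed><product><sku>1</sku></product><product></product><product></product></feed>";
        System.out.println(topLevelElementNames(xml));
    }
}
```

      Nested elements such as sku are skipped because they open at depth 3.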

      Each Fragment object wraps the DOM node as a DocumentFragment together with an XMLByteLocation object that describes the node's location in the XML file. This allows the nodes to be retrieved efficiently later using the RandomAccessXMLReader.

      Parameters:
      file - The XML file to process
      processor - XMLStreamProcessor instance
      Throws:
      java.io.FileNotFoundException
      javax.xml.stream.XMLStreamException
      javax.xml.transform.TransformerException
    • getDocumentElementAttributes

      public java.util.Map<java.lang.String,java.lang.String> getDocumentElementAttributes()
      Returns a map containing the attribute name-value pairs from the document element.
      Returns:
      Map<String, String> of attribute names to attribute values
    • getDocumentElementAttributeNamespaces

      public java.util.Map<java.lang.String,java.lang.String> getDocumentElementAttributeNamespaces()
      Returns a map containing the namespace prefix to URI pairs from the document element.
      Returns:
      Map<String, String> of namespace prefixes to namespace URIs
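
      As a rough illustration of what the two maps above contain, the following sketch collects the same kind of information with the JDK's StAX reader. DocumentElementSketch and its methods are hypothetical names, and the choices to key attributes by their prefixed name and to store the default namespace under an empty-string key are assumptions, not documented behavior of this class.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of xmlscalpel.
class DocumentElementSketch {

    // Opens a reader and advances it to the document element (the first START_ELEMENT).
    static XMLStreamReader openAtDocumentElement(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                return reader;
            }
        }
        throw new XMLStreamException("No document element found");
    }

    // Attribute name to value; namespace declarations are not reported as attributes.
    static Map<String, String> attributes(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = openAtDocumentElement(xml);
            for (int i = 0; i < reader.getAttributeCount(); i++) {
                String prefix = reader.getAttributePrefix(i);
                String local = reader.getAttributeLocalName(i);
                String name = (prefix == null || prefix.isEmpty()) ? local : prefix + ":" + local;
                result.put(name, reader.getAttributeValue(i));
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return result;
    }

    // Namespace prefix to URI; the default namespace is stored under "" (an assumption).
    static Map<String, String> namespaces(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = openAtDocumentElement(xml);
            for (int i = 0; i < reader.getNamespaceCount(); i++) {
                String prefix = reader.getNamespacePrefix(i);
                result.put(prefix == null ? "" : prefix, reader.getNamespaceURI(i));
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<aw:PurchaseOrders xmlns:aw=\"urn:example:aw\" version=\"2\">"
                + "<aw:PurchaseOrder/></aw:PurchaseOrders>";
        System.out.println(attributes(xml));
        System.out.println(namespaces(xml));
    }
}
```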
    • getDocumentElementTagName

      public java.lang.String getDocumentElementTagName()
      The local name of the document element tag
      Returns:
      The local name of the document element tag
    • getPrefix

      public java.lang.String getPrefix()
      Returns the prefix of the current event or null if the event does not have a prefix
      Returns:
      the prefix or null
    • getCharacterEncodingScheme

      public java.lang.String getCharacterEncodingScheme()
      Returns the character encoding declared in the XML declaration. Returns null if none was declared.
      Returns:
      the encoding declared in the document or null
      See Also:
      XMLStreamReader
    • getEncoding

      public java.lang.String getEncoding()
      Return input encoding if known or null if unknown.
      Returns:
      the encoding of this instance or null
      See Also:
      XMLStreamReader
    • getVersion

      public java.lang.String getVersion()
      Gets the XML version declared in the XML declaration. Returns null if none was declared.
      Returns:
      the XML version or null
      See Also:
      XMLStreamReader
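
      The three getters above mirror methods of the same names on javax.xml.stream.XMLStreamReader. The following sketch calls them on the JDK's StAX reader to show where the values come from; DeclarationSketch and describe are hypothetical names, and whether StreamingXMLReader simply delegates to its underlying reader is an assumption. Exact return values, especially for getEncoding, can differ between StAX implementations.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of xmlscalpel.
class DeclarationSketch {

    // Collects the XML declaration details exposed by XMLStreamReader.
    static Map<String, String> describe(String xml) {
        Map<String, String> info = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // A freshly created reader is positioned on START_DOCUMENT.
            info.put("version", reader.getVersion());
            info.put("declaredEncoding", reader.getCharacterEncodingScheme());
            info.put("inputEncoding", reader.getEncoding());
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped to keep the sketch short; real code would likely propagate it.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return info;
    }

    public static void main(String[] args) {
        System.out.println(describe("<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed/>"));
    }
}
```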
    • getEmptyDocument

      public org.w3c.dom.Document getEmptyDocument(java.io.File file) throws java.io.FileNotFoundException, javax.xml.stream.XMLStreamException, javax.xml.parsers.ParserConfigurationException, javax.management.modelmbean.XMLParseException
      Parameters:
      file - XML file to extract an empty document for
      Returns:
      Document containing only the document element from the file provided
      Throws:
      java.io.FileNotFoundException
      javax.xml.stream.XMLStreamException
      javax.xml.parsers.ParserConfigurationException
      javax.management.modelmbean.XMLParseException
    • getEmptyDocument

      public org.w3c.dom.Document getEmptyDocument() throws javax.xml.parsers.ParserConfigurationException, javax.management.modelmbean.XMLParseException
      Returns:
      Document containing only the document element from the last file provided to this instance of StreamingXMLReader
      Throws:
      javax.xml.parsers.ParserConfigurationException
      javax.management.modelmbean.XMLParseException
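
      To make "a Document containing only the document element" concrete, here is a minimal sketch built from JDK classes only. EmptyDocumentSketch and emptyDocument are hypothetical names, and the approach (stream to the first start tag, then recreate that element with its attributes and namespace declarations in a new DOM Document) is one plausible way to obtain such a document, not necessarily how this class does it.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration only; not part of xmlscalpel.
class EmptyDocumentSketch {

    static String qualified(String prefix, String localName) {
        return (prefix == null || prefix.isEmpty()) ? localName : prefix + ":" + localName;
    }

    static String emptyToNull(String value) {
        return (value == null || value.isEmpty()) ? null : value;
    }

    // Builds a DOM Document holding only a childless copy of the document element.
    static Document emptyDocument(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            // Stop at the first start tag; nothing after it needs to be read.
            while (reader.hasNext() && reader.next() != XMLStreamConstants.START_ELEMENT) {
                // skip comments, processing instructions and whitespace before the root
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder().newDocument();
            Element root = document.createElementNS(
                    emptyToNull(reader.getNamespaceURI()),
                    qualified(reader.getPrefix(), reader.getLocalName()));
            // Copy namespace declarations as xmlns / xmlns:prefix attributes.
            for (int i = 0; i < reader.getNamespaceCount(); i++) {
                String uri = reader.getNamespaceURI(i);
                root.setAttributeNS(XMLConstants.XMLNS_ATTRIBUTE_NS_URI,
                        qualified("xmlns", reader.getNamespacePrefix(i)).replaceAll(":$", ""),
                        uri == null ? "" : uri);
            }
            // Copy ordinary attributes.
            for (int i = 0; i < reader.getAttributeCount(); i++) {
                root.setAttributeNS(emptyToNull(reader.getAttributeNamespace(i)),
                        qualified(reader.getAttributePrefix(i), reader.getAttributeLocalName(i)),
                        reader.getAttributeValue(i));
            }
            document.appendChild(root);
            reader.close();
            return document;
        } catch (XMLStreamException | ParserConfigurationException e) {
            // Wrapped to keep the sketch short; real code would likely propagate these.
            throw new IllegalArgumentException("Could not build the empty document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = emptyDocument("<aw:PurchaseOrders xmlns:aw=\"urn:example:aw\" version=\"2\">"
                + "<aw:PurchaseOrder/></aw:PurchaseOrders>");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

      Because the sketch stops reading at the first start tag, the cost does not depend on the size of the rest of the file.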
    • getOuterDocument

      public OuterDocument getOuterDocument(java.io.File file) throws java.lang.Exception
      Parameters:
      file - XML file to parse into an OuterDocument
      Returns:
      OuterDocument wrapping an empty Document that contains only the document element from the file provided
      Throws:
      java.lang.Exception
    • getOuterDocument

      public OuterDocument getOuterDocument() throws java.lang.Exception
      Returns:
      OuterDocument wrapping an empty Document that contains only the document element from the last file provided to this instance of StreamingXMLReader
      Throws:
      java.lang.Exception