package eu.cdevreeze.yaidom.parse

Support for parsing XML into yaidom Documents and Elems. This package offers the eu.cdevreeze.yaidom.parse.DocumentParser trait, as well as several implementations. Those implementations use JAXP (SAX, DOM or StAX), and most of them use the convert package to convert JAXP artifacts to yaidom Documents.

For example:

val docParser = DocumentParserUsingSax.newInstance()

val doc: Document = docParser.parse(docUri)

This example uses a SAX-based implementation with that document parser's default configuration.

Offering several fully configurable JAXP-based implementations reflects two assumptions. First, yaidom is pessimistic about the transparency of parsing and printing XML. Second, yaidom is optimistic about the available (heap) memory and processing power, because most implementations work in two separate steps: JAXP parsing/printing, followed by an in-memory conversion using the convert package. Using JAXP also means that character escaping is handled by JAXP, which is definitely better than trying to do it yourself.

One DocumentParser implementation, DocumentParserUsingSax, does not use any convert conversion. It is likely the fastest of the DocumentParser implementations.

The preferred DocumentParser for XML (not HTML) parsing is DocumentParserUsingDomLS, if memory usage is not an issue. This DocumentParser implementation is best integrated with DOM, and is highly configurable, although DOM LS configuration is somewhat involved.

This package depends on the eu.cdevreeze.yaidom.core, eu.cdevreeze.yaidom.queryapi, eu.cdevreeze.yaidom.simple and eu.cdevreeze.yaidom.convert packages, and not the other way around.

Type Members

  1. abstract class AbstractDocumentParser extends DocumentParser

    Partial DocumentParser implementation, leaving only one of the parse methods abstract.

  2. trait DefaultElemProducingSaxHandler extends DefaultHandler with ElemProducingSaxHandler with LexicalHandler

    Default eu.cdevreeze.yaidom.parse.ElemProducingSaxHandler implementation.

    This is a trait instead of a class, so it is easy to mix in EntityResolvers, ErrorHandlers, etc.

    Annotations
    @NotThreadSafe()
  3. trait DocumentParser extends AnyRef

    eu.cdevreeze.yaidom.simple.Document parser. This trait is purely abstract.

    Implementing classes deal with the details of parsing XML strings/streams into yaidom Documents. The eu.cdevreeze.yaidom.simple package itself is agnostic of those details.

    Typical implementations use DOM, StAX or SAX, but make them easier to use in the tradition of the "template" classes of the Spring framework. That is, resource management is done as much as possible by the DocumentParser, typical usage is easy, and complex scenarios are still possible. The idea is that the parser is configured once, and that it should be re-usable multiple times.

    One of the parse methods takes an InputStream instead of a Source object, because that works better with a DOM implementation.

    Although DocumentParser instances should be re-usable multiple times, implementing classes are encouraged to indicate to what extent re-use of a parser instance is indeed supported (single-threaded, or even multi-threaded).

  4. final class DocumentParserUsingDom extends AbstractDocumentParser

    DOM-based Document parser.

    Typical non-trivial creation is as follows, assuming class MyEntityResolver, which extends EntityResolver, and class MyErrorHandler, which extends ErrorHandler:

    val dbf = DocumentBuilderFactory.newInstance()
    dbf.setNamespaceAware(true)
    
    def createDocumentBuilder(dbf: DocumentBuilderFactory): DocumentBuilder = {
      val db = dbf.newDocumentBuilder()
      db.setEntityResolver(new MyEntityResolver)
      db.setErrorHandler(new MyErrorHandler)
      db
    }
    
    val docParser = DocumentParserUsingDom.newInstance(dbf, createDocumentBuilder _)

    If we want the DocumentBuilderFactory to be a validating one, using an XML Schema, we could obtain the DocumentBuilderFactory as follows:

    val schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    val schemaSource = new StreamSource(new File(pathToSchema))
    val schema = schemaFactory.newSchema(schemaSource)
    
    val dbf = {
      val result = DocumentBuilderFactory.newInstance()
      result.setNamespaceAware(true)
      result.setSchema(schema)
      result
    }
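
    To illustrate what a schema-validating DocumentBuilderFactory does at parse time, here is a minimal sketch that uses plain JAXP only (no yaidom). The inline schema, the object name SchemaValidationExample and the CollectingErrorHandler class are hypothetical and exist only for this illustration. Validation problems are reported to the ErrorHandler as recoverable errors, so the sketch collects them instead of printing them.

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import org.xml.sax.{ErrorHandler, InputSource, SAXParseException}
import scala.collection.mutable.ListBuffer

object SchemaValidationExample {
  // Hypothetical schema: one root element "count" holding an integer
  val xsd: String =
    """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      |  <xs:element name="count" type="xs:int"/>
      |</xs:schema>""".stripMargin

  // Collects recoverable (validation) errors; fatal errors are rethrown
  final class CollectingErrorHandler extends ErrorHandler {
    val errors: ListBuffer[SAXParseException] = ListBuffer.empty
    override def warning(exc: SAXParseException): Unit = ()
    override def error(exc: SAXParseException): Unit = { errors += exc; () }
    override def fatalError(exc: SAXParseException): Unit = throw exc
  }

  // Parses the XML string with schema validation and returns the number of validation errors
  def validationErrorCount(xml: String): Int = {
    val schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    val schema = schemaFactory.newSchema(new StreamSource(new StringReader(xsd)))

    val dbf = DocumentBuilderFactory.newInstance()
    dbf.setNamespaceAware(true)
    dbf.setSchema(schema)

    val db = dbf.newDocumentBuilder()
    val handler = new CollectingErrorHandler
    db.setErrorHandler(handler)
    db.parse(new InputSource(new StringReader(xml)))
    handler.errors.size
  }
}
```

    A factory configured like this one could be passed to DocumentParserUsingDom.newInstance in the same way as the non-validating factory above.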

    A custom EntityResolver could be used to retrieve DTDs locally, or even to suppress DTD resolution. The latter can be coded as follows (see http://stuartsierra.com/2008/05/08/stop-your-java-sax-parser-from-downloading-dtds), risking some loss of information:

    class MyEntityResolver extends EntityResolver {
      override def resolveEntity(publicId: String, systemId: String): InputSource = {
        // This dirty hack may not work on IBM JVMs
        new InputSource(new java.io.StringReader(""))
      }
    }
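
    The effect of such a resolver can be tried out in isolation. The following sketch uses plain JAXP only (no yaidom). The DTD location is a made-up address that is not expected to resolve, and the names SuppressDtdExample and EmptyEntityResolver are hypothetical. Because the resolver answers every request with empty content, the parser should not need to fetch the DTD at all.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.{EntityResolver, InputSource}

object SuppressDtdExample {
  // Answers every external entity request with empty content
  final class EmptyEntityResolver extends EntityResolver {
    override def resolveEntity(publicId: String, systemId: String): InputSource =
      new InputSource(new StringReader(""))
  }

  // Hypothetical input whose DOCTYPE points to a DTD that cannot be downloaded
  val xml: String =
    """<!DOCTYPE root SYSTEM "http://example.invalid/root.dtd"><root><child/></root>"""

  // Parses the input with DTD resolution suppressed and returns the root element name
  def parseRootName(): String = {
    val dbf = DocumentBuilderFactory.newInstance()
    dbf.setNamespaceAware(true)
    val db = dbf.newDocumentBuilder()
    db.setEntityResolver(new EmptyEntityResolver)
    val doc = db.parse(new InputSource(new StringReader(xml)))
    doc.getDocumentElement.getLocalName
  }
}
```

    The loss of information mentioned above is real: anything the DTD would have contributed, such as entity declarations or default attribute values, is no longer available to the parser.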

    For completeness, a custom ErrorHandler class that simply prints parse exceptions to standard output:

    class MyErrorHandler extends ErrorHandler {
      def warning(exc: SAXParseException): Unit = { println(exc) }
      def error(exc: SAXParseException): Unit = { println(exc) }
      def fatalError(exc: SAXParseException): Unit = { println(exc) }
    }
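
    A handler that only prints lets parsing continue after recoverable errors. Where that is not desirable, a stricter variant can rethrow them. The sketch below is one possible shape, not something offered by yaidom itself; the class name StrictErrorHandler is hypothetical.

```scala
import org.xml.sax.{ErrorHandler, SAXParseException}

// Prints warnings, but turns recoverable and fatal errors into thrown exceptions
final class StrictErrorHandler extends ErrorHandler {
  override def warning(exc: SAXParseException): Unit = println(s"Warning: $exc")
  override def error(exc: SAXParseException): Unit = throw exc
  override def fatalError(exc: SAXParseException): Unit = throw exc
}
```

    It would be installed in the same place as MyErrorHandler, that is, via db.setErrorHandler in the createDocumentBuilder function.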

    If more flexibility is needed in configuring the DocumentParser than offered by this class, consider writing a wrapper DocumentParser which wraps a DocumentParserUsingDom, but adapts the parse method. This would make it possible to adapt the conversion from a DOM Document to yaidom Document, for example.

    A DocumentParserUsingDom instance can be re-used multiple times, from the same thread. If the DocumentBuilderFactory is thread-safe, it can even be re-used from multiple threads. Typically a DocumentBuilderFactory cannot be trusted to be thread-safe, however. In a web application, one (safe) way to deal with that is to use one DocumentBuilderFactory instance per request.

  5. final class DocumentParserUsingDomLS extends AbstractDocumentParser

    DOM-LS-based Document parser.

    Typical non-trivial creation is as follows, assuming class MyEntityResolver, which extends LSResourceResolver, and class MyErrorHandler, which extends DOMErrorHandler:

    def createParser(domImplLS: DOMImplementationLS): LSParser = {
      val parser = domImplLS.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null)
      parser.getDomConfig.setParameter("resource-resolver", new MyEntityResolver)
      parser.getDomConfig.setParameter("error-handler", new MyErrorHandler)
      parser
    }
    
    val domParser = DocumentParserUsingDomLS.newInstance().withParserCreator(createParser _)

    A custom LSResourceResolver could be used to retrieve DTDs locally, or even to suppress DTD resolution. The latter can be coded as follows (see http://stuartsierra.com/2008/05/08/stop-your-java-sax-parser-from-downloading-dtds), risking some loss of information:

    // Assumes that a DOMImplementationLS value named domImplLS is in scope
    class MyEntityResolver extends LSResourceResolver {
      override def resolveResource(tpe: String, namespaceURI: String, publicId: String, systemId: String, baseURI: String): LSInput = {
        val input = domImplLS.createLSInput()
        // This dirty hack may not work on IBM JVMs
        input.setCharacterStream(new java.io.StringReader(""))
        input
      }
    }

    For completeness, a custom DOMErrorHandler class that simply throws an exception:

    class MyErrorHandler extends DOMErrorHandler {
      override def handleError(exc: DOMError): Boolean = {
        sys.error(exc.toString)
      }
    }

    If more flexibility is needed in configuring the DocumentParser than offered by this class, consider writing a wrapper DocumentParser which wraps a DocumentParserUsingDomLS, but adapts the parse method. This would make it possible to set an encoding on the LSInput, for example. As another example, this would allow for adapting the conversion from a DOM Document to yaidom Document.

    A DocumentParserUsingDomLS instance can be re-used multiple times, from the same thread. If the DOMImplementationLS is thread-safe, it can even be re-used from multiple threads. Typically a DOMImplementationLS cannot be trusted to be thread-safe, however. In a web application, one (safe) way to deal with that is to use one DOMImplementationLS instance per request.

  6. final class DocumentParserUsingSax extends AbstractDocumentParser

    SAX-based Document parser.

    Typical non-trivial creation is as follows, assuming a trait MyEntityResolver, which extends EntityResolver, and a trait MyErrorHandler, which extends ErrorHandler:

    val spf = SAXParserFactory.newInstance()
    spf.setFeature("http://xml.org/sax/features/namespaces", true)
    spf.setFeature("http://xml.org/sax/features/namespace-prefixes", true)
    
    val parser = DocumentParserUsingSax.newInstance(
      spf,
      () => new DefaultElemProducingSaxHandler with MyEntityResolver with MyErrorHandler
    )
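
    The two SAX features set above control how namespaces are reported to the handler. As a minimal sketch using plain JAXP only (no yaidom), the following code shows that with namespace-prefixes enabled, namespace declarations are reported as attributes alongside the regular ones. The object name NamespacePrefixesExample and the sample input are hypothetical.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

object NamespacePrefixesExample {
  // Returns the qualified names of all attributes reported by the SAX parser
  def attributeQNames(xml: String): List[String] = {
    val spf = SAXParserFactory.newInstance()
    spf.setFeature("http://xml.org/sax/features/namespaces", true)
    spf.setFeature("http://xml.org/sax/features/namespace-prefixes", true)

    val names = ListBuffer.empty[String]
    val handler = new DefaultHandler {
      override def startElement(uri: String, localName: String, qName: String, attrs: Attributes): Unit = {
        // With namespace-prefixes enabled, xmlns declarations show up here as well
        for (i <- 0 until attrs.getLength) names += attrs.getQName(i)
      }
    }

    spf.newSAXParser().parse(new InputSource(new StringReader(xml)), handler)
    names.toList
  }
}
```

    For the input <a xmlns:p="urn:x" p:b="1"/>, both the declaration xmlns:p and the attribute p:b are reported.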

    If we want the SAXParserFactory to be a validating one, using an XML Schema, we could obtain the SAXParserFactory as follows:

    val schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    val schemaSource = new StreamSource(new File(pathToSchema))
    val schema = schemaFactory.newSchema(schemaSource)
    
    val spf = {
      val result = SAXParserFactory.newInstance()
      result.setFeature("http://xml.org/sax/features/namespaces", true)
      result.setFeature("http://xml.org/sax/features/namespace-prefixes", true)
      result.setSchema(schema)
      result
    }

    A custom EntityResolver could be used to retrieve DTDs locally, or even to suppress DTD resolution. The latter can be coded as follows (see http://stuartsierra.com/2008/05/08/stop-your-java-sax-parser-from-downloading-dtds), risking some loss of information:

    trait MyEntityResolver extends EntityResolver {
      override def resolveEntity(publicId: String, systemId: String): InputSource = {
        // This dirty hack may not work on IBM JVMs
        new InputSource(new java.io.StringReader(""))
      }
    }

    For completeness, a custom ErrorHandler trait that simply prints parse exceptions to standard output:

    trait MyErrorHandler extends ErrorHandler {
      override def warning(exc: SAXParseException): Unit = { println(exc) }
      override def error(exc: SAXParseException): Unit = { println(exc) }
      override def fatalError(exc: SAXParseException): Unit = { println(exc) }
    }

    It is even possible to parse HTML (including very poor HTML) into well-formed Documents by using a SAXParserFactory from the TagSoup library. For example:

    val parser = DocumentParserUsingSax.newInstance(new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl)

    If more flexibility is needed in configuring the DocumentParser than offered by this class, consider writing a wrapper DocumentParser which wraps a DocumentParserUsingSax, but adapts the parse method. This would make it possible to set additional properties on the XML Reader, for example.

    As can be seen above, parsing is based on the JAXP SAXParserFactory instead of the SAX 2.0 XMLReaderFactory.

    A DocumentParserUsingSax instance can be re-used multiple times, from the same thread. If the SAXParserFactory is thread-safe, it can even be re-used from multiple threads. Typically a SAXParserFactory cannot be trusted to be thread-safe, however. In a web application, one (safe) way to deal with that is to use one SAXParserFactory instance per request.

  7. final class DocumentParserUsingStax extends AbstractDocumentParser

    StAX-based Document parser.

    Typical non-trivial creation is as follows, assuming a class MyXmlResolver, which extends XMLResolver, and a class MyXmlReporter, which extends XMLReporter:

    val xmlInputFactory = XMLInputFactory.newFactory()
    xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.TRUE)
    xmlInputFactory.setXMLResolver(new MyXmlResolver)
    xmlInputFactory.setXMLReporter(new MyXmlReporter)
    
    val docParser = DocumentParserUsingStax.newInstance(xmlInputFactory)
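
    The IS_COALESCING property asks the StAX parser to merge adjacent character data, including CDATA sections, into a single text event. The following sketch uses plain StAX only (no yaidom) and collects the text events the parser reports, so the effect of the property can be inspected. The object name CoalescingExample is hypothetical, and how text is split when coalescing is off depends on the StAX implementation.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable.ListBuffer

object CoalescingExample {
  // Returns the text of each CHARACTERS or CDATA event, in document order
  def textEvents(xml: String, coalescing: Boolean): List[String] = {
    val factory = XMLInputFactory.newFactory()
    factory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.valueOf(coalescing))

    val reader = factory.createXMLStreamReader(new StringReader(xml))
    val texts = ListBuffer.empty[String]
    try {
      while (reader.hasNext) {
        val eventType = reader.next()
        if (eventType == XMLStreamConstants.CHARACTERS || eventType == XMLStreamConstants.CDATA) {
          texts += reader.getText
        }
      }
    } finally {
      reader.close()
    }
    texts.toList
  }
}
```

    Calling CoalescingExample.textEvents("<a>x<![CDATA[y]]>z</a>", coalescing = true) and comparing the result with coalescing = false shows how the chosen StAX implementation splits the text.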

    A custom XMLResolver could be used to retrieve DTDs locally, or even to suppress DTD resolution. The latter can be coded as follows (compare with http://stuartsierra.com/2008/05/08/stop-your-java-sax-parser-from-downloading-dtds), risking some loss of information:

    class MyXmlResolver extends XMLResolver {
      override def resolveEntity(publicId: String, systemId: String, baseUri: String, namespace: String): Any = {
        // This dirty hack may not work on IBM JVMs
        new java.io.StringReader("")
      }
    }

    A trivial XMLReporter could look like this:

    class MyXmlReporter extends XMLReporter {
      override def report(message: String, errorType: String, relatedInformation: AnyRef, location: Location): Unit = {
        println("Location: %s. Error type: %s. Message: %s.".format(location, errorType, message))
      }
    }

    If more flexibility is needed in configuring the DocumentParser than offered by this class, consider writing a wrapper DocumentParser which wraps a DocumentParserUsingStax, but adapts the parse method. This would make it possible to adapt the conversion from StAX events to yaidom Document, for example.

    A DocumentParserUsingStax instance can be re-used multiple times, from the same thread. If the XMLInputFactory is thread-safe, it can even be re-used from multiple threads. Typically an XMLInputFactory cannot be trusted to be thread-safe, however. In a web application, one (safe) way to deal with that is to use one XMLInputFactory instance per request.

  8. trait ElemProducingSaxHandler extends DefaultHandler

    Contract of a SAX ContentHandler that, once ready, can be asked for the resulting eu.cdevreeze.yaidom.simple.Elem using method resultingElem, or the resulting eu.cdevreeze.yaidom.simple.Document using method resultingDocument.

  9. trait SaxHandlerWithLocator extends DefaultHandler

    Mixin extending DefaultHandler that contains a Locator. Typically this Locator is used by an ErrorHandler mixed in after this trait.

    Annotations
    @NotThreadSafe()
  10. final class ThreadLocalDocumentParser extends AbstractDocumentParser

    Thread-local DocumentParser. This class exists because typical JAXP factory objects (DocumentBuilderFactory etc.) are not thread-safe, yet expensive to create. With this DocumentParser facade, backed by a thread-local DocumentParser, a ThreadLocalDocumentParser can be created once and re-used freely, without having to worry about thread-safety issues.

    Note that each ThreadLocalDocumentParser instance (!) has its own thread-local document parser. Typically it makes no sense to have more than one ThreadLocalDocumentParser instance in one application. In a Spring application, for example, a single instance of a ThreadLocalDocumentParser can be configured.
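
    The underlying idea can be sketched with plain JAXP and java.lang.ThreadLocal. This is only an illustration of the thread-local pattern, not the actual implementation of ThreadLocalDocumentParser; the object name ThreadLocalBuilderSketch and the method rootName are hypothetical.

```scala
import java.io.StringReader
import javax.xml.parsers.{DocumentBuilder, DocumentBuilderFactory}
import org.xml.sax.InputSource

object ThreadLocalBuilderSketch {
  // One DocumentBuilder per thread, created lazily on first use within that thread
  private val builderPerThread: ThreadLocal[DocumentBuilder] =
    new ThreadLocal[DocumentBuilder] {
      override protected def initialValue(): DocumentBuilder = {
        val dbf = DocumentBuilderFactory.newInstance()
        dbf.setNamespaceAware(true)
        dbf.newDocumentBuilder()
      }
    }

  // Safe to call from any thread: each thread only ever touches its own builder
  def rootName(xml: String): String = {
    val doc = builderPerThread.get().parse(new InputSource(new StringReader(xml)))
    doc.getDocumentElement.getLocalName
  }
}
```

    The expensive factory and builder are created at most once per thread, while callers share a single entry point.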
