TaxonomyBaseFactoryFromRemoteZip

final class TaxonomyBaseFactoryFromRemoteZip(val createZipInputStream: () => ZipInputStream, val transformDocument: SaxonDocument => BackingDocumentApi)

TaxonomyBase factory from a remote (or local) taxonomy package ZIP file. The ZIP does not have to be a taxonomy package with a META-INF/taxonomyPackage.xml file, but it does need a META-INF/catalog.xml file. Moreover, this catalog.xml file must be invertible, so that there is exactly one original URI per mapped URI. The catalog must also be consistent with the document URIs found during DTS discovery; otherwise loading the documents whose URIs were found during DTS discovery will likely fail.

Another thing to keep in mind is that each XML file in the ZIP stream is parsed, even if most of those files are not part of the DTS. Indeed, DTS discovery is done only after all XML documents have been loaded. This may be impractical if the ZIP stream contains far too many ZIP entries, for example for unused taxonomy versions or for parts of the taxonomy that fall outside the DTSes we are interested in.

So if the catalog is not invertible, or the ZIP contains far more documents than required, this TaxonomyBase factory is not usable.
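The invertibility requirement can be illustrated with plain Maps. This is a hedged sketch, not the SimpleCatalog API: a catalog is modeled as a Map from original URI prefixes to rewrite prefixes (like an XML catalog's rewriteURI entries), and it is invertible precisely when no two original prefixes share the same rewrite prefix.

```scala
import scala.collection.immutable.ListMap

// Illustrative model of a catalog: original URI prefix -> rewritten (local) prefix.
// Invertible means each rewritten prefix has exactly one original prefix,
// so mapped URIs can be translated back to unique original URIs.
def isInvertible(catalog: ListMap[String, String]): Boolean =
  catalog.values.toSeq.distinct.size == catalog.size

val invertibleCatalog = ListMap(
  "http://www.example.com/taxonomy/" -> "taxonomy/",
  "http://www.example.org/other/" -> "other/"
)

// Two original prefixes mapping to the same local prefix: not invertible.
val nonInvertibleCatalog = ListMap(
  "http://www.example.com/taxonomy/" -> "taxonomy/",
  "http://mirror.example.com/taxonomy/" -> "taxonomy/"
)
```

With the second catalog, a document stored under `taxonomy/` has two candidate original URIs, so its original URI cannot be reconstructed unambiguously.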

Authors

Chris de Vreeze

Companion
object
Supertypes
class Object
trait Matchable
class Any

Value members

Concrete methods

def findDtsUris(entrypointUris: Set[URI], allTaxoDocs: Seq[SaxonDocument]): Set[URI]

Finds all URIs in the DTS, given the entrypoint URIs passed. A superset of the DTS, as a document collection, is passed as the second parameter. If the resulting URI set is not a subset of the URIs of that document collection, an exception is thrown.
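The subset check described above can be sketched with stdlib code only. The function name and error message below are illustrative, not the real implementation:

```scala
import java.net.URI

// Hedged sketch: the computed DTS URI set must be covered by the URIs of
// the parsed documents, or else an exception is thrown.
def checkDtsUris(dtsUris: Set[URI], docUris: Set[URI]): Set[URI] = {
  require(
    dtsUris.subsetOf(docUris),
    s"Missing documents for URIs: ${dtsUris.diff(docUris)}")
  dtsUris
}

val docUris = Set(
  URI.create("http://www.example.com/a.xsd"),
  URI.create("http://www.example.com/b.xml"))

val dtsUris = Set(URI.create("http://www.example.com/a.xsd"))
```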

def loadDts(entrypointUris: Set[URI]): TaxonomyBase

Loads a taxonomy as TaxonomyBase from the given entrypoint URIs. This method calls method readAllXmlDocuments, and then calls the other overloaded loadDts method (which no longer needs the ZIP input stream).

def loadDts(entrypointUris: Set[URI], xmlByteArrays: ListMap[String, ArraySeq[Byte]]): TaxonomyBase

Loads a taxonomy as TaxonomyBase from the given entrypoint URIs. This is the method that calls all the other methods of this class, except for method readAllXmlDocuments (and, of course, the overloaded loadDts method).

The second parameter is the Map from ZIP entry names to immutable byte arrays. The ZIP entry names are assumed to use Unix-style (file component) separators.
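The shape of that second parameter can be sketched as follows; the entry names and contents are made up for illustration:

```scala
import scala.collection.immutable.{ArraySeq, ListMap}

// Illustrative shape of the parameter: ZIP entry names with Unix-style '/'
// separators, mapped to immutable byte arrays of file content.
// ListMap preserves the insertion (ZIP stream) order of the entries.
val xmlByteArrays: ListMap[String, ArraySeq[Byte]] = ListMap(
  "META-INF/catalog.xml" ->
    ArraySeq.unsafeWrapArray("<catalog/>".getBytes("UTF-8")),
  "taxonomy/entrypoint.xsd" ->
    ArraySeq.unsafeWrapArray("<schema/>".getBytes("UTF-8"))
)
```

Using immutable ArraySeq values (rather than raw arrays) means the collection can later be shared freely across threads.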

def locateAndParseCatalog(fileDataCollection: ListMap[String, ArraySeq[Byte]]): SimpleCatalog

Finds the "META-INF/catalog.xml" file in the file data collection, throws an exception if it is not found, and parses the catalog data by calling method "parseCatalog". The file data collection uses the ZIP entry names as Map keys, using Unix-style (file component) separators.
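The lookup part can be sketched in a few lines of stdlib Scala (the function name and error message are hypothetical):

```scala
import scala.collection.immutable.{ArraySeq, ListMap}

// Hedged sketch: locate the catalog data by its fixed ZIP entry name,
// failing fast if the entry is absent.
def locateCatalogData(
    fileDataCollection: ListMap[String, ArraySeq[Byte]]): ArraySeq[Byte] =
  fileDataCollection.getOrElse(
    "META-INF/catalog.xml",
    sys.error("Missing META-INF/catalog.xml in the ZIP stream"))

val files = ListMap(
  "META-INF/catalog.xml" ->
    ArraySeq.unsafeWrapArray("<catalog/>".getBytes("UTF-8")),
  "taxonomy/entrypoint.xsd" ->
    ArraySeq.unsafeWrapArray("<schema/>".getBytes("UTF-8")))
```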

def parseAllTaxonomyDocuments(fileDataCollection: ListMap[String, ArraySeq[Byte]], catalog: SimpleCatalog): IndexedSeq[SaxonDocument]

Parses all (taxonomy) documents, without knowing the DTS yet. It is required that the passed XML catalog is invertible, or else this method does not work. After calling this function, all documents are available for computing the DTS and creating the taxonomy (from a subset of those documents).

The first parameter is the Map from ZIP entry names to immutable byte arrays. The ZIP entry names are assumed to use Unix-style (file component) separators.

Implementation note: this function parses many documents in parallel, for speed.
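The "parse many documents in parallel" idea can be sketched with Futures on a thread pool. This is a generic hedged sketch, not the actual implementation; real Saxon document building is replaced by an arbitrary parse function:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hedged sketch: map each input to a parsed result on the global thread
// pool, then wait for all results. Future.sequence preserves input order.
def parseAllInParallel[A, B](inputs: Seq[A])(parse: A => B): Seq[B] = {
  implicit val ec: ExecutionContext = ExecutionContext.global
  val futures = inputs.map(a => Future(parse(a)))
  Await.result(Future.sequence(futures), Duration.Inf)
}

// Trivial stand-in for document parsing, just to show the shape.
val doubled = parseAllInParallel(Seq(1, 2, 3))(_ * 2)
```

Because the inputs are immutable byte arrays (see readAllXmlDocuments), the parse function can safely run on multiple threads without coordination.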

def parseCatalog(fileData: ArraySeq[Byte]): SimpleCatalog

Parses the catalog file data (as an immutable byte array) into a SimpleCatalog. The returned catalog has the relative URI "META-INF/catalog.xml" as its document URI.
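To make the catalog format concrete, here is a hedged sketch that extracts rewriteURI entries from catalog bytes with the JDK DOM parser. The real SimpleCatalog is richer than the plain Map produced here, and the catalog content shown is made up:

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Illustrative catalog.xml with one rewriteURI entry, the shape an XBRL
// taxonomy package catalog typically has.
val catalogXml: Array[Byte] =
  """<?xml version="1.0" encoding="UTF-8"?>
    |<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    |  <rewriteURI uriStartString="http://www.example.com/taxonomy/" rewritePrefix="../taxonomy/"/>
    |</catalog>""".stripMargin.getBytes("UTF-8")

// Hedged sketch: collect (uriStartString -> rewritePrefix) pairs.
def parseRewriteEntries(data: Array[Byte]): Map[String, String] = {
  val dbf = DocumentBuilderFactory.newInstance()
  dbf.setNamespaceAware(true)
  val doc = dbf.newDocumentBuilder().parse(new ByteArrayInputStream(data))
  val nodes = doc.getElementsByTagNameNS(
    "urn:oasis:names:tc:entity:xmlns:xml:catalog", "rewriteURI")
  (0 until nodes.getLength).map { i =>
    val e = nodes.item(i).asInstanceOf[Element]
    e.getAttribute("uriStartString") -> e.getAttribute("rewritePrefix")
  }.toMap
}

val rewriteEntries = parseRewriteEntries(catalogXml)
```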

def readAllXmlDocuments(): ListMap[String, ArraySeq[Byte]]

Reads all XML documents in the ZIP stream into memory, not as parsed DOM trees but as immutable byte arrays. More precisely, the result is a Map from ZIP entry names (following the ZipEntry.getName format on Unix) to immutable ArraySeq collections of bytes. Again, the ZIP entry names are assumed to use Unix-style (file component) separators.

Given this result, other code can safely turn the collection into a parallel collection of parsed XML documents, and it can also first grab the catalog.xml content and use it to compute the (original) URIs of the other documents.

Admittedly, reading all XML files in the ZIP stream into memory may have quite a large memory footprint, which is garbage-collected some time later. On the other hand, code that exploits parallelism with the result of this function as input can be quite simple to reason about (note the immutable byte arrays), without introducing any Futures and later blocking on them. Moreover, if we cannot even temporarily hold all files in byte arrays, are we sure we have enough memory for whatever the program does with the loaded taxonomy or taxonomies?
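The reading step can be sketched with only the JDK ZIP classes and Scala immutable collections. This is a hedged sketch under the same assumptions as the method above (entry names come from ZipEntry.getName, which uses '/' separators):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}
import scala.collection.immutable.{ArraySeq, ListMap}

// Hedged sketch: read every (non-directory) ZIP entry into an immutable
// byte array, keyed by entry name, preserving the ZIP stream order.
def readAllEntries(zis: ZipInputStream): ListMap[String, ArraySeq[Byte]] =
  Iterator.continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .filterNot(_.isDirectory)
    .map { entry =>
      val bos = new ByteArrayOutputStream()
      val buf = new Array[Byte](8192)
      Iterator.continually(zis.read(buf)).takeWhile(_ != -1)
        .foreach(n => bos.write(buf, 0, n))
      entry.getName -> ArraySeq.unsafeWrapArray(bos.toByteArray)
    }
    .to(ListMap)

// Build a small in-memory ZIP so the sketch is self-contained.
val sampleZip: Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val zos = new ZipOutputStream(bos)
  zos.putNextEntry(new ZipEntry("META-INF/catalog.xml"))
  zos.write("<catalog/>".getBytes("UTF-8"))
  zos.closeEntry()
  zos.close()
  bos.toByteArray
}

val entries =
  readAllEntries(new ZipInputStream(new ByteArrayInputStream(sampleZip)))
```

Note that the ZipInputStream is consumed sequentially here; parallelism only becomes safe afterwards, on the immutable result.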

def withTransformDocument(newTransformDocument: SaxonDocument => BackingDocumentApi): TaxonomyBaseFactoryFromRemoteZip

Returns a copy of this factory that uses the given document transformation function.

Concrete fields

val createZipInputStream: () => ZipInputStream
val transformDocument: SaxonDocument => BackingDocumentApi