TaxonomyBaseFactoryFromRemoteZip

final class TaxonomyBaseFactoryFromRemoteZip(val createZipInputStream: () => ZipInputStream, val transformDocument: SaxonDocument => BackingDocumentApi)

TaxonomyBase factory from a remote (or local) taxonomy package ZIP file. The ZIP does not have to be a taxonomy package with a META-INF/taxonomyPackage.xml file, but it does need a META-INF/catalog.xml file. Moreover, this catalog.xml file must be invertible, so that there is exactly one original URI per mapped URI. The catalog must also be consistent with the document URIs found during DTS discovery; otherwise loading the documents whose URIs were found during DTS discovery will likely fail.

Another thing to keep in mind is that each XML file in the ZIP stream is parsed, even if most of those files are not part of the DTS. Indeed, DTS discovery is done only after all XML documents have been loaded. This may be impractical if the ZIP stream contains far too many ZIP entries, for example for unused taxonomy versions or for parts of the taxonomy that fall outside the DTSes we are interested in.

So if the catalog is not invertible, or the ZIP contains far more documents than required, this TaxonomyBase factory is not usable.
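The invertibility requirement can be illustrated with plain Maps. This is a hedged sketch, not the SimpleCatalog API: a catalog is modeled as a Map from original URI prefixes to rewrite prefixes (like an XML catalog's rewriteURI entries), and it is invertible precisely when no two original prefixes share the same rewrite prefix.

```scala
import scala.collection.immutable.ListMap

// Illustrative model of a catalog: original URI prefix -> rewritten (local) prefix.
// Invertible means each rewritten prefix has exactly one original prefix,
// so mapped URIs can be translated back to unique original URIs.
def isInvertible(catalog: ListMap[String, String]): Boolean =
  catalog.values.toSeq.distinct.size == catalog.size

val invertibleCatalog = ListMap(
  "http://www.example.com/taxonomy/" -> "taxonomy/",
  "http://www.example.org/other/" -> "other/"
)

// Two original prefixes mapping to the same local prefix: not invertible.
val nonInvertibleCatalog = ListMap(
  "http://www.example.com/taxonomy/" -> "taxonomy/",
  "http://mirror.example.com/taxonomy/" -> "taxonomy/"
)
```

With the second catalog, a document stored under `taxonomy/` has two candidate original URIs, so its original URI cannot be reconstructed unambiguously.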

Authors

Chris de Vreeze

Companion
object
Supertypes
class Object
trait Matchable
class Any

Value members

Concrete methods

def findDtsUris(entrypointUris: Set[URI], allTaxoDocs: Seq[SaxonDocument]): Set[URI]

Finds all URIs in the DTS, given the entrypoint URIs passed. A superset of the DTS, as a document collection, is passed as the second parameter. If the resulting URI set is not a subset of the URIs of that document collection, an exception is thrown.
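The subset check described above can be sketched with stdlib code only. The function name and error message below are illustrative, not the real implementation:

```scala
import java.net.URI

// Hedged sketch: the computed DTS URI set must be covered by the URIs of
// the parsed documents, or else an exception is thrown.
def checkDtsUris(dtsUris: Set[URI], docUris: Set[URI]): Set[URI] = {
  require(
    dtsUris.subsetOf(docUris),
    s"Missing documents for URIs: ${dtsUris.diff(docUris)}")
  dtsUris
}

val docUris = Set(
  URI.create("http://www.example.com/a.xsd"),
  URI.create("http://www.example.com/b.xml"))

val dtsUris = Set(URI.create("http://www.example.com/a.xsd"))
```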

def loadDts(entrypointUris: Set[URI]): TaxonomyBase

Loads a taxonomy as TaxonomyBase from the given entrypoint URIs. This method calls method readAllXmlDocuments, and then calls the other overloaded loadDts method (which no longer needs the ZIP input stream).

def loadDts(entrypointUris: Set[URI], xmlByteArrays: ListMap[String, ArraySeq[Byte]]): TaxonomyBase

Loads a taxonomy as TaxonomyBase from the given entrypoint URIs. This is the method that calls all the other methods of this class, except for method readAllXmlDocuments (and, of course, the overloaded loadDts method).

The second parameter is the Map from ZIP entry names to immutable byte arrays. The ZIP entry names are assumed to use Unix-style (file component) separators.
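The shape of that second parameter can be sketched as follows; the entry names and contents are made up for illustration:

```scala
import scala.collection.immutable.{ArraySeq, ListMap}

// Illustrative shape of the parameter: ZIP entry names with Unix-style '/'
// separators, mapped to immutable byte arrays of file content.
// ListMap preserves the insertion (ZIP stream) order of the entries.
val xmlByteArrays: ListMap[String, ArraySeq[Byte]] = ListMap(
  "META-INF/catalog.xml" ->
    ArraySeq.unsafeWrapArray("<catalog/>".getBytes("UTF-8")),
  "taxonomy/entrypoint.xsd" ->
    ArraySeq.unsafeWrapArray("<schema/>".getBytes("UTF-8"))
)
```

Using immutable ArraySeq values (rather than raw arrays) means the collection can later be shared freely across threads.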

def locateAndParseCatalog(fileDataCollection: ListMap[String, ArraySeq[Byte]]): SimpleCatalog

Finds the "META-INF/catalog.xml" file in the file data collection, throws an exception if it is not found, and parses the catalog data by calling method "parseCatalog". The file data collection uses the ZIP entry names as Map keys, using Unix-style (file component) separators.
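The lookup part can be sketched in a few lines of stdlib Scala (the function name and error message are hypothetical):

```scala
import scala.collection.immutable.{ArraySeq, ListMap}

// Hedged sketch: locate the catalog data by its fixed ZIP entry name,
// failing fast if the entry is absent.
def locateCatalogData(
    fileDataCollection: ListMap[String, ArraySeq[Byte]]): ArraySeq[Byte] =
  fileDataCollection.getOrElse(
    "META-INF/catalog.xml",
    sys.error("Missing META-INF/catalog.xml in the ZIP stream"))

val files = ListMap(
  "META-INF/catalog.xml" ->
    ArraySeq.unsafeWrapArray("<catalog/>".getBytes("UTF-8")),
  "taxonomy/entrypoint.xsd" ->
    ArraySeq.unsafeWrapArray("<schema/>".getBytes("UTF-8")))
```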

def parseAllTaxonomyDocuments(fileDataCollection: ListMap[String, ArraySeq[Byte]], catalog: SimpleCatalog): IndexedSeq[SaxonDocument]

Parses all (taxonomy) documents, without knowing the DTS yet. It is required that the passed XML catalog is invertible, or else this method does not work. After calling this function, all documents are available for computing the DTS and creating the taxonomy (from a subset of those documents).

The first parameter is the Map from ZIP entry names to immutable byte arrays. The ZIP entry names are assumed to use Unix-style (file component) separators.

Implementation note: this function parses many documents in parallel, for speed.
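The "parse many documents in parallel" idea can be sketched with Futures on a thread pool. This is a generic hedged sketch, not the actual implementation; real Saxon document building is replaced by an arbitrary parse function:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hedged sketch: map each input to a parsed result on the global thread
// pool, then wait for all results. Future.sequence preserves input order.
def parseAllInParallel[A, B](inputs: Seq[A])(parse: A => B): Seq[B] = {
  implicit val ec: ExecutionContext = ExecutionContext.global
  val futures = inputs.map(a => Future(parse(a)))
  Await.result(Future.sequence(futures), Duration.Inf)
}

// Trivial stand-in for document parsing, just to show the shape.
val doubled = parseAllInParallel(Seq(1, 2, 3))(_ * 2)
```

Because the inputs are immutable byte arrays (see readAllXmlDocuments), the parse function can safely run on multiple threads without coordination.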

def parseCatalog(fileData: ArraySeq[Byte]): SimpleCatalog

Parses the catalog file data (as an immutable byte array) into a SimpleCatalog. The returned catalog has the relative URI "META-INF/catalog.xml" as its document URI.
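To make the catalog format concrete, here is a hedged sketch that extracts rewriteURI entries from catalog bytes with the JDK DOM parser. The real SimpleCatalog is richer than the plain Map produced here, and the catalog content shown is made up:

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Illustrative catalog.xml with one rewriteURI entry, the shape an XBRL
// taxonomy package catalog typically has.
val catalogXml: Array[Byte] =
  """<?xml version="1.0" encoding="UTF-8"?>
    |<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    |  <rewriteURI uriStartString="http://www.example.com/taxonomy/" rewritePrefix="../taxonomy/"/>
    |</catalog>""".stripMargin.getBytes("UTF-8")

// Hedged sketch: collect (uriStartString -> rewritePrefix) pairs.
def parseRewriteEntries(data: Array[Byte]): Map[String, String] = {
  val dbf = DocumentBuilderFactory.newInstance()
  dbf.setNamespaceAware(true)
  val doc = dbf.newDocumentBuilder().parse(new ByteArrayInputStream(data))
  val nodes = doc.getElementsByTagNameNS(
    "urn:oasis:names:tc:entity:xmlns:xml:catalog", "rewriteURI")
  (0 until nodes.getLength).map { i =>
    val e = nodes.item(i).asInstanceOf[Element]
    e.getAttribute("uriStartString") -> e.getAttribute("rewritePrefix")
  }.toMap
}

val rewriteEntries = parseRewriteEntries(catalogXml)
```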

def readAllXmlDocuments(): ListMap[String, ArraySeq[Byte]]

Reads all XML documents in the ZIP stream into memory, not as parsed DOM trees but as immutable byte arrays. More precisely, the result is a Map from ZIP entry names (following the ZipEntry.getName format on Unix) to immutable ArraySeq collections of bytes. Again, the ZIP entry names are assumed to use Unix-style (file component) separators.

Given this result, other code can safely turn the collection into a parallel collection of parsed XML documents, and it can also first grab the catalog.xml content and use it to compute the (original) URIs of the other documents.

Admittedly, reading all XML files in the ZIP stream into memory may have quite a large memory footprint, which is garbage-collected some time later. On the other hand, code that exploits parallelism with the result of this function as input can be quite simple to reason about (note the immutable byte arrays), without introducing any Futures and later blocking on them. Moreover, if we cannot even temporarily hold all files in byte arrays, are we sure we have enough memory for whatever the program does with the loaded taxonomy or taxonomies?
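The reading step can be sketched with only the JDK ZIP classes and Scala immutable collections. This is a hedged sketch under the same assumptions as the method above (entry names come from ZipEntry.getName, which uses '/' separators):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}
import scala.collection.immutable.{ArraySeq, ListMap}

// Hedged sketch: read every (non-directory) ZIP entry into an immutable
// byte array, keyed by entry name, preserving the ZIP stream order.
def readAllEntries(zis: ZipInputStream): ListMap[String, ArraySeq[Byte]] =
  Iterator.continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .filterNot(_.isDirectory)
    .map { entry =>
      val bos = new ByteArrayOutputStream()
      val buf = new Array[Byte](8192)
      Iterator.continually(zis.read(buf)).takeWhile(_ != -1)
        .foreach(n => bos.write(buf, 0, n))
      entry.getName -> ArraySeq.unsafeWrapArray(bos.toByteArray)
    }
    .to(ListMap)

// Build a small in-memory ZIP so the sketch is self-contained.
val sampleZip: Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val zos = new ZipOutputStream(bos)
  zos.putNextEntry(new ZipEntry("META-INF/catalog.xml"))
  zos.write("<catalog/>".getBytes("UTF-8"))
  zos.closeEntry()
  zos.close()
  bos.toByteArray
}

val entries =
  readAllEntries(new ZipInputStream(new ByteArrayInputStream(sampleZip)))
```

Note that the ZipInputStream is consumed sequentially here; parallelism only becomes safe afterwards, on the immutable result.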

def withTransformDocument(newTransformDocument: SaxonDocument => BackingDocumentApi): TaxonomyBaseFactoryFromRemoteZip

Returns a copy of this factory that uses the given document transformation function.

Concrete fields

val createZipInputStream: () => ZipInputStream
val transformDocument: SaxonDocument => BackingDocumentApi