Class WarcReader

java.lang.Object
org.netpreserve.jwarc.WarcReader
All Implemented Interfaces:
Closeable, AutoCloseable, Iterable<WarcRecord>

public class WarcReader extends Object implements Iterable<WarcRecord>, Closeable
  • Constructor Details

  • Method Details

    • next

      public Optional<WarcRecord> next() throws IOException
      Reads the next WARC record.

      This method will construct an appropriate subclass of WarcRecord based on the value of the WARC-Type header. New types may be registered using registerType(String, WarcRecord.Constructor).

      The body channel of any previously read record will be closed.

      Returns:
      an instance of WarcRecord, or an empty Optional at the end of the channel.
      Throws:
      IOException - if an I/O error occurs.
      ParsingException - if the WARC record is invalid.
    • registerType

      public void registerType(String type, WarcRecord.Constructor<WarcRecord> constructor)
      Registers a new extension record type.

      Built-in types like "resource" and "response" may be overridden with a subclass that adds extension methods. The special type name "default" is used when an unregistered record type is encountered.

      Parameters:
      type - a value of the WARC-Type header
      constructor - a constructor for a corresponding subclass of WarcRecord
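
      The lookup behaviour described above (override a built-in type, fall back to "default" for anything unregistered) can be sketched with a plain map. This is a standard-library illustration only, not jwarc's implementation: the class TypeRegistrySketch, its construct method and the string results are hypothetical stand-ins for WarcRecord subclasses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a WARC-Type registry; not jwarc's actual code.
class TypeRegistrySketch {
    // Maps a WARC-Type value to a factory. Strings stand in for record objects.
    private final Map<String, Function<String, String>> constructors = new HashMap<>();

    TypeRegistrySketch() {
        // One assumed built-in type plus the special "default" fallback.
        constructors.put("response", type -> "builtin-response");
        constructors.put("default", type -> "generic(" + type + ")");
    }

    // Same shape as registerType: a later registration replaces an earlier one.
    void registerType(String type, Function<String, String> constructor) {
        constructors.put(type, constructor);
    }

    // Picks the factory for the given WARC-Type, or "default" if none is registered.
    String construct(String warcType) {
        Function<String, String> constructor =
                constructors.getOrDefault(warcType, constructors.get("default"));
        return constructor.apply(warcType);
    }

    public static void main(String[] args) {
        TypeRegistrySketch registry = new TypeRegistrySketch();
        System.out.println(registry.construct("response"));
        System.out.println(registry.construct("x-custom"));
        registry.registerType("response", type -> "my-response");
        System.out.println(registry.construct("response"));
    }
}
```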
    • calculateBlockDigest

      public void calculateBlockDigest()
      Enables calculation of block digests for all WARC records that include a "WARC-Block-Digest" header, using the digest algorithm named in that header. The calculated digest (WarcRecord.calculatedBlockDigest()) can then be compared with the digest recorded in the header (WarcRecord.blockDigest()). See also DigestingMessageBody.
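
      As a rough illustration of what such a comparison amounts to, the sketch below recomputes a digest over a record block and compares it with a header value. It uses only the standard library and makes several assumptions: the header value has the form "sha1:" followed by a Base32-encoded SHA-1 (a common convention in WARC files, but not the only one), and the class BlockDigestSketch and its methods are hypothetical, not part of jwarc.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; jwarc's own digest handling may differ.
class BlockDigestSketch {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    // Minimal RFC 4648 Base32 encoder without padding.
    static String base32(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int buffer = 0;
        int bits = 0;
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                sb.append(ALPHABET.charAt((buffer >> (bits - 5)) & 31));
                bits -= 5;
            }
        }
        if (bits > 0) {
            sb.append(ALPHABET.charAt((buffer << (5 - bits)) & 31));
        }
        return sb.toString();
    }

    // Computes an assumed "sha1:<base32>" labelled digest of the block bytes.
    static String labelledSha1(byte[] block) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return "sha1:" + base32(md.digest(block));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if the recomputed digest equals the value taken from the header.
    static boolean matches(String headerValue, byte[] block) {
        return labelledSha1(block).equals(headerValue);
    }

    public static void main(String[] args) {
        byte[] block = "hello".getBytes(StandardCharsets.UTF_8);
        String header = labelledSha1(block); // pretend this came from WARC-Block-Digest
        System.out.println(matches(header, block));
    }
}
```

      With the reader itself, the equivalent check would call calculateBlockDigest() before reading records and then compare WarcRecord.calculatedBlockDigest() with WarcRecord.blockDigest(), as described above.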
    • position

      public long position()
      Returns the byte position of the most recently read record.

      For compressed WARCs this method will only return a meaningful value if the compression was applied in such a way that the start of a new record corresponds to the start of a compression block.
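
      The reason for this restriction can be seen with plain gzip, assuming gzip is the compression in use: if every record is written as its own gzip member, the byte offset where a member starts is a valid place to begin decompressing. The sketch below uses only java.util.zip; GzipMemberSketch and its methods are hypothetical names, and the strings stand in for whole records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demonstration of per-record gzip members; not jwarc code.
class GzipMemberSketch {
    // Compresses one "record" as a complete, standalone gzip member.
    static byte[] gzipMember(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Concatenates two byte arrays, as when appending members to one file.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] result = new byte[a.length + b.length];
        System.arraycopy(a, 0, result, 0, a.length);
        System.arraycopy(b, 0, result, a.length, b.length);
        return result;
    }

    // Starts decompressing at the given byte offset and reads to the end.
    static String readFrom(byte[] file, int offset) {
        try (GZIPInputStream gz = new GZIPInputStream(
                new ByteArrayInputStream(file, offset, file.length - offset))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] first = gzipMember("record one");
        byte[] second = gzipMember("record two");
        byte[] file = concat(first, second);
        // The second member begins exactly where the first one ends.
        System.out.println(readFrom(file, first.length));
    }
}
```

      An offset that falls inside a member is not a valid gzip header, so decompression from there would generally fail. This is why a position is only meaningful when record boundaries and compression boundaries coincide.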

    • position

      public void position(long newPosition) throws IOException
      Seeks to the record at the given position in the underlying channel.
      Parameters:
      newPosition - byte offset of the beginning of the record to seek to
      Throws:
      IOException - if an I/O error occurs
      IllegalArgumentException - if the position is negative
      UnsupportedOperationException - if the underlying channel does not support seeking
    • compression

      public WarcCompression compression()
      Returns the type of WARC compression that was detected.
    • iterator

      public Iterator<WarcRecord> iterator()
      Returns an iterator over the records in the WARC file.
      Specified by:
      iterator in interface Iterable<WarcRecord>
    • records

      public Stream<WarcRecord> records()
      Returns a Stream over the records in the WARC file.
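
      One way to picture the relationship between next() and records() is an adapter that turns an Optional-returning, IOException-throwing method into a Stream. The sketch below is an assumption about the general pattern, not a description of how jwarc implements records(). LineSource is a hypothetical stand-in for the reader that yields lines instead of WARC records, and takeWhile requires Java 9 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical adapter from an Optional-returning next() to a Stream.
class RecordStreamSketch {
    // Same shape as next(): returns an empty Optional at the end of the input.
    static class LineSource {
        private final BufferedReader in;

        LineSource(String text) {
            this.in = new BufferedReader(new StringReader(text));
        }

        Optional<String> next() throws IOException {
            return Optional.ofNullable(in.readLine());
        }
    }

    // Pulls items lazily until next() reports the end of the input.
    static Stream<String> records(LineSource source) {
        return Stream.<Optional<String>>generate(() -> {
            try {
                return source.next();
            } catch (IOException e) {
                // Lambdas in a Stream cannot throw checked exceptions.
                throw new UncheckedIOException(e);
            }
        }).takeWhile(Optional::isPresent).map(Optional::get);
    }

    // Convenience wrapper that collects every item into a list.
    static List<String> readAll(String text) {
        return records(new LineSource(text)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(readAll("first\nsecond\nthird"));
    }
}
```

      Because stream operations cannot propagate checked exceptions, an adapter like this has to wrap IOException in an unchecked exception, which is worth keeping in mind when handling errors around a record stream.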
    • onWarning

      public void onWarning(Consumer<String> warningHandler)
      Registers a handler that will be called when the reader encounters an error it was able to recover from.
    • setLenient

      public void setLenient(boolean lenient)
      Sets the lenient mode for the WarcParser.

      When enabled, this causes the parser to follow the specification less strictly, allowing reading of non-compliant records by:

      • permitting ASCII control characters in header field names and values
      • allowing lines to end with LF instead of CRLF
      • permitting multi-digit WARC minor versions like "0.18"
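
      To make two of these relaxations concrete, the sketch below checks a WARC version line in a strict and a lenient way. The regular expressions are assumptions chosen for illustration (single-digit versions and CRLF when strict; multi-digit minor versions and bare LF when lenient) and do not reproduce the grammar used by WarcParser. LenientVersionSketch is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical strict/lenient check of a WARC version line; not jwarc's parser.
class LenientVersionSketch {
    // Assumed strict form: one digit each for major and minor, CRLF line ending.
    private static final Pattern STRICT = Pattern.compile("WARC/[0-9]\\.[0-9]\r\n");
    // Assumed lenient form: multi-digit minor version, CRLF or bare LF.
    private static final Pattern LENIENT = Pattern.compile("WARC/[0-9]\\.[0-9]+\r?\n");

    static boolean acceptsVersionLine(String line, boolean lenient) {
        return (lenient ? LENIENT : STRICT).matcher(line).matches();
    }

    public static void main(String[] args) {
        String oldStyle = "WARC/0.18\n"; // multi-digit minor version, LF only
        System.out.println(acceptsVersionLine(oldStyle, false));
        System.out.println(acceptsVersionLine(oldStyle, true));
    }
}
```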
    • close

      public void close() throws IOException
      Closes the underlying channel.
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Throws:
      IOException - if an I/O error occurs