Class WarcReader
- All Implemented Interfaces:
Closeable, AutoCloseable, Iterable<WarcRecord>
-
Constructor Summary
Constructors
WarcReader(InputStream stream)
WarcReader(ReadableByteChannel channel)
WarcReader(ReadableByteChannel channel, ByteBuffer buffer): Create WarcReader with user-provided buffer.
WarcReader(Path path)
-
Method Summary
Modifier and Type, Method, Description
void calculateBlockDigest(): Enables calculation of block digests for all WARC records that include the "WARC-Block-Digest" header, using the digest algorithm named in that header.
void close(): Closes the underlying channel.
compression(): The type of WARC compression that was detected.
iterator(): Returns an iterator over the records in the WARC file.
next(): Reads the next WARC record.
void onWarning: Registers a handler that will be called when the reader encounters an error it was able to recover from.
long position(): Returns the byte position of the most recently read record.
void position(long newPosition): Seeks to the record at the given position in the underlying channel.
records(): Returns a Stream over the records in the WARC file.
void registerType(String type, WarcRecord.Constructor<WarcRecord> constructor): Registers a new extension record type.
void setLenient(boolean lenient): Sets the lenient mode for the WarcParser.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Constructor Details
-
WarcReader
public WarcReader(ReadableByteChannel channel, ByteBuffer buffer) throws IOException, IllegalArgumentException
Create WarcReader with user-provided buffer. Data contained in the buffer is used as initial input before reading from the input channel. The buffer must be ready for reading (Buffer.flip() called).
- Parameters:
channel - read WARC data from
buffer - buffer to read initial data from, later used to buffer data from channel
- Throws:
IOException
IllegalArgumentException - if buffer is not readable or is not backed by an array
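The two buffer requirements above (flipped for reading, backed by an array) can be illustrated with plain java.nio. The following is a minimal sketch; the class name BufferPrep and the helper prefilled are hypothetical and exist only for illustration, and the resulting buffer is what one would presumably hand to the constructor together with a channel.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class BufferPrep {
    /** Returns a heap buffer holding the given bytes, flipped so it is ready for reading. */
    static ByteBuffer prefilled(byte[] initial, int capacity) {
        ByteBuffer buffer = ByteBuffer.allocate(capacity); // heap buffer, so hasArray() is true
        buffer.put(initial); // write mode: position advances past the data
        buffer.flip();       // read mode: position = 0, limit = number of bytes written
        return buffer;
    }

    public static void main(String[] args) {
        byte[] head = "WARC/1.0\r\n".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buffer = prefilled(head, 8192);
        System.out.println("array-backed: " + buffer.hasArray());
        System.out.println("readable bytes: " + buffer.remaining());
    }
}
```

A buffer from ByteBuffer.allocateDirect(int) is not array-backed, so under the stated contract it would be expected to trigger the IllegalArgumentException.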
-
WarcReader
- Throws:
IOException
-
WarcReader
- Throws:
IOException
-
WarcReader
- Throws:
IOException
-
-
Method Details
-
next
Reads the next WARC record.
This method will construct an appropriate subclass of WarcRecord based on the value of the WARC-Type header. New types may be registered using registerType(String, WarcRecord.Constructor).
The body channel of any previously read record will be closed.
- Returns:
an instance of WarcRecord, or an empty Optional at the end of the channel.
- Throws:
IOException - if an I/O error occurs.
ParsingException - if the WARC record is invalid.
-
registerType
Registers a new extension record type.
Built-in types like "resource" and "response" may be overridden with a subclass that adds extension methods. The special type name "default" is used when an unregistered record type is encountered.
- Parameters:
type - a value of the WARC-Type header
constructor - a constructor for a corresponding subclass of WarcRecord
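The lookup behaviour described above (override a built-in type, fall back to "default" for unknown types) can be pictured with a small registry. This is a conceptual sketch only, not the library's implementation: TypeRegistrySketch, SketchRecord and CustomResponse are hypothetical stand-ins, and java.util.function.Function is used in place of WarcRecord.Constructor.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class TypeRegistrySketch {
    /** Hypothetical stand-in for a generic record. */
    static class SketchRecord {
        final String type;
        SketchRecord(String type) { this.type = type; }
        String describe() { return "generic " + type; }
    }

    /** Hypothetical subclass that overrides handling of one type. */
    static class CustomResponse extends SketchRecord {
        CustomResponse(String type) { super(type); }
        @Override String describe() { return "custom " + type; }
    }

    private final Map<String, Function<String, SketchRecord>> constructors = new HashMap<>();

    TypeRegistrySketch() {
        // "default" handles every type that has no explicit registration.
        constructors.put("default", SketchRecord::new);
    }

    void registerType(String type, Function<String, SketchRecord> constructor) {
        constructors.put(type, constructor); // later registrations replace earlier ones
    }

    SketchRecord construct(String warcType) {
        return constructors.getOrDefault(warcType, constructors.get("default")).apply(warcType);
    }

    public static void main(String[] args) {
        TypeRegistrySketch registry = new TypeRegistrySketch();
        registry.registerType("response", CustomResponse::new);
        System.out.println(registry.construct("response").describe());
        System.out.println(registry.construct("x-unknown").describe());
    }
}
```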
-
calculateBlockDigest
public void calculateBlockDigest()
Enables calculation of block digests for all WARC records that include the "WARC-Block-Digest" header, using the digest algorithm named in that header. The calculated digest (WarcRecord.calculatedBlockDigest()) can then be compared to the pre-calculated digest (WarcRecord.blockDigest()). See also DigestingMessageBody.
-
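The idea behind block digest checking, recomputing a digest with the algorithm the header names and comparing it to the header value, can be sketched with java.security.MessageDigest alone. Everything here is an assumption for illustration: the class BlockDigestCheck, the small label-to-algorithm table, and the hex encoding of the value (real files may use another encoding such as Base32). It does not reproduce what the library does internally.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

class BlockDigestCheck {
    /** Maps a digest label from a header value to a JCA algorithm name (illustrative subset). */
    static String jcaName(String label) {
        switch (label.toLowerCase(Locale.ROOT)) {
            case "md5": return "MD5";
            case "sha1": return "SHA-1";
            case "sha256": return "SHA-256";
            default: throw new IllegalArgumentException("unsupported digest: " + label);
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    /** Computes "label:hexdigest" over the block (hex encoding is an assumption of this sketch). */
    static String digest(String label, byte[] block) {
        try {
            MessageDigest md = MessageDigest.getInstance(jcaName(label));
            return label + ":" + hex(md.digest(block));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Recomputes the digest with the header's own algorithm and compares it to the header value. */
    static boolean matches(String headerValue, byte[] block) {
        int colon = headerValue.indexOf(':');
        if (colon < 0) throw new IllegalArgumentException("expected algorithm:value");
        String label = headerValue.substring(0, colon);
        return digest(label, block).equalsIgnoreCase(headerValue);
    }

    public static void main(String[] args) {
        byte[] block = "hello".getBytes(StandardCharsets.US_ASCII);
        String header = digest("sha1", block); // stands in for a stored WARC-Block-Digest value
        System.out.println(matches(header, block));
        System.out.println(matches(header, "jello".getBytes(StandardCharsets.US_ASCII)));
    }
}
```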
position
public long position()
Returns the byte position of the most recently read record.
For compressed WARCs this method will only return a meaningful value if the compression was applied in such a way that the start of a new record corresponds to the start of a compression block.
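One common way to satisfy the compression requirement above is to write each record as its own gzip member, so that every record start is also a point where decompression can begin. The sketch below demonstrates this with java.util.zip only (Java 9+ for readAllBytes); the class PerRecordGzip and its helpers are hypothetical, and the strings stand in for real WARC records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class PerRecordGzip {
    /** Appends one record as its own gzip member and returns the offset where that member starts. */
    static long appendMember(ByteArrayOutputStream file, String record) {
        long offset = file.size();
        // Closing the gzip stream writes the member trailer; closing a ByteArrayOutputStream is a no-op.
        try (GZIPOutputStream gz = new GZIPOutputStream(file)) {
            gz.write(record.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offset;
    }

    /** Decompresses from the given member offset to the end of the data. */
    static String readFrom(byte[] file, long offset) {
        int start = (int) offset;
        try (GZIPInputStream gz = new GZIPInputStream(
                new ByteArrayInputStream(file, start, file.length - start))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        long first = appendMember(file, "record-one\n");
        long second = appendMember(file, "record-two\n");
        System.out.println("offsets: " + first + ", " + second);
        System.out.print(readFrom(file.toByteArray(), second));
    }
}
```

An offset that falls inside a member does not point at a gzip header, so decompression from there would generally fail; this is why positions are only meaningful when record and member boundaries coincide.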
-
position
Seeks to the record at the given position in the underlying channel.
- Parameters:
newPosition - byte offset of the beginning of the record to seek to
- Throws:
IOException - if an I/O error occurs
IllegalArgumentException - if the position is negative
UnsupportedOperationException - if the underlying channel does not support seeking
-
compression
The type of WARC compression that was detected.
-
iterator
Returns an iterator over the records in the WARC file.
- Specified by:
iterator in interface Iterable<WarcRecord>
-
records
Returns a Stream over the records in the WARC file.
-
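onWarning
Registers a handler that will be called when the reader encounters an error it was able to recover from.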
-
setLenient
public void setLenient(boolean lenient)
Sets the lenient mode for the WarcParser.
When enabled, this causes the parser to follow the specification less strictly, allowing reading of non-compliant records by:
- permitting ASCII control characters in header field names and values
- allowing lines to end with LF instead of CRLF
- permitting multi-digit WARC minor versions like "0.18"
-
close
Closes the underlying channel.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Throws:
IOException
-