Object

com.github.mjakubowski84.parquet4s

ParquetStreams

object ParquetStreams

Holds factories of Akka Streams sources and sinks that allow reading from and writing to Parquet files.

Linear Supertypes

AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  6. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  7. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  8. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  9. def fromParquet[T](path: String, options: Options = ParquetReader.Options(), filter: Filter = Filter.noopFilter)(implicit arg0: ParquetRecordDecoder[T]): Source[T, NotUsed]

    Creates an akka.stream.scaladsl.Source that reads Parquet data from the specified path. If there are multiple files at path, the order in which the files are loaded is determined by the underlying filesystem.
    Path can refer to a local file, HDFS, AWS S3, Google Storage, Azure, etc. Please refer to the Hadoop client documentation or to your data provider to learn how to configure the connection.
    The source can also read partitioned directories. The filter applies to partition values as well. Partition values are set as fields of the read entities at the path defined by the partition name. That path can be a simple column name or a dot-separated path to a nested field. Missing intermediate fields are created automatically for each read record.

    Take note! Due to an issue with implicit resolution in Scala 2.11 you may need to pass all parameters of ParquetStreams.fromParquet explicitly, even those that have default values. This specifically concerns the case where you would like to omit options but define filter. The issue does not occur in Scala 2.12 and 2.13.

    T

    type of data that represents the schema of the Parquet data, e.g.:

    case class MyData(id: Long, name: String, created: java.sql.Timestamp)

    path

    URI to Parquet files, e.g.:

    "file:///data/users"

    options

    configuration of how Parquet files should be read

    filter

    optional before-read filter; no filtering is applied by default; check Filter for more details

    returns

    The source of Parquet data
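
    For illustration, here is a minimal reading sketch. It assumes Akka 2.6 (where an implicit ActorSystem provides the stream materializer); the object name and the shutdown handling are illustrative only, while the case class and the path come from the examples above.

    import akka.NotUsed
    import akka.actor.ActorSystem
    import akka.stream.scaladsl.{Sink, Source}
    import com.github.mjakubowski84.parquet4s.ParquetStreams

    object ReadExample extends App {
      // Schema of the Parquet data, as in the example above
      case class MyData(id: Long, name: String, created: java.sql.Timestamp)

      implicit val system: ActorSystem = ActorSystem()

      // Reads every Parquet file found under the given URI and decodes each row into MyData
      val users: Source[MyData, NotUsed] =
        ParquetStreams.fromParquet[MyData](path = "file:///data/users")

      // Materialize the stream, here by printing each record, and shut down afterwards
      users
        .runWith(Sink.foreach(println))
        .onComplete(_ => system.terminate())(system.dispatcher)
    }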

  10. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  11. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  12. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  13. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  14. final def notify(): Unit

    Definition Classes
    AnyRef
  15. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  16. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  17. def toParquetSingleFile[T](path: String, options: Options = ParquetWriter.Options())(implicit arg0: ParquetRecordEncoder[T], arg1: ParquetSchemaResolver[T]): Sink[T, Future[Done]]

    Creates an akka.stream.scaladsl.Sink that writes Parquet data to a single file at the specified path (including the file name).
    Path can refer to a local file, HDFS, AWS S3, Google Storage, Azure, etc. Please refer to the Hadoop client documentation or to your data provider to learn how to configure the connection.

    T

    type of data that represents the schema of the Parquet data, e.g.:

    case class MyData(id: Long, name: String, created: java.sql.Timestamp)

    path

    URI to Parquet files, e.g.:

    "file:///data/users/users-2019-01-01.parquet"

    options

    set of options that define how Parquet files will be created

    returns

    The sink that writes a Parquet file
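
    For illustration, here is a minimal single-file write sketch. It assumes Akka 2.6 (implicit ActorSystem as materializer); the object name and the sample records are hypothetical, while the case class and the path come from the examples above.

    import java.sql.Timestamp

    import akka.actor.ActorSystem
    import akka.stream.scaladsl.Source
    import com.github.mjakubowski84.parquet4s.ParquetStreams

    object WriteSingleFileExample extends App {
      case class MyData(id: Long, name: String, created: java.sql.Timestamp)

      implicit val system: ActorSystem = ActorSystem()

      val records = Source(List(
        MyData(1L, "Alice", new Timestamp(System.currentTimeMillis())),
        MyData(2L, "Bob", new Timestamp(System.currentTimeMillis()))
      ))

      // Writes all elements of the stream into a single Parquet file; the materialized
      // Future[Done] completes once the file has been written and closed
      records
        .runWith(ParquetStreams.toParquetSingleFile[MyData]("file:///data/users/users-2019-01-01.parquet"))
        .onComplete(_ => system.terminate())(system.dispatcher)
    }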

  18. def toString(): String

    Definition Classes
    AnyRef → Any
  19. def viaParquet[T](path: String): Builder[T, T]

    Builds a flow that:

    • Is designed to write Parquet files indefinitely
    • Is able to (optionally) partition data by a list of provided fields
    • Flushes and rotates files after a given number of rows is written or a given time period elapses
    • Outputs each incoming message after it is written, but can write an effect of a provided message transformation instead.

    T

    type of message that the flow is meant to accept

    path

    URI to Parquet files, e.g.:

    "file:///data/users"

    returns

    Builder of ParquetPartitioningFlow
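
    For illustration, here is a sketch of how such a flow may be wired up, assuming Akka 2.6. The builder method names used here (withMaxCount, withMaxDuration, withPartitionBy) are assumptions based on the 1.x Builder API; please verify them against the Builder documentation. The case class and the path come from the examples above, and the finite source stands in for an indefinite stream (e.g. a Kafka consumer).

    import java.sql.Timestamp

    import scala.concurrent.duration._

    import akka.actor.ActorSystem
    import akka.stream.scaladsl.{Sink, Source}
    import com.github.mjakubowski84.parquet4s.ParquetStreams

    object ViaParquetExample extends App {
      case class MyData(id: Long, name: String, created: java.sql.Timestamp)

      implicit val system: ActorSystem = ActorSystem()

      val incoming =
        Source(1 to 100).map(i => MyData(i.toLong, s"user-$i", new Timestamp(System.currentTimeMillis())))

      incoming
        .via(
          ParquetStreams
            .viaParquet[MyData]("file:///data/users")
            .withMaxCount(128 * 1024)    // rotate the current file after this many rows...
            .withMaxDuration(30.seconds) // ...or after this much time, whichever comes first
            .withPartitionBy("name")     // optionally partition output by a field
            .build()
        )
        // each message is emitted downstream after it has been written,
        // e.g. so that offsets are committed only for persisted data
        .runWith(Sink.ignore)
        .onComplete(_ => system.terminate())(system.dispatcher)
    }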

  20. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  21. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  22. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )

Deprecated Value Members

  1. def toParquetIndefinite[In, ToWrite, Mat](path: String, maxChunkSize: Int, chunkWriteTimeWindow: FiniteDuration, buildChunkPath: ChunkPathBuilder[In] = ChunkPathBuilder.default, preWriteTransformation: (In) ⇒ ToWrite = identity[In], postWriteSink: Sink[Seq[In], Mat] = Sink.ignore, options: Options = ParquetWriter.Options())(implicit arg0: ParquetWriterFactory[ToWrite]): Sink[In, Mat]

    Creates an akka.stream.scaladsl.Sink that writes Parquet data to files at the specified path. The sink splits files when maxChunkSize is reached or when a time equal to chunkWriteTimeWindow elapses. Files are named and written to path according to buildChunkPath. By default a file path looks like

    PATH/part-RANDOM_UUID.parquet

    Objects coming into the sink can be optionally transformed using preWriteTransformation and later handled by means of postWriteSink after the transformed object has been saved to a file.
    Path can refer to a local file, HDFS, AWS S3, Google Storage, Azure, etc. Please refer to the Hadoop client documentation or to your data provider to learn how to configure the connection.

    In

    type of incoming objects

    ToWrite

    type of data that represents the schema of the Parquet data, e.g.:

    case class MyData(id: Long, name: String, created: java.sql.Timestamp)

    Mat

    type of the sink's materialized value

    path

    URI to Parquet files, e.g.:

    "file:///data/users"

    maxChunkSize

    maximum number of records that can be saved to a single parquet file

    chunkWriteTimeWindow

    maximum time that the sink will wait before saving a (non-empty) file

    buildChunkPath

    factory function to define a custom path for each saved file

    preWriteTransformation

    function that transforms an incoming object into the data that is written to the file

    postWriteSink

    allows defining an action to be taken after each incoming object is successfully written to the file

    options

    set of options that define how Parquet files will be created

    returns

    The sink that writes Parquet files

    Annotations
    @deprecated
    Deprecated

    (Since version 1.3.0) Use viaParquet instead

  2. def toParquetParallelUnordered[T](path: String, parallelism: Int, options: Options = ParquetWriter.Options())(implicit arg0: ParquetRecordEncoder[T], arg1: ParquetSchemaResolver[T]): Sink[T, Future[Done]]

    Creates an akka.stream.scaladsl.Sink that writes Parquet data to files at the specified path. The sink splits the data into a number of files equal to parallelism. Files are written in parallel and data is written in an unordered way.
    Path can refer to a local file, HDFS, AWS S3, Google Storage, Azure, etc. Please refer to the Hadoop client documentation or to your data provider to learn how to configure the connection.

    T

    type of data that represents the schema of the Parquet data, e.g.:

    case class MyData(id: Long, name: String, created: java.sql.Timestamp)

    path

    URI to Parquet files, e.g.:

    "file:///data/users"

    parallelism

    defines how many files are created and how many parallel threads are responsible for it

    options

    set of options that define how Parquet files will be created

    returns

    The sink that writes Parquet files

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) In the future, viaParquet and toParquetSingleFile may be the only supported writers

  3. def toParquetSequentialWithFileSplit[T](path: String, maxRecordsPerFile: Long, options: Options = ParquetWriter.Options())(implicit arg0: ParquetRecordEncoder[T], arg1: ParquetSchemaResolver[T]): Sink[T, Future[Done]]

    Creates an akka.stream.scaladsl.Sink that writes Parquet data to files at the specified path. The sink splits files sequentially into pieces; each file contains at most maxRecordsPerFile records. It is recommended to define maxRecordsPerFile as a multiple of com.github.mjakubowski84.parquet4s.ParquetWriter.Options.rowGroupSize.
    Path can refer to a local file, HDFS, AWS S3, Google Storage, Azure, etc. Please refer to the Hadoop client documentation or to your data provider to learn how to configure the connection.

    T

    type of data that represents the schema of the Parquet data, e.g.:

    case class MyData(id: Long, name: String, created: java.sql.Timestamp)

    path

    URI to Parquet files, e.g.:

    "file:///data/users"

    maxRecordsPerFile

    the maximum number of records in a single file

    options

    set of options that define how Parquet files will be created

    returns

    The sink that writes Parquet files

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) In the future, viaParquet and toParquetSingleFile may be the only supported writers
