io.eels.datastream

DataStream

trait DataStream extends Logging

A DataStream is similar to a table of data: it has fields (like columns) and rows. Each row has an entry for every field, although that entry may be null if the field definition allows it.

It is a lazily evaluated data structure. Each operation on a stream will create a new derived stream, but those operations will only occur when a final action is performed.
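
The lazy behaviour described above can be pictured with plain Scala collections. The sketch below does not use eel itself: a standard-library `Iterator` stands in for a stream, and the `LazyDemo` object, its `derived` method and its `evaluated` counter are hypothetical names introduced only for illustration.

```scala
object LazyDemo {
  // Counts how many rows the filter predicate has actually inspected.
  var evaluated = 0

  // Builds a derived "stream". Iterator operations are lazy, so nothing runs here.
  def derived(rows: Seq[Int]): Iterator[Int] =
    rows.iterator
      .filter { r => evaluated += 1; r % 2 == 0 }
      .map(_ * 10)

  def main(args: Array[String]): Unit = {
    val stream = derived(Seq(1, 2, 3, 4))
    println(evaluated)           // no rows have been inspected yet
    val result = stream.toVector // the "action" forces evaluation
    println(result)
    println(evaluated)           // every row has now been inspected
  }
}
```

Defining the filter and the map costs nothing up front; the work happens only when `toVector` pulls rows through, which is the same shape as calling an action on a derived stream.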

You can create a DataStream from an IO source, such as a Parquet file or a Hive table, or you can create a fully evaluated one from an in-memory structure. In the former case, the data is loaded on demand when an action is performed.

A DataStream is split into one or more flows. Each flow can operate independently of the others. For example, if you filter a stream, each flow is filtered separately, which allows the work to be parallelized. If you write out a stream, each flow can be written to its own file, again allowing parallelization.
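
As a rough illustration of why independent flows parallelize well, the sketch below models a stream as a sequence of flows and filters each flow in its own `Future`. It is not eel's implementation; `FlowDemo`, `filterFlows` and the `Row` alias are hypothetical names, and the 10 second timeout is an arbitrary choice.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FlowDemo {
  // Hypothetical stand-in for a row: just a vector of values.
  type Row = Vector[Any]

  // Filters every flow in its own Future; no flow depends on another.
  def filterFlows(flows: Seq[Seq[Row]])(p: Row => Boolean): Seq[Seq[Row]] = {
    val futures = flows.map(flow => Future(flow.filter(p)))
    // Future.sequence keeps the flows in their original order.
    Await.result(Future.sequence(futures), 10.seconds)
  }
}
```

Because the predicate only ever sees one row at a time, the flows never need to coordinate, so the result is the same whether they run one after another or concurrently.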

Self Type
DataStream
Linear Supertypes
Logging, AnyRef, Any

Abstract Value Members

  1. abstract def schema: StructType

  2. abstract def subscribe(subscriber: Subscriber[Seq[Row]]): Unit

Concrete Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. def ++(other: DataStream): DataStream

    Joins two streams together, such that the elements of the given datastream are appended to the end of this datastream.

  5. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  6. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  7. def addField(field: Field, defaultValue: Any, errorIfFieldExists: Boolean): DataStream

  8. def addField(name: Field, defaultValue: Any): DataStream

    Returns a new DataStream with the given field added at the end. The value of this field for each Row is specified by the default value, which must be compatible with the field definition. For example, an error will occur if the field has type Int and the default value is 1.3.

  9. def addField(field: Field, expression: Expression, errorIfFieldExists: Boolean): DataStream

  10. def addField(field: Field, expression: Expression): DataStream

  11. def addField(name: String, defaultValue: String, errorIfFieldExists: Boolean): DataStream

  12. def addField(name: String, defaultValue: String): DataStream

    Returns a new DataStream with the new field of type String added at the end. The value of this field for each Row is specified by the default value.

  13. def addFieldFn(name: String, fn: (Row) ⇒ Any, errorIfFieldExists: Boolean): DataStream

  14. def addFieldFn(name: String, fn: (Row) ⇒ Any): DataStream

  15. def addFieldFn(field: Field, fn: (Row) ⇒ Any, errorIfFieldExists: Boolean): DataStream

  16. def addFieldFn(field: Field, fn: (Row) ⇒ Any): DataStream

    Returns a new DataStream with a new field added at the end. The value for the field is taken from the function which is invoked for each row.
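
    The effect of adding a function-derived field can be sketched with plain Scala collections. This is a model of the behaviour described above, not eel's code: `AddFieldDemo` and its `Table` case class are hypothetical stand-ins for a stream and its schema.

    ```scala
    object AddFieldDemo {
      // Hypothetical stand-in for a stream: field names plus fully evaluated rows.
      final case class Table(fields: Vector[String], rows: Vector[Vector[Any]])

      // Appends a new field whose value is computed from each existing row.
      def addFieldFn(t: Table, name: String)(fn: Vector[Any] => Any): Table =
        Table(t.fields :+ name, t.rows.map(row => row :+ fn(row)))
    }
    ```

    The function is invoked once per row and its result becomes the last value in that row, so the schema and the rows grow by one entry together.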

  17. def aggregated(): GroupedDataStream

  18. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  19. def cartesian(other: DataStream): DataStream

    Returns a new DataStream which is the result of joining every row in this datastream with every row in the given datastream.

    The given datastream will be materialized before it is used.

    For example, if this datastream has rows [a,b], [c,d] and [e,f] and the given datastream has [1,2] and [3,4] then the result will be [a,b,1,2], [a,b,3,4], [c,d,1,2], [c,d,3,4], [e,f,1,2] and [e,f,3,4].
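
    The example above can be reproduced with a plain Scala for-comprehension. The sketch models rows as vectors and does not use eel; `CartesianDemo` is a hypothetical name.

    ```scala
    object CartesianDemo {
      // Pairs every left row with every right row, concatenating their values.
      def cartesian[A](left: Seq[Vector[A]], right: Seq[Vector[A]]): Seq[Vector[A]] =
        for (l <- left; r <- right) yield l ++ r

      def main(args: Array[String]): Unit = {
        val left  = Seq(Vector[Any]("a", "b"), Vector[Any]("c", "d"), Vector[Any]("e", "f"))
        val right = Seq(Vector[Any](1, 2), Vector[Any](3, 4))
        cartesian(left, right).foreach(println)
      }
    }
    ```

    The result has one row per pair, so its size is the product of the two input sizes (3 x 2 = 6 rows in the example).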

  20. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  21. def collect: Vector[Row]

    Action which results in all the rows being returned in memory as a Vector.

  22. def collectValues: Vector[Seq[Any]]

  23. def concat(other: DataStream): DataStream

    Combines two datastreams together such that the fields from this datastream are joined with the fields of the given datastream. For example, if this datastream has fields A,B and the given datastream has fields C,D then the result will have fields A,B,C,D.

    This operation requires an executor, as it must buffer rows to ensure an even distribution.
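
    A minimal model of the field-wise combination, using plain Scala collections rather than eel. It assumes rows are paired by position, which the description above does not state explicitly; `ConcatDemo` and `Table` are hypothetical names, and `zip` silently truncates to the shorter input, which may differ from the library's behaviour.

    ```scala
    object ConcatDemo {
      // Hypothetical stand-in for a stream: field names plus fully evaluated rows.
      final case class Table(fields: Vector[String], rows: Vector[Vector[Any]])

      // Places the fields side by side and pairs rows by position (an assumption).
      def concat(a: Table, b: Table): Table =
        Table(a.fields ++ b.fields, a.rows.zip(b.rows).map { case (l, r) => l ++ r })
    }
    ```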

  24. def count: Long

  25. def drop(n: Int): DataStream

  26. def dropField(fieldName: String, caseSensitive: Boolean = true): DataStream

  27. def dropFieldIfExists(fieldName: String, caseSensitive: Boolean = true): DataStream

  28. def dropFields(regex: Regex): DataStream

  29. def dropNullRows(): DataStream

  30. def dropWhile(fieldName: String, p: (Any) ⇒ Boolean): DataStream

  31. def dropWhile(p: (Row) ⇒ Boolean): DataStream

  32. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  33. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  34. def exists(p: (Row) ⇒ Boolean): Boolean

  35. def explode(fn: (Row) ⇒ Seq[Row]): DataStream

  36. def filter(expression: Equals): DataStream

  37. def filter(fieldName: String, p: (Any) ⇒ Boolean): DataStream

    Filters where the given field name matches the given predicate.

  38. def filter(f: (Row) ⇒ Boolean): DataStream

  39. def filterNot(p: (Row) ⇒ Boolean): DataStream

  40. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  41. def find(p: (Row) ⇒ Boolean): Option[Row]

  42. def foreach[U](fn: (Row) ⇒ U): DataStream

    Executes a side-effecting function for every row in the stream, passing each row through unchanged.

  43. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  44. def groupBy(fn: (Row) ⇒ Any): GroupedDataStream

  45. def groupBy(fields: Iterable[String]): GroupedDataStream

  46. def groupBy(first: String, rest: String*): GroupedDataStream

  47. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  48. def head: Row

  49. def intersection(stream: DataStream): DataStream

  50. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  51. def iterator: Iterator[Row]

  52. def join(key: String, other: DataStream): DataStream

    Joins the given datastream to this datastream on the given key column, where the values of the keys are equal according to the Scala == operator. Both datastreams must contain the key column.

    The given datastream is fully inflated when this datastream needs to be materialized. For that reason, always pass the smaller datastream as the parameter and use the larger datastream as the receiver.
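
    The advice about which side to pass as the parameter follows from how such a join is typically evaluated. The sketch below is a generic hash join over plain Scala collections, not eel's implementation; `JoinDemo` and the `Row` alias are hypothetical, rows are modelled as maps from field name to value, and rows without a match are dropped (an inner-join assumption).

    ```scala
    object JoinDemo {
      // Hypothetical stand-in for a row: field name -> value.
      type Row = Map[String, Any]

      // The parameter side is indexed in memory up front; the receiver is only traversed once.
      def join(key: String, receiver: Iterator[Row], parameter: Seq[Row]): Iterator[Row] = {
        val index: Map[Any, Seq[Row]] = parameter.groupBy(_(key))
        receiver.flatMap { row =>
          // Keys are compared with ==, via the Map lookup.
          index.getOrElse(row(key), Seq.empty).map(other => row ++ other)
        }
      }
    }
    ```

    Only the parameter side is held in memory as an index, so keeping that side small bounds the memory cost while the larger side streams through.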

  53. def listener(_listener: Listener): DataStream

  54. val logger: Logger

    Attributes
    protected
    Definition Classes
    Logging
  55. def map(f: (Row) ⇒ Row): DataStream

  56. def mapField(fieldName: String, fn: (Any) ⇒ Any): DataStream

  57. def mapFieldIfExists(fieldName: String, fn: (Any) ⇒ Any): DataStream

  58. def maxBy[T](fn: (Row) ⇒ T)(implicit ordering: Ordering[T]): Row

  59. def minBy[T](fn: (Row) ⇒ T)(implicit ordering: Ordering[T]): Row

  60. def multiplex(count: Int): Seq[DataStream]

  61. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  62. final def notify(): Unit

    Definition Classes
    AnyRef
  63. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  64. def projection(fields: Seq[String]): DataStream

    Returns a new DataStream which contains the given list of fields from the existing stream.

  65. def projection(first: String, rest: String*): DataStream

  66. def projectionExpression(expr: String): DataStream

  67. def removeField(fieldName: String, caseSensitive: Boolean = true): DataStream

  68. def removeFieldIfExists(fieldName: String, caseSensitive: Boolean = true): DataStream

  69. def removeFields(regex: Regex): DataStream

  70. def renameField(nameFrom: String, nameTo: String): DataStream

  71. def replace(from: String, target: Any): DataStream

  72. def replace(fieldName: String, from: String, target: Any, errorIfUnknownField: Boolean = true): DataStream

  73. def replace(fieldName: String, from: String, target: Any): DataStream

  74. def replace(fieldName: String, fn: (Any) ⇒ Any, errorIfUnknownField: Boolean): DataStream

  75. def replace(fieldName: String, fn: (Any) ⇒ Any): DataStream

  76. def replaceField(name: String, field: Field): DataStream

  77. def replaceFieldType(regex: Regex, datatype: DataType): DataStream

  78. def replaceFieldType(from: DataType, to: DataType): DataStream

  79. def replaceFieldType(fieldName: String, datatype: DataType): DataStream

    Returns the same data but with an updated schema. The field that matches the given name will have its datatype set to the given datatype.

  80. def replaceNullValues(defaultValue: String): DataStream

  81. def sample(k: Int): DataStream

    Returns a new DataStream in which only every k-th row is retained. For example, if k is 2 then, on average, every other row is returned, and if k is 10 then only 10% of rows are returned. When running concurrently, the rows that are sampled will vary depending on the order in which the workers pull rows through. Each partition uses its own counter.
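
    Counter-based sampling of a single partition can be sketched as follows. This is a plain-Scala model, not eel's code; `SampleDemo` is a hypothetical name, and the choice to keep the k-th row of each group (rather than the first) is an assumption.

    ```scala
    object SampleDemo {
      // Keeps every k-th row of one partition, using a counter local to that partition.
      def sample[A](partition: Seq[A], k: Int): Seq[A] = {
        require(k > 0, "k must be positive")
        var counter = 0
        partition.filter { _ =>
          counter += 1
          counter % k == 0
        }
      }
    }
    ```

    Because each partition counts on its own, which rows survive depends on how rows are distributed across partitions, even though the overall fraction stays close to 1/k.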

  82. def size: Long

  83. def stripCharsFromFieldNames(chars: Seq[Char]): DataStream

    Returns a new DataStream with the same data as this stream, but where the field names have been sanitized by removing any occurrences of the given characters.
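
    The renaming rule itself is simple to express. The sketch below applies it to a list of field names using only the standard library; `StripDemo` and `stripChars` are hypothetical names and the data is left out entirely.

    ```scala
    object StripDemo {
      // Removes every occurrence of the given characters from each field name.
      def stripChars(fieldNames: Seq[String], chars: Seq[Char]): Seq[String] =
        fieldNames.map(name => name.filterNot(c => chars.contains(c)))
    }
    ```

    For instance, stripping the space and hyphen characters turns the names "first name" and "e-mail" into "firstname" and "email".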

  84. def substract(stream: DataStream): DataStream

  85. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  86. def take(n: Int): DataStream

  87. def takeWhile(fieldName: String, p: (Any) ⇒ Boolean): DataStream

  88. def takeWhile(p: (Row) ⇒ Boolean): DataStream

  89. def tee(schema: StructType, fn: (Row) ⇒ Seq[Row]): (DataStream, DataStream)

    Invoking this method returns two DataStreams. The first is the original datastream which will continue as is. The second is a DataStream which is fed by rows generated from the given function. The function is invoked for each row that passes through this stream.

    Cancellation requests in the tee'd datastream do not propagate back to the original stream.
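
    The relationship between the two returned streams can be modelled eagerly with plain Scala collections. The sketch ignores laziness, schemas and cancellation, and is not eel's implementation; `TeeDemo` is a hypothetical name.

    ```scala
    object TeeDemo {
      // Returns the untouched rows plus the rows generated by fn for each input row.
      def tee[A, B](rows: Seq[A])(fn: A => Seq[B]): (Seq[A], Seq[B]) =
        (rows, rows.flatMap(fn))
    }
    ```

    The function may return zero, one or many rows per input row, so the second stream can be shorter or longer than the original.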

  90. def to(sink: Sink, parallelism: Int): Long

  91. def to(sink: Sink): Long

  92. def toDataTable: DataTable

  93. def toSet: Set[Row]

  94. def toString(): String

    Definition Classes
    AnyRef → Any
  95. def toVector: Vector[Row]

    Action which results in all the rows being returned in memory as a Vector. Alias for collect.

  96. def union(other: DataStream): DataStream

  97. def update(from: String, target: Any): DataStream

    For each row, any values that match "from" will be replaced with "target". This operation applies to all fields for all rows.
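
    A plain-Scala model of the all-fields replacement is shown below. It assumes "match" means equality with ==, which the description above does not spell out; `UpdateDemo` is a hypothetical name and eel itself is not used.

    ```scala
    object UpdateDemo {
      // Replaces every value equal to `from` with `target`, in every field of every row.
      def update(rows: Seq[Vector[Any]], from: String, target: Any): Seq[Vector[Any]] =
        rows.map(_.map(v => if (v == from) target else v))
    }
    ```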

  98. def update(fieldName: String, from: String, target: Any, errorIfUnknownField: Boolean = true): DataStream

  99. def update(fieldName: String, from: String, target: Any): DataStream

    Replaces any values that match "from" with the value "target". This operation only applies to the field name specified.

  100. def update(fieldName: String, fn: (Any) ⇒ Any, errorIfUnknownField: Boolean): DataStream

  101. def update(fieldName: String, fn: (Any) ⇒ Any): DataStream

    For each row, the function is applied to the value of the given field. The result of the function is the new value for that cell.

  102. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  103. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  104. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  105. def withLowerCaseSchema(): DataStream

Deprecated Value Members

  1. def addField(name: String, fn: (Row) ⇒ Any, errorIfFieldExists: Boolean): DataStream

    Annotations
    @deprecated
    Deprecated

    (Since version 1.3.0) Use addFieldFn for better type inference

  2. def addField(name: String, fn: (Row) ⇒ Any): DataStream

    Returns a new DataStream with a new field added at the end. The datatype for the field is assumed to be String. The value for the field is taken from the function which is invoked for each row.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.3.0) Use addFieldFn for better type inference

  3. def addField(field: Field, fn: (Row) ⇒ Any, errorIfFieldExists: Boolean): DataStream

    Annotations
    @deprecated
    Deprecated

    (Since version 1.3.0) use addFieldFn

  4. def addField(field: Field, fn: (Row) ⇒ Any): DataStream

    Annotations
    @deprecated
    Deprecated

    (Since version 1.3.0) use addFieldFn

  5. def addFieldIfNotExists(field: Field, defaultValue: Any): DataStream

    Annotations
    @deprecated
    Deprecated

    (Since version 1.3.0) use addField with errorIfFieldExists = false

  6. def addFieldIfNotExists(name: String, defaultValue: Any): DataStream

    Annotations
    @deprecated
    Deprecated

    (Since version 1.3.0) use addField with errorIfFieldExists = false
