com.spotify.scio.values

SCollection

sealed trait SCollection[T] extends PCollectionWrapper[T]

A Scala wrapper for PCollection. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all SCollections, such as map, filter, and flatMap. In addition, PairSCollectionFunctions contains operations available only on SCollections of key-value pairs, such as groupByKey and join; DoubleSCollectionFunctions contains operations available only on SCollections of Doubles.

Linear Supertypes
PCollectionWrapper[T], AnyRef, Any

Abstract Value Members

  1. abstract val context: ScioContext

    The ScioContext associated with this PCollection.

    Definition Classes
    PCollectionWrapper
  2. implicit abstract val ct: ClassTag[T]

    Attributes
    protected
    Definition Classes
    PCollectionWrapper
  3. abstract val internal: PCollection[T]

    The PCollection being wrapped internally.

    Definition Classes
    PCollectionWrapper

Concrete Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. def ++(that: SCollection[T]): SCollection[T]

    Return the union of this SCollection and another one. Any identical elements will appear multiple times (use .distinct to eliminate them).

  5. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  6. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  7. def aggregate[A, U](aggregator: Aggregator[T, A, U])(implicit arg0: ClassTag[A], arg1: ClassTag[U]): SCollection[U]

    Aggregate with Aggregator. First each item T is mapped to A, then we reduce with a semigroup of A, then finally we present the results as U. This could be more powerful and better optimized in some cases.

  8. def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): SCollection[U]

    Aggregate the elements using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of this SCollection, T. Thus, we need one operation for merging a T into a U and one operation for merging two U's. Both of these functions are allowed to modify and return their first argument instead of creating a new U to avoid memory allocation.
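
    The roles of zeroValue, seqOp and combOp can be illustrated with plain Scala collections. This is only a local sketch, not Scio code: the object AggregateSketch and the helper aggregateLocally are hypothetical names, and the nested Seq stands in for however a runner happens to split the data.

```scala
object AggregateSketch {
  // Hypothetical local stand-in for aggregate(zeroValue)(seqOp, combOp):
  // each partition is folded with seqOp starting from zeroValue,
  // then the per-partition results are merged with combOp.
  def aggregateLocally[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U): U =
    partitions
      .map(_.foldLeft(zeroValue)(seqOp))
      .foldLeft(zeroValue)(combOp)

  // Result type (Int, Int) differs from the element type Int:
  // the accumulator tracks (sum, count).
  def sumAndCount(partitions: Seq[Seq[Int]]): (Int, Int) =
    aggregateLocally(partitions, (0, 0))(
      (acc, x) => (acc._1 + x, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2))
}
```

    Because zeroValue may be applied once per partition and again when merging, it has to be neutral with respect to combOp, as (0, 0) is here.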

  9. def applyTransform[U](transform: PTransform[_ >: PCollection[T], PCollection[U]])(implicit arg0: ClassTag[U]): SCollection[U]

    Apply a PTransform and wrap the output in an SCollection.

  10. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  11. def asIterableSideInput: SideInput[Iterable[T]]

    Convert this SCollection to a SideInput, mapping each window to an Iterable, to be used with SCollection.withSideInputs.

    The values of the Iterable for a window are not required to fit in memory, but they may also not be effectively cached. If it is known that every window fits in memory, and stronger caching is desired, use asListSideInput.

  12. def asListSideInput: SideInput[List[T]]

    Convert this SCollection to a SideInput, mapping each window to a List, to be used with SCollection.withSideInputs.

  13. def asSingletonSideInput: SideInput[T]

    Convert this SCollection of a single value per window to a SideInput, to be used with SCollection.withSideInputs.

  14. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  15. def collect[U](pfn: PartialFunction[T, U])(implicit arg0: ClassTag[U]): SCollection[U]

    Filter the elements for which the given PartialFunction is defined, and then map.
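
    The same filter-then-map behaviour exists on standard Scala collections, which makes for a convenient local illustration. The sketch below uses Seq.collect rather than Scio; CollectSketch and parseDigits are hypothetical names.

```scala
object CollectSketch {
  // Defined only for non-empty, all-digit strings; everything else is dropped.
  val parseDigits: PartialFunction[String, Int] = {
    case s if s.nonEmpty && s.forall(_.isDigit) => s.toInt
  }

  // Elements outside the partial function's domain are filtered out,
  // the rest are mapped to Int.
  def parsed(xs: Seq[String]): Seq[Int] = xs.collect(parseDigits)
}
```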

  16. def combine[C](createCombiner: (T) ⇒ C)(mergeValue: (C, T) ⇒ C)(mergeCombiners: (C, C) ⇒ C)(implicit arg0: ClassTag[C]): SCollection[C]

    Generic function to combine the elements using a custom set of aggregation functions. Turns an SCollection[T] into a result of type SCollection[C], for a "combined type" C. Note that T and C can be different -- for example, one might combine an SCollection of type Int into an SCollection of type Seq[Int]. Users provide three functions:

    - createCombiner, which turns a T into a C (e.g., creates a one-element list)

    - mergeValue, to merge a T into a C (e.g., adds it to the end of a list)

    - mergeCombiners, to combine two C's into a single one.
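
    How the three functions fit together can be sketched with plain Scala collections. This is a local illustration only: CombineSketch and combineLocally are hypothetical names, and the nested Seq is an assumed stand-in for the way data is split across workers.

```scala
object CombineSketch {
  // Hypothetical local stand-in for combine: within each non-empty partition the
  // first element seeds a combiner, the remaining elements are merged into it,
  // and the per-partition combiners are then merged pairwise.
  def combineLocally[T, C](partitions: Seq[Seq[T]])(createCombiner: T => C)(
      mergeValue: (C, T) => C)(mergeCombiners: (C, C) => C): Option[C] = {
    val partials = partitions.filter(_.nonEmpty).map { p =>
      p.tail.foldLeft(createCombiner(p.head))(mergeValue)
    }
    // None when there is no input at all.
    partials.reduceOption(mergeCombiners)
  }

  // Combine Ints (T) into a Seq[Int] (C), as in the example from the description.
  def toSeq(partitions: Seq[Seq[Int]]): Option[Seq[Int]] =
    combineLocally[Int, Seq[Int]](partitions)(x => Seq(x))(_ :+ _)(_ ++ _)
}
```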

  17. def count: SCollection[Long]

    Count the number of elements in the SCollection.

    returns

    a new SCollection with the count

  18. def countApproxDistinct(maximumEstimationError: Double = 0.02): SCollection[Long]

    Count approximate number of distinct elements in the SCollection.

    maximumEstimationError

    the maximum estimation error, which should be in the range [0.01, 0.5]

  19. def countApproxDistinct(sampleSize: Int): SCollection[Long]

    Count approximate number of distinct elements in the SCollection.

    sampleSize

    the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16

  20. def countByValue: SCollection[(T, Long)]

    Count of each unique value in this SCollection as an SCollection of (value, count) pairs.
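
    The shape of the result can be seen with a local analogue on a plain Seq. CountByValueSketch and countByValueLocally are hypothetical names; a Map is used here only because it makes the (value, count) pairs easy to compare.

```scala
object CountByValueSketch {
  // Group equal values together and count the size of each group.
  def countByValueLocally[T](xs: Seq[T]): Map[T, Long] =
    xs.groupBy(identity).map { case (value, occurrences) =>
      value -> occurrences.size.toLong
    }
}
```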

  21. def cross[U](that: SCollection[U])(implicit arg0: ClassTag[U]): SCollection[(T, U)]

    Return the cross product with another SCollection by replicating that to all workers. The right side should be tiny and fit in memory.
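
    The output size is the product of the two input sizes, which is why the right side needs to stay small. A local sketch with plain Seqs (CrossSketch and crossLocally are hypothetical names, not part of Scio):

```scala
object CrossSketch {
  // Every element of left is paired with every element of right.
  def crossLocally[T, U](left: Seq[T], right: Seq[U]): Seq[(T, U)] =
    for {
      l <- left
      r <- right
    } yield (l, r)
}
```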

  22. def debug(out: () ⇒ PrintStream = () => Console.out, prefix: String = ""): SCollection[T]

    Print content of a SCollection to out().

  23. def distinct: SCollection[T]

    Return a new SCollection containing the distinct elements in this SCollection.

  24. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  25. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  26. def filter(f: (T) ⇒ Boolean): SCollection[T]

    Return a new SCollection containing only the elements that satisfy a predicate.

  27. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  28. def flatMap[U](f: (T) ⇒ TraversableOnce[U])(implicit arg0: ClassTag[U]): SCollection[U]

    Return a new SCollection by first applying a function to all elements of this SCollection, and then flattening the results.

  29. def fold(implicit mon: Monoid[T]): SCollection[T]

    Fold with Monoid, which defines the associative function and "zero value" for T. This could be more powerful and better optimized in some cases.

  30. def fold(zeroValue: T)(op: (T, T) ⇒ T): SCollection[T]

    Aggregate the elements using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.

  31. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  32. def groupBy[K](f: (T) ⇒ K)(implicit arg0: ClassTag[K]): SCollection[(K, Iterable[T])]

    Return an SCollection of grouped items. Each group consists of a key and a sequence of elements mapping to that key. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting SCollection is evaluated.

    Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairSCollectionFunctions.aggregateByKey or PairSCollectionFunctions.reduceByKey will provide much better performance.

  33. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  34. def hashLookup[V](that: SCollection[(T, V)])(implicit arg0: ClassTag[V]): SCollection[(T, Iterable[V])]

    Look up values in a SCollection[(T, V)] for each element T in this SCollection by replicating that to all workers. The right side should be tiny and fit in memory.

  35. def intersection(that: SCollection[T]): SCollection[T]

    Return the intersection of this SCollection and another one. The output will not contain any duplicate elements, even if the input SCollections did.

    Note that this method performs a shuffle internally.

  36. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  37. def keyBy[K](f: (T) ⇒ K)(implicit arg0: ClassTag[K]): SCollection[(K, T)]

    Create tuples of the elements in this SCollection by applying f.

  38. def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): SCollection[U]

    Return a new SCollection by applying a function to all elements of this SCollection.

  39. def materialize: Future[Tap[T]]

    Extract data from this SCollection as a Future. The Future will be completed once the pipeline completes successfully.

  40. def max(implicit ord: Ordering[T]): SCollection[T]

    Return the max of this SCollection as defined by the implicit Ordering[T].

    returns

    a new SCollection with the maximum element

  41. def mean(implicit ev: Numeric[T]): SCollection[Double]

    Return the mean of this SCollection as defined by the implicit Numeric[T].

    returns

    a new SCollection with the mean of elements

  42. def min(implicit ord: Ordering[T]): SCollection[T]

    Return the min of this SCollection as defined by the implicit Ordering[T].

    returns

    a new SCollection with the minimum element

  43. def name: String

    A friendly name for this SCollection.

  44. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  45. final def notify(): Unit

    Definition Classes
    AnyRef
  46. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  47. def pApply[U](transform: PTransform[_ >: PCollection[T], PCollection[U]])(implicit arg0: ClassTag[U]): SCollection[U]

    Attributes
    protected
    Definition Classes
    PCollectionWrapper
  48. def partition(numPartitions: Int, f: (T) ⇒ Int): Seq[SCollection[T]]

    Partition this SCollection with the provided function.

    numPartitions

    number of output partitions

    f

    function that assigns an output partition to each element, should be in the range [0, numPartitions - 1]

    returns

    partitioned SCollections in a Seq
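
    The contract between numPartitions and f can be sketched locally with plain Seqs. PartitionSketch and partitionLocally are hypothetical names, and rejecting out-of-range indices with require is an assumption of this sketch, not a statement about how Scio reports the error.

```scala
object PartitionSketch {
  // Element x lands in output partition f(x); the result always has
  // numPartitions entries, some of which may be empty.
  def partitionLocally[T](xs: Seq[T], numPartitions: Int)(f: T => Int): Seq[Seq[T]] = {
    val indexed = xs.map { x =>
      val i = f(x)
      // f must return an index in [0, numPartitions - 1]
      require(i >= 0 && i < numPartitions, s"partition index $i out of range")
      (i, x)
    }
    (0 until numPartitions).map(i => indexed.filter(_._1 == i).map(_._2))
  }
}
```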

  49. def quantilesApprox(numQuantiles: Int)(implicit ord: Ordering[T]): SCollection[Iterable[T]]

    Compute the SCollection's data distribution using approximate N-tiles.

    returns

    a new SCollection whose single value is an Iterable of the approximate N-tiles of the elements

  50. def randomSplit(weights: Array[Double]): Array[SCollection[T]]

    Randomly splits this SCollection with the provided weights.

    weights

    weights for splits, will be normalized if they don't sum to 1

    returns

    split SCollections in an array
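
    Normalizing weights usually means dividing each one by their sum, so that weights of 1 and 3 become fractions 0.25 and 0.75. The sketch below shows only that arithmetic; WeightsSketch and normalize are hypothetical names, and whether Scio normalizes in exactly this way is an assumption.

```scala
object WeightsSketch {
  // Scale the weights so that they sum to 1 while keeping their ratios.
  def normalize(weights: Array[Double]): Array[Double] = {
    val total = weights.sum
    weights.map(_ / total)
  }
}
```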

  51. def reduce(op: (T, T) ⇒ T): SCollection[T]

    Reduce the elements of this SCollection using the specified commutative and associative binary operator.

  52. def sample(withReplacement: Boolean, fraction: Double): SCollection[T]

    Return a sampled subset of this SCollection.

  53. def sample(sampleSize: Int): SCollection[Iterable[T]]

    Return a sampled subset of this SCollection.

    returns

    a new SCollection whose single value is an Iterable of the samples

  54. def saveAsAvroFile(path: String, numShards: Int = 0, schema: Schema = null, suffix: String = "", codec: CodecFactory = CodecFactory.deflateCodec(6), metadata: Map[String, AnyRef] = Map.empty): Future[Tap[T]]

    Save this SCollection as an Avro file.

    schema

    must not be null if T is of type GenericRecord.

  55. def saveAsBigQuery(tableSpec: String, schema: TableSchema = null, writeDisposition: WriteDisposition = null, createDisposition: CreateDisposition = null)(implicit ev: <:<[T, TableRow]): Future[Tap[TableRow]]

    Save this SCollection as a BigQuery table. Note that elements must be of type TableRow.

  56. def saveAsBigQuery(table: TableReference, schema: TableSchema, writeDisposition: WriteDisposition, createDisposition: CreateDisposition)(implicit ev: <:<[T, TableRow]): Future[Tap[TableRow]]

    Save this SCollection as a BigQuery table. Note that elements must be of type TableRow.

  57. def saveAsCustomOutput(transform: PTransform[PCollection[T], PDone]): Future[Tap[T]]

    Save this SCollection with a custom output transform. The transform should have a unique name.

  58. def saveAsDatastore(projectId: String)(implicit ev: <:<[T, Entity]): Future[Tap[Entity]]

    Save this SCollection as a Datastore dataset. Note that elements must be of type Entity.

  59. def saveAsObjectFile(path: String, numShards: Int = 0, suffix: String = ".obj", metadata: Map[String, AnyRef] = Map.empty): Future[Tap[T]]

    Save this SCollection as an object file using default serialization.

  60. def saveAsProtobufFile(path: String, numShards: Int = 0)(implicit ev: <:<[T, Message]): Future[Tap[T]]

    Save this SCollection as a Protobuf file.

  61. def saveAsPubsub(topic: String)(implicit ev: <:<[T, String]): Future[Tap[String]]

    Save this SCollection as a Pub/Sub topic.

  62. def saveAsTableRowJsonFile(path: String, numShards: Int = 0)(implicit ev: <:<[T, TableRow]): Future[Tap[TableRow]]

    Save this SCollection as a JSON text file. Note that elements must be of type TableRow.

  63. def saveAsTextFile(path: String, suffix: String = ".txt", numShards: Int = 0): Future[Tap[String]]

    Save this SCollection as a text file. Note that elements must be of type String.

  64. def saveAsTfRecordFile(path: String, suffix: String = ".tfrecords", tfRecordOptions: TFRecordOptions = TFRecordOptions.writeDefault)(implicit ev: <:<[T, Array[Byte]]): Future[Tap[Array[Byte]]]

    Save this SCollection as a TensorFlow TFRecord file. Note that elements must be of type Array[Byte].

  65. def setCoder(coder: Coder[T]): SCollection[T]

    Assign a Coder to this SCollection.

  66. def setName(name: String): SCollection[T]

    Assign a name to this SCollection.

  67. def subtract(that: SCollection[T]): SCollection[T]

    Return an SCollection with the elements from this SCollection that are not in that.

  68. def sum(implicit sg: Semigroup[T]): SCollection[T]

    Reduce with Semigroup. This could be more powerful and better optimized in some cases.

  69. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  70. def take(num: Long): SCollection[T]

    Return a sampled subset of any num elements of the SCollection.

  71. def timestampBy(f: (T) ⇒ Instant, allowedTimestampSkew: Duration = Duration.ZERO): SCollection[T]

    Assign timestamps to values, with an optional allowed timestamp skew.

  72. def toString(): String

    Definition Classes
    AnyRef → Any
  73. def toWindowed: WindowedSCollection[T]

    Convert this SCollection to a WindowedSCollection.

  74. def top(num: Int)(implicit ord: Ordering[T]): SCollection[Iterable[T]]

    Return the top k (largest) elements from this SCollection as defined by the specified implicit Ordering[T].

    returns

    a new SCollection whose single value is an Iterable of the top k
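
    How the implicit Ordering[T] decides which elements count as "largest" can be shown locally with a plain Seq. TopSketch and topLocally are hypothetical names; the descending order of the result is a property of this sketch and is not claimed for the Scio output.

```scala
object TopSketch {
  // Sort from largest to smallest under the given ordering and keep the first num.
  def topLocally[T](xs: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    xs.sorted(ord.reverse).take(num)
}
```

    Passing a different Ordering changes what "largest" means, for example Ordering.by[String, Int](_.length) ranks strings by length.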

  75. def union(that: SCollection[T]): SCollection[T]

    Return the union of this SCollection and another one. Any identical elements will appear multiple times (use .distinct to eliminate them).

  76. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  77. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  78. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  79. def windowByDays(number: Int, options: WindowOptions[IntervalWindow] = WindowOptions()): SCollection[T]

    Window values by days.

  80. def windowByMonths(number: Int, options: WindowOptions[IntervalWindow] = WindowOptions()): SCollection[T]

    Window values by months.

  81. def windowByWeeks(number: Int, startDayOfWeek: Int, options: WindowOptions[IntervalWindow] = WindowOptions()): SCollection[T]

    Window values by weeks.

  82. def windowByYears(number: Int, options: WindowOptions[IntervalWindow] = WindowOptions()): SCollection[T]

    Window values by years.

  83. def withAccumulator(acc: Accumulator[_]*): SCollectionWithAccumulator[T]

    Convert this SCollection to an SCollectionWithAccumulator with one or more Accumulators, similar to Hadoop counters. Call SCollectionWithAccumulator.toSCollection when done with accumulators.

    Note that each accumulator may be used in a single scope only.

    Create accumulators with ScioContext.maxAccumulator, ScioContext.minAccumulator or ScioContext.sumAccumulator. For example:

    // One accumulator of each kind: max, min and sum
    val maxLineLength = sc.maxAccumulator[Int]("maxLineLength")
    val minLineLength = sc.minAccumulator[Int]("minLineLength")
    val emptyLines = sc.sumAccumulator[Long]("emptyLines")
    
    val p: SCollection[String] = // ...
    p
      .withAccumulator(maxLineLength, minLineLength, emptyLines)
      .filter { (l, c) =>
        val t = l.trim
        // Track the longest and shortest trimmed line lengths
        c.addValue(maxLineLength, t.length).addValue(minLineLength, t.length)
        val b = t.isEmpty
        // Count empty lines, then drop them from the output
        if (b) c.addValue(emptyLines, 1L)
        !b
      }
      .toSCollection
  84. def withFanout(fanout: Int): SCollectionWithFanout[T]

    Convert this SCollection to an SCollectionWithFanout that uses an intermediate node to combine parts of the data to reduce load on the final global combine step.

    fanout

    the number of intermediate keys that will be used

  85. def withFixedWindows(duration: Duration, offset: Duration = Duration.ZERO, options: WindowOptions[IntervalWindow] = WindowOptions()): SCollection[T]

    Window values into fixed windows.

  86. def withGlobalWindow(options: WindowOptions[GlobalWindow] = WindowOptions()): SCollection[T]

    Group values into a single global window.

  87. def withPaneInfo: SCollection[(T, PaneInfo)]

    Convert values into pairs of (value, pane info).

  88. def withSessionWindows(gapDuration: Duration, options: WindowOptions[IntervalWindow] = WindowOptions()): SCollection[T]

    Window values based on sessions.

  89. def withSideInputs(sides: SideInput[_]*): SCollectionWithSideInput[T]

    Convert this SCollection to an SCollectionWithSideInput with one or more SideInputs, similar to Spark broadcast variables. Call SCollectionWithSideInput.toSCollection when done with side inputs.

    Note that the side inputs should be tiny and fit in memory.

    val s1: SCollection[Int] = // ...
    val s2: SCollection[String] = // ...
    val s3: SCollection[(String, Double)] = // ...
    
    // Prepare side inputs
    val side1 = s1.asSingletonSideInput
    val side2 = s2.asIterableSideInput
    val side3 = s3.asMapSideInput
    
    val p: SCollection[MyRecord] = // ...
    p.withSideInputs(side1, side2, side3).map { (x, s) =>
      // Extract side inputs from context
      val s1: Int = s(side1)
      val s2: Iterable[String] = s(side2)
      val s3: Map[String, Iterable[Double]] = s(side3)
      // ...
    }
  90. def withSideOutputs(sides: SideOutput[_]*): SCollectionWithSideOutput[T]

    Convert this SCollection to an SCollectionWithSideOutput with one or more SideOutputs, so that a single transform can write to multiple destinations.

    // Prepare side outputs
    val side1 = SideOutput[String]()
    val side2 = SideOutput[Int]()
    
    val p: SCollection[MyRecord] = // ...
    p.withSideOutputs(side1, side2).map { (x, s) =>
      // Write to side outputs via context
      s.output(side1, "word").output(side2, 1)
      // ...
    }
  91. def withSlidingWindows(size: Duration, period: Duration = Duration.millis(1), offset: Duration = Duration.ZERO, options: WindowOptions[IntervalWindow] = WindowOptions()): SCollection[T]

    Window values into sliding windows.

  92. def withTimestamp: SCollection[(T, Instant)]

    Convert values into pairs of (value, timestamp).

  93. def withWindow: SCollection[(T, BoundedWindow)]

    Convert values into pairs of (value, window).

  94. def withWindowFn[W <: BoundedWindow](fn: WindowFn[AnyRef, W], options: WindowOptions[W] = WindowOptions()): SCollection[T]

    Window values with the given function.
