



package extra

Value Members

  1. object Breeze


    Utilities for Breeze.

    Includes Semigroups for Breeze data types such as DenseVector and DenseMatrix.

    import com.spotify.scio.extra.Breeze._
    val vectors: SCollection[DenseVector[Double]] = // ...
    vectors.sum  // implicit Semigroup[T]
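
    To illustrate what an implicit Semigroup contributes to a sum, here is a minimal plain-Scala sketch. It does not use Breeze or Scio: the Semigroup trait, arrayDoubleSemigroup and sumAll are hypothetical stand-ins, and Array[Double] plays the role of DenseVector[Double].

    ```scala
    // Hypothetical stand-in for a Semigroup type class; not the trait Scio actually uses.
    trait Semigroup[T] {
      def plus(a: T, b: T): T
    }

    object VectorSemigroupSketch {
      // Element-wise addition for equal-length vectors.
      implicit val arrayDoubleSemigroup: Semigroup[Array[Double]] =
        new Semigroup[Array[Double]] {
          def plus(a: Array[Double], b: Array[Double]): Array[Double] = {
            require(a.length == b.length, "vectors must have the same length")
            Array.tabulate(a.length)(i => a(i) + b(i))
          }
        }

      // Reduce a non-empty collection with whichever Semigroup is in implicit scope.
      def sumAll[T](xs: Seq[T])(implicit sg: Semigroup[T]): T =
        xs.reduce((a, b) => sg.plus(a, b))

      def main(args: Array[String]): Unit = {
        val vectors = Seq(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))
        println(sumAll(vectors).mkString("[", ", ", "]")) // [9.0, 12.0]
      }
    }
    ```

    Because the operation is associative, partial sums can be combined in any grouping, which is what lets a distributed sum be computed per worker and then merged.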
  2. object Collections


    Utilities for Scala collection library.

    Adds a top method to Array[T] and Iterable[T] and a topByKey method to Array[(K, V)] and Iterable[(K, V)].

    import com.spotify.scio.extra.Collections._
    val xs: Array[(String, Int)] = // ...
    xs.top(5)(Ordering.by(_._2))
    xs.topByKey(5)
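
    One way to picture the top and topByKey methods described above is a bounded heap that retains only the largest elements seen so far. The sketch below is plain Scala under that assumption; TopSketch and its functions are hypothetical and are not the library's implementation.

    ```scala
    import scala.collection.mutable

    object TopSketch {
      // Keep the `num` largest elements under `ord` using a bounded min-heap.
      def top[T](xs: Iterable[T], num: Int)(implicit ord: Ordering[T]): Seq[T] = {
        // Reversing the ordering makes dequeue() drop the smallest retained element.
        val heap = mutable.PriorityQueue.empty[T](ord.reverse)
        xs.foreach { x =>
          heap.enqueue(x)
          if (heap.size > num) heap.dequeue()
        }
        heap.toList.sorted(ord.reverse) // largest first
      }

      // Apply the same selection to the values of each key.
      def topByKey[K, V](xs: Iterable[(K, V)], num: Int)(implicit ord: Ordering[V]): Map[K, Seq[V]] =
        xs.groupBy(_._1).map { case (k, kvs) => k -> top(kvs.map(_._2), num) }

      def main(args: Array[String]): Unit = {
        val xs = Seq("a" -> 1, "a" -> 5, "a" -> 3, "b" -> 2, "b" -> 4)
        println(top(xs, 2)(Ordering.by[(String, Int), Int](_._2))) // List((a,5), (b,4))
        println(topByKey(xs, 2)) // each key mapped to its 2 largest values
      }
    }
    ```

    The heap never holds more than num + 1 elements, so memory stays bounded regardless of input size.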
  3. object Iterators


    Utilities for Scala iterators.

    Adds a timeSeries method to Iterator[T] so that it can be windowed with different logic.

    import com.spotify.scio.extra.Iterators._
    case class Event(user: String, action: String, timestamp: Long)
    val i: Iterator[Event] = // ...
    // 60-minute fixed windows offset by 30 minutes
    // E.g. minute [30, 90), [90, 150), [150, 210), [210, 270) ...
    i.timeSeries(_.timestamp).fixed(3600000, 1800000)
    // session windows with 60-minute gaps between windows
    i.timeSeries(_.timestamp).session(3600000)
    // 60-minute sliding windows, one every 10 minutes, offset by 5 minutes
    // E.g. minute [5, 65), [15, 75), [25, 85), [35, 95) ...
    i.timeSeries(_.timestamp).sliding(3600000, 600000, 300000)
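
    The arithmetic behind fixed windows with an offset can be sketched in plain Scala. This is an illustration only: FixedWindowSketch, windowStart and fixed are hypothetical names, and the sketch assumes the input iterator is already sorted by timestamp.

    ```scala
    object FixedWindowSketch {
      // Start of the fixed window containing `timestamp`, for windows of `size`
      // milliseconds shifted by `offset` milliseconds.
      def windowStart(timestamp: Long, size: Long, offset: Long): Long =
        timestamp - Math.floorMod(timestamp - offset, size)

      // Group an iterator sorted by timestamp into consecutive non-empty windows.
      def fixed[T](it: Iterator[T], size: Long, offset: Long)(ts: T => Long): Iterator[Seq[T]] = {
        val buffered = it.buffered
        new Iterator[Seq[T]] {
          def hasNext: Boolean = buffered.hasNext
          def next(): Seq[T] = {
            val start = windowStart(ts(buffered.head), size, offset)
            val out = Seq.newBuilder[T]
            // Consume every element that falls in [start, start + size).
            while (buffered.hasNext && ts(buffered.head) < start + size) out += buffered.next()
            out.result()
          }
        }
      }

      def main(args: Array[String]): Unit = {
        val minute = 60000L
        val timestamps = Iterator(35L, 80L, 95L, 160L).map(_ * minute)
        // 60-minute windows offset by 30 minutes: [30, 90), [90, 150), [150, 210)
        fixed(timestamps, 60 * minute, 30 * minute)(identity)
          .foreach(w => println(w.map(_ / minute))) // List(35, 80), List(95), List(160)
      }
    }
    ```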
  4. package annoy


    Main package for Annoy side input APIs. Import all.

    import com.spotify.scio.extra.annoy._

    Two metrics are available, Angular and Euclidean.

    To save an SCollection[(Int, Array[Float])] to an Annoy file:

    val s = sc.parallelize(Seq(1 -> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f)))

    Save to a temporary location:

    val s1 = s.asAnnoy(Angular, 40, 10)

    Save to a specific location:

    val s1 = s.asAnnoy(Angular, 40, 10, "gs:///")

    SCollection[AnnoyUri] can be converted into a side input:

    val s = sc.parallelize(Seq(1 -> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f))).asAnnoy(metric, dimension, numTrees)
    val side = s.asAnnoySideInput

    There's syntactic sugar for saving an SCollection and converting it to a side input:

    val s = sc
      .parallelize(Seq(1 -> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f)))
      .asAnnoySideInput(metric, dimension, numTrees)

    An existing Annoy file can be converted to a side input directly:

    sc.annoySideInput(metric, dimension, numTrees, "gs:///")

    AnnoyReader provides nearest neighbor lookups by vector as well as item lookups:

    val r = new scala.util.Random
    val data = (0 until 1000).map(x => (x, Array.fill(40)(r.nextFloat())))
    val main = sc.parallelize(data)
    val side = main.asAnnoySideInput(metric, dimension, numTrees)

    main.keys.withSideInputs(side)
      .map { (i, s) =>
        val annoyReader = s(side)
        // get vector by item id, allocating a new Array[Float] each time
        val v1 = annoyReader.getItemVector(i)
        // get vector by item id, copy vector into pre-allocated Array[Float]
        val v2 = Array.fill(dimension)(-1.0f)
        annoyReader.getItemVector(i, v2)
        // get 10 nearest neighbors by vector
        val results = annoyReader.getNearest(v2, 10)
      }
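
    To make the two metrics named above concrete, the following plain-Scala sketch computes both distances and an exact brute-force nearest-neighbor lookup. It is not how Annoy works internally (Annoy builds trees for approximate search); MetricSketch and its functions are hypothetical, and the angular formula sqrt(2 * (1 - cos(u, v))) follows the definition given in Annoy's own documentation.

    ```scala
    object MetricSketch {
      // Straight-line distance between two vectors of equal length.
      def euclidean(u: Array[Float], v: Array[Float]): Double =
        math.sqrt(u.indices.map { i => val d = (u(i) - v(i)).toDouble; d * d }.sum)

      // Euclidean distance between the normalized vectors: sqrt(2 * (1 - cosine)).
      def angular(u: Array[Float], v: Array[Float]): Double = {
        val dot = u.indices.map(i => u(i).toDouble * v(i)).sum
        val normU = math.sqrt(u.map(x => x.toDouble * x).sum)
        val normV = math.sqrt(v.map(x => x.toDouble * x).sum)
        // Clamp at zero to absorb floating-point rounding.
        math.sqrt(math.max(0.0, 2.0 * (1.0 - dot / (normU * normV))))
      }

      // Exact search: sort every item by distance to the query and keep the first n ids.
      def nearest(items: Seq[(Int, Array[Float])], query: Array[Float], n: Int)(
          dist: (Array[Float], Array[Float]) => Double): Seq[Int] =
        items.sortBy { case (_, v) => dist(query, v) }.take(n).map(_._1)

      def main(args: Array[String]): Unit = {
        val items = Seq(1 -> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f))
        println(nearest(items, Array(2.0f, 1.0f), 1)(euclidean)) // List(2)
        println(nearest(items, Array(2.0f, 1.0f), 1)(angular))   // List(2)
      }
    }
    ```

    Angular distance ignores vector length and compares direction only, so it suits embeddings where magnitude carries no meaning; Euclidean distance is sensitive to both.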
  5. package checkpoint


    Main package for checkpoint API. Import all.

    import com.spotify.scio.extra.checkpoint._
  6. package json


    Main package for JSON APIs. Import all.

    This package uses Circe for JSON handling under the hood.

    import com.spotify.scio.extra.json._
    // define a type-safe JSON schema
    case class Record(i: Int, d: Double, s: String)
    // read JSON as case classes (placeholder path)
    sc.jsonFile[Record]("input.json")
    // write case classes as JSON (placeholder path)
    sc.parallelize((1 to 10).map(x => Record(x, x.toDouble, x.toString)))
      .saveAsJsonFile("output")
  7. package libsvm


    Main package for reading the LIBSVM format.

    import com.spotify.scio.extra.libsvm._
    // Read LIBSVM data as (label, SparseVector) pairs
  8. package nn

  9. package sparkey


    Main package for Sparkey side input APIs. Import all.

    import com.spotify.scio.extra.sparkey._

    To save an SCollection[(String, String)] to a Sparkey file:

    val s = sc.parallelize(Seq("a" -> "one", "b" -> "two"))
    // temporary location
    val s1: SCollection[SparkeyUri] = s.asSparkey
    // specific location
    val s2: SCollection[SparkeyUri] = s.asSparkey("gs:////")

    The resulting SCollection[SparkeyUri] can be converted to a side input:

    val s: SCollection[SparkeyUri] = sc.parallelize(Seq("a" -> "one", "b" -> "two")).asSparkey
    val side: SideInput[SparkeyReader] = s.asSparkeySideInput

    There's syntactic sugar that combines these two steps:

    val side: SideInput[SparkeyReader] = sc
      .parallelize(Seq("a" -> "one", "b" -> "two"))
      .asSparkeySideInput

    An existing Sparkey file can also be converted to a side input directly:

    sc.sparkeySideInput("gs:////")

    SparkeyReader can be used like a lookup table in a side input operation:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val side: SideInput[SparkeyReader] = sc
      .parallelize(Seq("a" -> "one", "b" -> "two"))
      .asSparkeySideInput

    main.withSideInputs(side)
      .map { (x, s) =>
        s(side).getOrElse(x, "unknown")
      }
