package com.spotify.scio.extra

Value Members

  1. object Breeze

    Utilities for Breeze.

    Includes Semigroups for Breeze data types like DenseVector and DenseMatrix.

    import com.spotify.scio.extra.Breeze._
    
    val vectors: SCollection[DenseVector[Double]] = // ...
    vectors.sum  // implicit Semigroup[T]
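
    As a rough illustration of what summing with a Semigroup amounts to, the sketch below reduces a collection of vectors with an element-wise addition. It uses plain Array[Double] values and only the Scala standard library; VectorSemigroupSketch, plus and sum are hypothetical names for this example, not part of Scio, Breeze or Algebird.

    ```scala
    object VectorSemigroupSketch {
      // Hypothetical stand-in for a Semigroup over dense vectors:
      // combines two equal-length vectors element by element.
      def plus(a: Array[Double], b: Array[Double]): Array[Double] = {
        require(a.length == b.length, "vectors must have the same length")
        a.zip(b).map { case (x, y) => x + y }
      }

      // Summing a non-empty collection is a reduce with the semigroup operation.
      def sum(vectors: Seq[Array[Double]]): Array[Double] =
        vectors.reduce((a, b) => plus(a, b))

      def main(args: Array[String]): Unit = {
        val vectors = Seq(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))
        println(sum(vectors).mkString(", ")) // prints 9.0, 12.0
      }
    }
    ```

    Because the operation is associative, partial sums can be computed on separate workers and merged in any grouping, which is what makes this kind of aggregation suitable for a distributed collection.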
  2. object Collections

    Utilities for Scala collection library.

    Adds a top method to Array[T] and Iterable[T] and a topByKey method to Array[(K, V)] and Iterable[(K, V)].

    import com.spotify.scio.extra.Collections._
    
    val xs: Array[(String, Int)] = // ...
    xs.top(5)(Ordering.by(_._2))
    xs.topByKey(5)
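
    The intended semantics of top and topByKey can be sketched with standard collection operations. This is a minimal sketch under assumptions: TopSketch and its methods are hypothetical, they sort the whole input rather than using a bounded heap, and the return types of the real methods may differ.

    ```scala
    object TopSketch {
      // Largest `num` elements under `ord`, in descending order.
      def top[T](xs: Iterable[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
        xs.toSeq.sorted(ord.reverse).take(num)

      // Largest `num` values for each key, in descending order per key.
      def topByKey[K, V](xs: Iterable[(K, V)], num: Int)(implicit ord: Ordering[V]): Map[K, Seq[V]] =
        xs.groupBy(_._1).map { case (k, kvs) => k -> top(kvs.map(_._2), num) }

      def main(args: Array[String]): Unit = {
        val xs = Seq("a" -> 1, "a" -> 5, "b" -> 3, "a" -> 2, "b" -> 7)
        // Top 2 pairs ranked by their second component.
        println(top(xs, 2)(Ordering.by[(String, Int), Int](_._2)))
        // Top 2 values for each key.
        println(topByKey(xs, 2))
      }
    }
    ```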
  3. object Iterators

    Utilities for Scala iterators.

    Adds a timeSeries method to Iterator[T] so that it can be windowed with different logic.

    import com.spotify.scio.extra.Iterators._
    
    case class Event(user: String, action: String, timestamp: Long)
    val i: Iterator[Event] = // ...
    
    // 60-minute fixed windows offset by 30 minutes
    // E.g. minute [30, 90), [90, 150), [150, 210), [210, 270) ...
    i.timeSeries(_.timestamp).fixed(3600000, 1800000)
    
    // session windows with 60-minute gaps between windows
    i.timeSeries(_.timestamp).session(3600000)
    
    // 60-minute sliding windows, one every 10 minutes, offset by 5 minutes
    // E.g. minute [5, 65), [15, 75), [25, 85), [35, 95) ...
    i.timeSeries(_.timestamp).sliding(3600000, 600000, 300000)
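
    The window boundaries in the comments above follow from simple modular arithmetic. The sketch below shows one way to compute them. WindowSketch and its methods are hypothetical helpers written for this illustration, not the library's implementation, and the choice to split a session when the distance equals the gap exactly is an assumption.

    ```scala
    object WindowSketch {
      // Start of the fixed window that contains `timestamp`, for windows of
      // length `size` whose boundaries are shifted by `offset`.
      def fixedWindowStart(timestamp: Long, size: Long, offset: Long): Long =
        timestamp - Math.floorMod(timestamp - offset, size)

      // Starts of all sliding windows of length `size`, created every `period`
      // and shifted by `offset`, that contain `timestamp` (latest start first).
      def slidingWindowStarts(timestamp: Long, size: Long, period: Long, offset: Long): List[Long] = {
        val lastStart = timestamp - Math.floorMod(timestamp - offset, period)
        Iterator.iterate(lastStart)(_ - period).takeWhile(_ > timestamp - size).toList
      }

      // Split ascending timestamps into sessions. Assumption: a new session
      // starts when the distance to the previous timestamp is `gap` or more.
      def sessions(timestamps: Seq[Long], gap: Long): Seq[Seq[Long]] =
        timestamps.foldLeft(Vector.empty[Vector[Long]]) { (acc, t) =>
          acc.lastOption match {
            case Some(last) if t - last.last < gap => acc.init :+ (last :+ t)
            case _                                 => acc :+ Vector(t)
          }
        }

      def main(args: Array[String]): Unit = {
        val minute = 60000L
        // An event at minute 100 falls in the fixed window starting at minute 90.
        println(fixedWindowStart(100 * minute, 60 * minute, 30 * minute) / minute)
        // An event at minute 40 belongs to six overlapping sliding windows.
        println(slidingWindowStarts(40 * minute, 60 * minute, 10 * minute, 5 * minute).map(_ / minute))
        // Events at minutes 0, 10, 100 and 105 form two sessions.
        println(sessions(Seq(0L, 10L, 100L, 105L).map(_ * minute), 60 * minute))
      }
    }
    ```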
  4. package annoy

    Main package for Annoy side input APIs. Import all.

    import com.spotify.scio.extra.annoy._

    Two metrics are available, Angular and Euclidean.

    To save an SCollection[(Int, Array[Float])] to an Annoy file:

    val s = sc.parallelize(Seq(1 -> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f)))

    Save to a temporary location:

    val s1 = s.asAnnoy(Angular, 40, 10)

    Save to a specific location:

    val s2 = s.asAnnoy(Angular, 40, 10, "gs:///")

    SCollection[AnnoyUri] can be converted into a side input:

    val s = sc.parallelize(Seq(1 -> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f)))
    val side = s.asAnnoySideInput(metric, dimension, numTrees)

    There's syntactic sugar for saving an SCollection and converting it to a side input:

    val s = sc
      .parallelize(Seq(1 -> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f)))
      .asAnnoySideInput(metric, dimension, numTrees)

    An existing Annoy file can be converted to a side input directly:

    sc.annoySideInput(metric, dimension, numTrees, "gs:///")

    AnnoyReader provides nearest neighbor lookups by vector as well as item lookups:

    val r = new scala.util.Random()
    val data = (0 until 1000).map(x => (x, Array.fill(40)(r.nextFloat())))
    val main = sc.parallelize(data)
    val side = main.asAnnoySideInput(metric, dimension, numTrees)
    
    main.keys.withSideInputs(side)
      .map { (i, s) =>
        val annoyReader = s(side)
    
        // get vector by item id, allocating a new Array[Float] each time
        val v1 = annoyReader.getItemVector(i)
    
        // get vector by item id, copying it into a pre-allocated Array[Float]
        val v2 = Array.fill(dimension)(-1.0f)
        annoyReader.getItemVector(i, v2)
    
        // get 10 nearest neighbors by vector
        val results = annoyReader.getNearest(v2, 10)
      }
  5. package checkpoint

    Main package for checkpoint API. Import all.

    import com.spotify.scio.extra.checkpoint._
  6. package json

    Main package for JSON APIs. Import all.

    This package uses Circe for JSON handling under the hood.

    import com.spotify.scio.extra.json._
    
    // define a type-safe JSON schema
    case class Record(i: Int, d: Double, s: String)
    
    // read JSON as case classes
    sc.jsonFile[Record]("input.json")
    
    // write case classes as JSON
    sc.parallelize((1 to 10).map(x => Record(x, x.toDouble, x.toString)))
      .saveAsJsonFile("output")
  7. package libsvm

    Main package for reading the LIBSVM format.

    import com.spotify.scio.extra.libsvm._
    
    // Read LIBSVM data as (label, SparseVector) pairs
    sc.libSVMFile("input.svm")
  8. package nn

  9. package sparkey

    Main package for Sparkey side input APIs. Import all.

    import com.spotify.scio.extra.sparkey._

    To save an SCollection[(String, String)] to a Sparkey file:

    val s = sc.parallelize(Seq("a" -> "one", "b" -> "two"))
    
    // temporary location
    val s1: SCollection[SparkeyUri] = s.asSparkey
    
    // specific location
    val s2: SCollection[SparkeyUri] = s.asSparkey("gs:////")

    The resulting SCollection[SparkeyUri] can be converted to a side input:

    val s: SCollection[SparkeyUri] = sc.parallelize(Seq("a" -> "one", "b" -> "two")).asSparkey
    val side: SideInput[SparkeyReader] = s.asSparkeySideInput

    These two steps can be combined with syntactic sugar:

    val side: SideInput[SparkeyReader] = sc
      .parallelize(Seq("a" -> "one", "b" -> "two"))
      .asSparkeySideInput

    An existing Sparkey file can also be converted to a side input directly:

    sc.sparkeySideInput("gs:////")

    SparkeyReader can be used like a lookup table in a side input operation:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val side: SideInput[SparkeyReader] = sc
      .parallelize(Seq("a" -> "one", "b" -> "two"))
      .asSparkeySideInput
    
    main.withSideInputs(side)
      .map { (x, s) =>
        s(side).getOrElse(x, "unknown")
      }
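
    To make the expected result of the lookup above concrete, the sketch below reproduces the same logic with an in-memory Map standing in for the Sparkey-backed reader. LookupSketch and lookupAll are hypothetical names, and no Scio or Sparkey classes are involved.

    ```scala
    object LookupSketch {
      // Resolve each key against the table, falling back to "unknown",
      // mirroring the getOrElse call in the side input example.
      def lookupAll(keys: Seq[String], table: Map[String, String]): Seq[String] =
        keys.map(k => table.getOrElse(k, "unknown"))

      def main(args: Array[String]): Unit = {
        val keys = Seq("a", "b", "c")
        val table = Map("a" -> "one", "b" -> "two")
        println(lookupAll(keys, table)) // prints List(one, two, unknown)
      }
    }
    ```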
