Performs an aggregation over all Rows in this RDD. This is equivalent to a groupBy with no grouping expressions.
schemaRDD.aggregate(Sum('sales) as 'totalSales)
Applies a qualifier to the attributes of this relation. Can be used to disambiguate attributes with the same name, for example, when performing self-joins.
val x = schemaRDD.where('a === 1).as('x)
val y = schemaRDD.where('a === 2).as('y)
x.join(y).where("x.a".attr === "y.a".attr)
:: Experimental :: Return the number of elements in the RDD. Unlike the base RDD implementation of count, this implementation leverages the query optimizer to compute the count on the SchemaRDD, which supports features such as filter pushdown.
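For example, counting after a filter lets the optimizer push the predicate down (the column name is illustrative):
schemaRDD.where('age > 21).count()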
:: Experimental :: Applies the given Generator, or table generating function, to this relation.
generator: a table generating function. The API for such functions is likely to change in future releases.
join: when set to true, each output row of the generator is joined with the input row that produced it.
outer: when set to true, at least one row will be produced for each input row, similar to an OUTER JOIN in SQL. When no output rows are produced by the generator for a given row, a single row will be output, with NULL values for each of the generated columns.
alias: an optional alias that can be used as a qualifier for the attributes that are produced by this generate operation.
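A hedged sketch, assuming Catalyst's Explode as the Generator (its constructor and import path have changed across releases, so the arguments are illustrative):
// produce one output row per element of the sequence-valued 'words column
schemaRDD.generate(Explode(Seq("word"), 'words), join = true, alias = Some("w"))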
Performs a grouping followed by an aggregation.
schemaRDD.groupBy('year)(Sum('sales) as 'totalSales)
:: Experimental :: Appends the rows from this RDD to the specified table.
:: Experimental :: Adds the rows from this RDD to the specified table, optionally overwriting the existing data.
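For example (the table name is illustrative, and the overwrite parameter name assumes the two-argument overload):
schemaRDD.insertInto("sales")                    // append
schemaRDD.insertInto("sales", overwrite = true)  // replace existing data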
Performs a relational join on two SchemaRDDs.
otherPlan: the SchemaRDD that should be joined with this one.
joinType: one of Inner, LeftOuter, RightOuter, or FullOuter. Defaults to Inner.
on: an optional condition for the join operation. This is equivalent to the ON clause in standard SQL. In the case of Inner joins, specifying a condition is equivalent to adding where clauses after the join.
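For example, a left outer self-join combining as with an explicit condition (the column name is illustrative; LeftOuter is assumed to be imported from the Catalyst plans package):
val x = schemaRDD.as('x)
val y = schemaRDD.as('y)
x.join(y, joinType = LeftOuter, on = Some("x.a".attr === "y.a".attr))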
Limits the results by the given integer.
schemaRDD.limit(10)
Sorts the results by the given expressions.
schemaRDD.orderBy('a)
schemaRDD.orderBy('a, 'b)
schemaRDD.orderBy('a.asc, 'b.desc)
Prints out the schema in the tree format.
:: DeveloperApi :: A lazily computed query execution workflow. All other RDD operations are passed through to the RDD that is produced by this workflow. This workflow is produced lazily because invoking the whole query optimization pipeline can be expensive.
The query execution is considered a Developer API as phases may be added or removed in future releases. This execution is only exposed to provide an interface for inspecting the various phases for debugging purposes. Applications should not depend on particular phases existing or producing any specific output, even for exactly the same query.
Additionally, the RDD exposed by this execution is not designed for consumption by end users. In particular, it does not contain any schema information, and it reuses Row objects internally. This object reuse improves performance, but can make programming against the RDD more difficult. Instead, end users should perform RDD operations on a SchemaRDD directly.
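For inspection during debugging, one might print the workflow (a sketch; the exact rendering is version-dependent):
println(schemaRDD.queryExecution)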
Registers this RDD as a temporary table using the given name. The lifetime of this temporary table is tied to the SQLContext that was used to create this SchemaRDD.
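For example, assuming this method is spelled registerTempTable and the creating context is in scope as sqlContext (table and column names are illustrative):
schemaRDD.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 13")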
:: Experimental :: Returns a sampled version of the underlying dataset.
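For example, a 10% sample without replacement (the arguments are illustrative):
schemaRDD.sample(withReplacement = false, fraction = 0.1, seed = 42L)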
Saves the contents of this SchemaRDD as a parquet file, preserving the schema. Files that are written out using this method can be read back in as a SchemaRDD using the parquetFile function.
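For example (the path is illustrative, and sqlContext is assumed to be the creating SQLContext):
schemaRDD.saveAsParquetFile("people.parquet")
val readBack = sqlContext.parquetFile("people.parquet")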
:: Experimental :: Creates a table from the contents of this SchemaRDD. This will fail if the table already exists.
Note that this currently only works with SchemaRDDs that are created from a HiveContext, as there is no notion of a persisted catalog in a standard SQL context. Instead, you can write an RDD out to a parquet file and then register that file as a table. This "table" can then be the target of an insertInto.
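A minimal sketch, assuming this SchemaRDD was created from a HiveContext (the table name is illustrative):
schemaRDD.saveAsTable("people")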
Returns the output schema in the tree format.
Changes the output of this relation to the given expressions, similar to the SELECT clause in SQL.
schemaRDD.select('a, 'b + 'c, 'd as 'aliasedName)
exprs: a set of logical expressions that will be evaluated for each input row.
Returns this RDD as a JavaSchemaRDD.
Returns this RDD as a SchemaRDD. Intended primarily to force the invocation of the implicit conversion from a standard RDD to a SchemaRDD.
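A sketch of forcing the conversion, assuming the implicit is imported from the creating SQLContext (the case class and names are illustrative):
case class Person(name: String, age: Int)
import sqlContext.createSchemaRDD
val people = sc.parallelize(Seq(Person("Alice", 30))).toSchemaRDD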
Combines the tuples of two RDDs with the same schema, keeping duplicates.
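For example (otherSchemaRDD is an illustrative SchemaRDD with the same schema as this one):
schemaRDD.unionAll(otherSchemaRDD)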
:: Experimental :: Filters tuples using a function over a Dynamic version of a given Row. DynamicRows use scala's Dynamic trait to emulate an ORM of a dynamically typed language. Since the type of the column is not known at compile time, all attributes are converted to strings before being passed to the function.
schemaRDD.where(r => r.firstName == "Bob" && r.lastName == "Smith")
Filters tuples using a function over the value of the specified column.
schemaRDD.sfilter('a)((a: Int) => ...)
Filters the output, only returning those rows where condition evaluates to true.
schemaRDD.where('a === 'b)
schemaRDD.where('a === 1)
schemaRDD.where('a + 'b > 10)
Functions that create new queries from SchemaRDDs. The result of all query functions is also a SchemaRDD, allowing multiple operations to be chained using a builder pattern.
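For example, several of the operators documented above can be chained (column names are illustrative):
schemaRDD.where('a > 1).select('b, 'c).orderBy('b.asc).limit(10)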
:: AlphaComponent :: An RDD of Row objects that has an associated schema. In addition to standard RDD functions, SchemaRDDs can be used in relational queries, as shown in the examples below.
Importing a SQLContext brings an implicit into scope that automatically converts a standard RDD whose elements are scala case classes into a SchemaRDD. This conversion can also be done explicitly using the createSchemaRDD function on a SQLContext.
A SchemaRDD can also be created by loading data in from external sources. Examples are loading data from Parquet files by using the parquetFile method on SQLContext, and loading JSON datasets by using the jsonFile and jsonRDD methods on SQLContext.
SQL Queries
A SchemaRDD can be registered as a table in the SQLContext that was used to create it. Once an RDD has been registered as a table, it can be used in the FROM clause of SQL statements.
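For example (names are illustrative, with sqlContext as the creating SQLContext):
schemaRDD.registerTempTable("records")
val results = sqlContext.sql("SELECT key, value FROM records WHERE key > 10")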
Language Integrated Queries