Compute aggregates by specifying a series of aggregate columns. By default, this function retains the grouping columns in its output. To drop the grouping columns, set spark.sql.retainGroupColumns to false.
The available aggregate methods are defined in org.apache.spark.sql.functions.
    // Selects the age of the oldest employee and the aggregate expense for each department
    // Scala:
    import org.apache.spark.sql.functions._
    df.groupBy("department").agg(max("age"), sum("expense"))
    // Java:
    import static org.apache.spark.sql.functions.*;
    df.groupBy("department").agg(max("age"), sum("expense"));
Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To restore that behavior, set the config variable spark.sql.retainGroupColumns to false.
    // Scala, 1.3.x:
    df.groupBy("department").agg($"department", max("age"), sum("expense"))
    // Java, 1.3.x:
    df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
Since: 1.3.0
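The effect of spark.sql.retainGroupColumns can be sketched as follows. This is a minimal script-style example (suitable for spark-shell or a Scala worksheet) that assumes the standard Apache Spark Scala API and a local SparkSession; the sample rows, the value names, and the aliases max_age and total_expense are hypothetical. If this class wraps the Spark API, return types may differ.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, sum}

// Assumption: a local session is acceptable for experimentation.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("agg-retain-group-columns")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(
  ("eng", 34, 120.0),
  ("eng", 45, 80.0),
  ("ops", 29, 50.0)
).toDF("department", "age", "expense")

// Default behavior: the grouping column is part of the output.
val withGroup = df.groupBy("department")
  .agg(max("age").as("max_age"), sum("expense").as("total_expense"))

// With the flag disabled, only the aggregate columns are returned.
spark.conf.set("spark.sql.retainGroupColumns", "false")
val withoutGroup = df.groupBy("department")
  .agg(max("age").as("max_age"), sum("expense").as("total_expense"))

// Restore the default so later queries are unaffected.
spark.conf.set("spark.sql.retainGroupColumns", "true")

println(withGroup.columns.mkString(", "))
println(withoutGroup.columns.mkString(", "))
```

Aliasing the aggregates with as keeps the output column names stable regardless of which aggregate function is used.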
Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
    // Selects the age of the oldest employee and the aggregate expense for each department
    df.groupBy("department").agg(Map(
      "age" -> "max",
      "expense" -> "sum"
    ))
Since: 1.3.0
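Because the map form takes method names as strings, there is no place to attach an alias, and the output columns get generated names such as max(age). The sketch below shows one way to rename them afterwards. It assumes the standard Apache Spark Scala API and a local SparkSession; the sample rows and the target names max_age and total_expense are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for experimentation.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("agg-map")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(
  ("eng", 34, 120.0),
  ("eng", 45, 80.0),
  ("ops", 29, 50.0)
).toDF("department", "age", "expense")

// Aggregate with a column-name -> method-name map.
val result = df.groupBy("department").agg(Map(
  "age" -> "max",
  "expense" -> "sum"
))

// The generated names are "max(age)" and "sum(expense)"; rename them
// to something easier to reference downstream.
val renamed = result
  .withColumnRenamed("max(age)", "max_age")
  .withColumnRenamed("sum(expense)", "total_expense")

println(renamed.columns.mkString(", "))
```

A Map holds one entry per key, so this form can apply only one aggregate method to any given column.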
Compute aggregates by specifying the column names and aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
    // Selects the age of the oldest employee and the aggregate expense for each department
    df.groupBy("department").agg(
      "age" -> "max",
      "expense" -> "sum"
    )
Since: 1.3.0
Compute the mean value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When columns are specified, only compute the mean values for them.
Since: 1.3.0
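The difference between averaging every numeric column and averaging only named columns can be sketched as follows. The example assumes the standard Apache Spark Scala API, where this method is exposed as avg, and a local SparkSession; the sample rows and value names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for experimentation.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("grouped-avg")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: one string column and two numeric columns.
val df = Seq(
  ("eng", 34, 120.0),
  ("eng", 45, 80.0),
  ("ops", 29, 50.0)
).toDF("department", "age", "expense")

// No column names: every numeric, non-grouping column is averaged.
val avgAll = df.groupBy("department").avg()

// Explicit column names: only those columns are averaged.
val avgAge = df.groupBy("department").avg("age")

println(avgAll.columns.mkString(", "))
println(avgAge.columns.mkString(", "))
```

The same calling pattern applies to the max, mean, min, and sum methods described in this section.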
Count the number of rows for each group. The resulting DataFrame will also contain the grouping columns.
Since: 1.3.0
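A minimal sketch of counting rows per group, assuming the standard Apache Spark Scala API and a local SparkSession. The sample rows and value names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for experimentation.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("grouped-count")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(
  ("eng", 34),
  ("eng", 45),
  ("ops", 29)
).toDF("department", "age")

// One output row per department, with the row count in a column named "count".
val counts = df.groupBy("department").count()

// Collect into a local map for inspection; the count column is a Long.
val byDept = counts.collect()
  .map(row => (row.getString(0), row.getLong(1)))
  .toMap

println(byDept)
```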
Applies an action to the underlying RelationalGroupedDataset. It is used for transformations that can fail due to an AnalysisException.
Compute the max value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When columns are specified, only compute the max values for them.
Since: 1.3.0
Compute the average value for each numeric column for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. When columns are specified, only compute the average values for them.
Since: 1.3.0
Compute the min value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When columns are specified, only compute the min values for them.
Since: 1.3.0
Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.
    // Compute the sum of earnings for each year by course, with each course as a separate column
    df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum("earnings")
pivotColumn: the column to pivot.
values: List of values that will be translated to columns in the output DataFrame.
Since: 2.4.0
Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.
    // Or without specifying column values (less efficient)
    df.groupBy($"year").pivot($"course").sum("earnings")
pivotColumn: the column to pivot.
Since: 2.4.0
Pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
    // Compute the sum of earnings for each year by course, with each course as a separate column
    df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
    // Or without specifying column values (less efficient)
    df.groupBy("year").pivot("course").sum("earnings")
pivotColumn: Name of the column to pivot.
values: List of values that will be translated to columns in the output DataFrame.
Since: 1.6.0
Pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
    // Compute the sum of earnings for each year by course, with each course as a separate column
    df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
    // Or without specifying column values (less efficient)
    df.groupBy("year").pivot("course").sum("earnings")
pivotColumn: Name of the column to pivot.
Since: 1.6.0
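To make the shape of a pivoted result concrete, here is a small sketch comparing the two versions of pivot on the same data. It assumes the standard Apache Spark Scala API and a local SparkSession; the earnings rows and value names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for experimentation.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pivot-example")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: one row per (year, course) payment.
val df = Seq(
  (2012, "dotNET", 10000),
  (2012, "dotNET", 5000),
  (2012, "Java", 20000),
  (2013, "dotNET", 48000),
  (2013, "Java", 30000)
).toDF("year", "course", "earnings")

// Explicit values: Spark skips the extra pass that discovers distinct
// courses, and the output columns follow the order given here.
val explicitValues = df.groupBy("year")
  .pivot("course", Seq("dotNET", "Java"))
  .sum("earnings")

// Inferred values: more concise, but Spark first computes the distinct
// values of "course" before it can build the output columns.
val inferred = df.groupBy("year").pivot("course").sum("earnings")

explicitValues.show()
inferred.show()
```

Each distinct year becomes one output row, and each pivot value becomes one output column holding the summed earnings for that year and course.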
Compute the sum for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When columns are specified, only compute the sum for them.
Since: 1.3.0
Applies a transformation to the underlying RelationalGroupedDataset. It is used for transformations that can fail due to an AnalysisException.
Unpack the underlying RelationalGroupedDataset into a DataFrame. It is used for transformations that can fail due to an AnalysisException.