T - The type of the DataSet, i.e., the type of the elements of the DataSet.
public abstract class DataSet<T> extends Object
Modifier | Constructor and Description |
---|---|
protected | DataSet(ExecutionEnvironment context, TypeInformation<T> type) |
Modifier and Type | Method and Description |
---|---|
AggregateOperator<T> | aggregate(Aggregations agg, int field) Applies an Aggregate transformation on a non-grouped Tuple DataSet. Note: Only Tuple DataSets can be aggregated. The transformation applies a built-in Aggregation on a specified field of a Tuple DataSet. |
protected static void | checkSameExecutionContext(DataSet<?> set1, DataSet<?> set2) |
<R> CoGroupOperator.CoGroupOperatorSets<T,R> | coGroup(DataSet<R> other) Initiates a CoGroup transformation. A CoGroup transformation combines the elements of two DataSets into one DataSet. |
<R> CrossOperator.DefaultCross<T,R> | cross(DataSet<R> other) Initiates a Cross transformation. A Cross transformation combines the elements of two DataSets into one DataSet. |
<R> CrossOperator.DefaultCross<T,R> | crossWithHuge(DataSet<R> other) Initiates a Cross transformation. A Cross transformation combines the elements of two DataSets into one DataSet. |
<R> CrossOperator.DefaultCross<T,R> | crossWithTiny(DataSet<R> other) Initiates a Cross transformation. A Cross transformation combines the elements of two DataSets into one DataSet. |
DistinctOperator<T> | distinct() Returns a distinct set of a Tuple DataSet using all fields of the tuple. |
DistinctOperator<T> | distinct(int... fields) Returns a distinct set of a Tuple DataSet using field position keys. |
<K> DistinctOperator<T> | distinct(KeySelector<T,K> keyExtractor) Returns a distinct set of a DataSet using a KeySelector function. |
DistinctOperator<T> | distinct(String... fields) Returns a distinct set of a Tuple DataSet using expression keys. |
FilterOperator<T> | filter(FilterFunction<T> filter) Applies a Filter transformation on a DataSet. The transformation calls a RichFilterFunction for each element of the DataSet and retains only those elements for which the function returns true. |
GroupReduceOperator<T,T> | first(int n) Returns a new set containing the first n elements in this DataSet. |
<R> FlatMapOperator<T,R> | flatMap(FlatMapFunction<T,R> flatMapper) Applies a FlatMap transformation on a DataSet. The transformation calls a RichFlatMapFunction for each element of the DataSet. |
ExecutionEnvironment | getExecutionEnvironment() Returns the ExecutionEnvironment in which this DataSet is registered. |
TypeInformation<T> | getType() Returns the TypeInformation for the type of this DataSet. |
UnsortedGrouping<T> | groupBy(int... fields) Groups a Tuple DataSet using field position keys. |
<K> UnsortedGrouping<T> | groupBy(KeySelector<T,K> keyExtractor) Groups a DataSet using a KeySelector function. |
UnsortedGrouping<T> | groupBy(String... fields) Groups a DataSet using field expressions. |
IterativeDataSet<T> | iterate(int maxIterations) Initiates an iterative part of the program that executes multiple times and feeds back data sets. |
<R> DeltaIteration<T,R> | iterateDelta(DataSet<R> workset, int maxIterations, int... keyPositions) Initiates a delta iteration. |
<R> JoinOperator.JoinOperatorSets<T,R> | join(DataSet<R> other) Initiates a Join transformation. |
<R> JoinOperator.JoinOperatorSets<T,R> | join(DataSet<R> other, JoinOperatorBase.JoinHint strategy) Initiates a Join transformation. |
<R> JoinOperator.JoinOperatorSets<T,R> | joinWithHuge(DataSet<R> other) Initiates a Join transformation. A Join transformation joins the elements of two DataSets on key equality and provides multiple ways to combine joining elements into one DataSet. This method also gives the hint to the optimizer that the second DataSet to join is much larger than the first one. This method returns a JoinOperator.JoinOperatorSets on which one of the where methods can be called to define the join key of the first joining (i.e., this) DataSet. |
<R> JoinOperator.JoinOperatorSets<T,R> | joinWithTiny(DataSet<R> other) Initiates a Join transformation. |
<R> MapOperator<T,R> | map(MapFunction<T,R> mapper) Applies a Map transformation on a DataSet. The transformation calls a RichMapFunction for each element of the DataSet. |
<R> MapPartitionOperator<T,R> | mapPartition(MapPartitionFunction<T,R> mapPartition) Applies a Map-style operation to the entire partition of the data. |
AggregateOperator<T> | max(int field) Syntactic sugar for aggregate(MAX, field). |
ReduceOperator<T> | maxBy(int... fields) Applies a special case of a reduce transformation (maxBy) on a non-grouped DataSet. The transformation consecutively calls a ReduceFunction until only a single element remains, which is the result of the transformation. |
AggregateOperator<T> | min(int field) Syntactic sugar for aggregate(MIN, field). |
ReduceOperator<T> | minBy(int... fields) Applies a special case of a reduce transformation (minBy) on a non-grouped DataSet. The transformation consecutively calls a ReduceFunction until only a single element remains, which is the result of the transformation. |
DataSink<T> | output(OutputFormat<T> outputFormat) Emits a DataSet using an OutputFormat. |
PartitionOperator<T> | partitionByHash(int... fields) Hash-partitions a DataSet on the specified key fields. |
<K extends Comparable<K>> PartitionOperator<T> | partitionByHash(KeySelector<T,K> keyExtractor) Partitions a DataSet using the specified KeySelector. |
PartitionOperator<T> | partitionByHash(String... fields) Hash-partitions a DataSet on the specified key fields. |
DataSink<T> | print() Writes a DataSet to the standard output stream (stdout). For each element of the DataSet the result of Object.toString() is written. |
DataSink<T> | printToErr() Writes a DataSet to the standard error stream (stderr). For each element of the DataSet the result of Object.toString() is written. |
ProjectOperator.Projection<T> | project(int... fieldIndexes) Initiates a Project transformation on a Tuple DataSet. Note: Only Tuple DataSets can be projected. The transformation projects each Tuple of the DataSet onto a (sub)set of fields. This method returns a ProjectOperator.Projection on which ProjectOperator.Projection.types(Class) needs to be called to complete the transformation. |
PartitionOperator<T> | rebalance() Enforces a rebalancing of the DataSet, i.e., the DataSet is evenly distributed over all parallel instances of the following task. |
ReduceOperator<T> | reduce(ReduceFunction<T> reducer) Applies a Reduce transformation on a non-grouped DataSet. The transformation consecutively calls a RichReduceFunction until only a single element remains, which is the result of the transformation. |
<R> GroupReduceOperator<T,R> | reduceGroup(GroupReduceFunction<T,R> reducer) Applies a GroupReduce transformation on a non-grouped DataSet. The transformation calls a RichGroupReduceFunction once with the full DataSet. |
<X> DataSet<X> | runOperation(CustomUnaryOperation<T,X> operation) Runs a CustomUnaryOperation on the data set. |
AggregateOperator<T> | sum(int field) Syntactic sugar for aggregate(SUM, field). |
UnionOperator<T> | union(DataSet<T> other) Creates a union of this DataSet with another DataSet. |
DataSink<T> | write(FileOutputFormat<T> outputFormat, String filePath) Writes a DataSet using a FileOutputFormat to a specified location. |
DataSink<T> | write(FileOutputFormat<T> outputFormat, String filePath, FileSystem.WriteMode writeMode) Writes a DataSet using a FileOutputFormat to a specified location. |
DataSink<T> | writeAsCsv(String filePath) Writes a Tuple DataSet as a CSV file to the specified location. Note: Only a Tuple DataSet can be written as a CSV file. For each Tuple field the result of Object.toString() is written. |
DataSink<T> | writeAsCsv(String filePath, FileSystem.WriteMode writeMode) Writes a Tuple DataSet as a CSV file to the specified location. Note: Only a Tuple DataSet can be written as a CSV file. For each Tuple field the result of Object.toString() is written. |
DataSink<T> | writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter) Writes a Tuple DataSet as a CSV file to the specified location with the specified field and line delimiters. Note: Only a Tuple DataSet can be written as a CSV file. For each Tuple field the result of Object.toString() is written. |
DataSink<T> | writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter, FileSystem.WriteMode writeMode) Writes a Tuple DataSet as a CSV file to the specified location with the specified field and line delimiters. Note: Only a Tuple DataSet can be written as a CSV file. For each Tuple field the result of Object.toString() is written. |
DataSink<String> | writeAsFormattedText(String filePath, FileSystem.WriteMode writeMode, TextOutputFormat.TextFormatter<T> formatter) Writes a DataSet as a text file to the specified location. For each element of the DataSet the result of TextOutputFormat.TextFormatter.format(Object) is written. |
DataSink<String> | writeAsFormattedText(String filePath, TextOutputFormat.TextFormatter<T> formatter) Writes a DataSet as a text file to the specified location. For each element of the DataSet the result of TextOutputFormat.TextFormatter.format(Object) is written. |
DataSink<T> | writeAsText(String filePath) Writes a DataSet as a text file to the specified location. For each element of the DataSet the result of Object.toString() is written. |
DataSink<T> | writeAsText(String filePath, FileSystem.WriteMode writeMode) Writes a DataSet as a text file to the specified location. For each element of the DataSet the result of Object.toString() is written. |
protected DataSet(ExecutionEnvironment context, TypeInformation<T> type)
public ExecutionEnvironment getExecutionEnvironment()
Returns the ExecutionEnvironment in which this DataSet is registered.
See Also: ExecutionEnvironment
public TypeInformation<T> getType()
Returns the TypeInformation for the type of this DataSet.
See Also: TypeInformation
public <R> MapOperator<T,R> map(MapFunction<T,R> mapper)
Applies a Map transformation on a DataSet.
The transformation calls a RichMapFunction for each element of the DataSet. Each MapFunction call returns exactly one element.
mapper - The MapFunction that is called for each element of the DataSet.
See Also: RichMapFunction, MapOperator, DataSet
public <R> MapPartitionOperator<T,R> mapPartition(MapPartitionFunction<T,R> mapPartition)
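Applies a Map-style operation to the entire partition of the data.
To transform individual elements, the use of map() and flatMap() is preferable.
mapPartition - The MapPartitionFunction that is called for the full DataSet.
See Also: MapPartitionFunction, MapPartitionOperator, DataSet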
public <R> FlatMapOperator<T,R> flatMap(FlatMapFunction<T,R> flatMapper)
Applies a FlatMap transformation on a DataSet.
The transformation calls a RichFlatMapFunction for each element of the DataSet. Each FlatMapFunction call can return any number of elements, including none.
flatMapper - The FlatMapFunction that is called for each element of the DataSet.
See Also: RichFlatMapFunction, FlatMapOperator, DataSet
public FilterOperator<T> filter(FilterFunction<T> filter)
Applies a Filter transformation on a DataSet.
The transformation calls a RichFilterFunction for each element of the DataSet and retains only those elements for which the function returns true. Elements for which the function returns false are filtered out.
filter - The FilterFunction that is called for each element of the DataSet.
See Also: RichFilterFunction, FilterOperator, DataSet
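The per-element contracts of map, flatMap, and filter can be modelled on an in-memory list with `java.util.stream`. This is only a sketch of the semantics, not Flink code: the class `ElementwiseSketch` and its methods are hypothetical names, and nothing here is distributed or lazily executed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ElementwiseSketch {
    // map: exactly one output element per input element.
    static List<Integer> lengths(List<String> lines) {
        return lines.stream().map(String::length).collect(Collectors.toList());
    }

    // flatMap: any number of output elements per input element, including none.
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")).filter(w -> !w.isEmpty()))
                .collect(Collectors.toList());
    }

    // filter: keep only the elements for which the predicate returns true.
    static List<String> nonEmpty(List<String> lines) {
        return lines.stream().filter(line -> !line.isEmpty()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be", "", "or not");
        System.out.println(lengths(lines));
        System.out.println(words(lines));
        System.out.println(nonEmpty(lines));
    }
}
```

The empty line contributes one element to the map result, no elements to the flatMap result, and is dropped by the filter.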
public ProjectOperator.Projection<T> project(int... fieldIndexes)
Initiates a Project transformation on a Tuple DataSet.
Note: Only Tuple DataSets can be projected.
The transformation projects each Tuple of the DataSet onto a (sub)set of fields.
This method returns a ProjectOperator.Projection on which ProjectOperator.Projection.types(Class) needs to be called to complete the transformation.
fieldIndexes - The field indexes of the input tuples that are retained. The order of fields in the output tuple corresponds to the order of field indexes.
Returns a ProjectOperator to complete the Project transformation by calling ProjectOperator.Projection.types(Class).
See Also: Tuple, DataSet, ProjectOperator.Projection, ProjectOperator
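The rule that the output field order follows the order of the field indexes can be illustrated with a plain array standing in for a Tuple. This is a sketch under that assumption; `ProjectSketch.project` is a hypothetical helper, not the Flink method.

```java
import java.util.Arrays;

class ProjectSketch {
    // Keeps the fields at the given indexes, in the order the indexes are listed.
    static Object[] project(Object[] tuple, int... fieldIndexes) {
        Object[] projected = new Object[fieldIndexes.length];
        for (int i = 0; i < fieldIndexes.length; i++) {
            projected[i] = tuple[fieldIndexes[i]];
        }
        return projected;
    }

    public static void main(String[] args) {
        Object[] tuple = {"alice", 42, 3.5};
        // Indexes (2, 0) select the third field first, then the first field.
        System.out.println(Arrays.toString(project(tuple, 2, 0)));
    }
}
```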
public AggregateOperator<T> aggregate(Aggregations agg, int field)
Applies an Aggregate transformation on a non-grouped Tuple DataSet.
Note: Only Tuple DataSets can be aggregated.
The transformation applies a built-in Aggregation on a specified field of a Tuple DataSet. Additional aggregation functions can be added to the resulting AggregateOperator by calling AggregateOperator.and(Aggregations, int).
agg - The built-in aggregation function that is computed.
field - The index of the Tuple field on which the aggregation function is applied.
See Also: Tuple, Aggregations, AggregateOperator, DataSet
public AggregateOperator<T> sum(int field)
Syntactic sugar for aggregate(SUM, field).
field - The index of the Tuple field on which the aggregation function is applied.
See Also: AggregateOperator
public AggregateOperator<T> max(int field)
Syntactic sugar for aggregate(MAX, field).
field - The index of the Tuple field on which the aggregation function is applied.
See Also: AggregateOperator
public AggregateOperator<T> min(int field)
Syntactic sugar for aggregate(MIN, field).
field - The index of the Tuple field on which the aggregation function is applied.
See Also: AggregateOperator
public ReduceOperator<T> reduce(ReduceFunction<T> reducer)
Applies a Reduce transformation on a non-grouped DataSet.
The transformation consecutively calls a RichReduceFunction until only a single element remains, which is the result of the transformation. A ReduceFunction combines two elements into one new element of the same type.
reducer - The ReduceFunction that is applied on the DataSet.
See Also: RichReduceFunction, ReduceOperator, DataSet
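The "combine two elements into one until a single element remains" contract can be sketched on an in-memory list. This is an illustration only: `ReduceSketch` is a hypothetical class, it combines strictly left to right, and it rejects empty input, neither of which is stated by the text above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

class ReduceSketch {
    // Repeatedly combines two elements into one until a single element remains.
    static <T> T reduce(List<T> elements, BinaryOperator<T> reducer) {
        if (elements.isEmpty()) {
            // Simplification of this sketch: an empty input is treated as an error.
            throw new IllegalArgumentException("nothing to reduce");
        }
        T accumulated = elements.get(0);
        for (T next : elements.subList(1, elements.size())) {
            accumulated = reducer.apply(accumulated, next);
        }
        return accumulated;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(3, 1, 4, 2);
        System.out.println(reduce(values, Integer::sum)); // sum of all elements
        System.out.println(reduce(values, Integer::min)); // smallest element
    }
}
```

Because a distributed runtime may combine elements in any order, a reducer used this way is normally expected to be associative and commutative, as `Integer::sum` and `Integer::min` are.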
public <R> GroupReduceOperator<T,R> reduceGroup(GroupReduceFunction<T,R> reducer)
Applies a GroupReduce transformation on a non-grouped DataSet.
The transformation calls a RichGroupReduceFunction once with the full DataSet. The GroupReduceFunction can iterate over all elements of the DataSet and emit any number of output elements, including none.
reducer - The GroupReduceFunction that is applied on the DataSet.
See Also: RichGroupReduceFunction, GroupReduceOperator, DataSet
public ReduceOperator<T> minBy(int... fields)
Applies a special case of a reduce transformation (minBy) on a non-grouped DataSet.
The transformation consecutively calls a ReduceFunction until only a single element remains, which is the result of the transformation. A ReduceFunction combines two elements into one new element of the same type.
fields - Keys taken into account for finding the minimum.
Returns a ReduceOperator representing the minimum.
public ReduceOperator<T> maxBy(int... fields)
Applies a special case of a reduce transformation (maxBy) on a non-grouped DataSet.
The transformation consecutively calls a ReduceFunction until only a single element remains, which is the result of the transformation. A ReduceFunction combines two elements into one new element of the same type.
fields - Keys taken into account for finding the maximum.
Returns a ReduceOperator representing the maximum.
public GroupReduceOperator<T,T> first(int n)
Returns a new set containing the first n elements in this DataSet.
n - The desired number of elements.
public <K> DistinctOperator<T> distinct(KeySelector<T,K> keyExtractor)
Returns a distinct set of a DataSet using a KeySelector function.
The KeySelector function is called for each element of the DataSet and extracts a single key value on which the decision is made whether two items are distinct.
keyExtractor - The KeySelector function which extracts the key values from the DataSet on which the distinction of the DataSet is decided.
public DistinctOperator<T> distinct(int... fields)
Returns a distinct set of a Tuple DataSet using field position keys.
The field position keys specify the fields of Tuples on which the decision is made whether two Tuples are distinct.
Note: Field position keys can only be specified for Tuple DataSets.
fields - One or more field positions on which the distinction of the DataSet is decided.
public DistinctOperator<T> distinct(String... fields)
Returns a distinct set of a Tuple DataSet using expression keys.
The field expression keys specify the fields of Tuples or Pojos on which the decision is made whether two elements are distinct.
fields - One or more field expressions on which the distinction of the DataSet is decided.
public DistinctOperator<T> distinct()
Returns a distinct set of a Tuple DataSet using all fields of the tuple.
Note: This operator can only be applied to Tuple DataSets.
public <K> UnsortedGrouping<T> groupBy(KeySelector<T,K> keyExtractor)
Groups a DataSet using a KeySelector function.
The KeySelector function is called for each element of the DataSet and extracts a single key value on which the DataSet is grouped.
This method returns an UnsortedGrouping on which one of the following grouping transformations can be applied:
- UnsortedGrouping.sortGroup(int, org.apache.flink.api.common.operators.Order) to get a SortedGrouping.
- UnsortedGrouping.aggregate(Aggregations, int) to apply an Aggregate transformation.
- UnsortedGrouping.reduce(org.apache.flink.api.common.functions.ReduceFunction) to apply a Reduce transformation.
- UnsortedGrouping.reduceGroup(org.apache.flink.api.common.functions.GroupReduceFunction) to apply a GroupReduce transformation.
keyExtractor - The KeySelector function which extracts the key values from the DataSet on which it is grouped.
See Also: KeySelector, UnsortedGrouping, AggregateOperator, ReduceOperator, GroupReduceOperator, DataSet
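Grouping by a key selector amounts to calling the selector once per element and collecting elements with equal keys. The sketch below models that with `Collectors.groupingBy` on an in-memory list; `GroupBySketch` is a hypothetical class, and the sorted `TreeMap` is chosen only to make the printed output deterministic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class GroupBySketch {
    // Calls the key selector once per element and collects elements with equal keys.
    static <T, K extends Comparable<K>> Map<K, List<T>> groupBy(
            List<T> elements, Function<T, K> keySelector) {
        return elements.stream()
                .collect(Collectors.groupingBy(keySelector, TreeMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "banana", "avocado");
        // Group words by their first letter; each group could then be reduced or aggregated.
        Map<String, List<String>> groups = groupBy(words, w -> w.substring(0, 1));
        groups.forEach((key, group) -> System.out.println(key + " -> " + group.size()));
    }
}
```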
public UnsortedGrouping<T> groupBy(int... fields)
Groups a Tuple DataSet using field position keys.
This method returns an UnsortedGrouping on which one of the following grouping transformations can be applied:
- UnsortedGrouping.sortGroup(int, org.apache.flink.api.common.operators.Order) to get a SortedGrouping.
- UnsortedGrouping.aggregate(Aggregations, int) to apply an Aggregate transformation.
- UnsortedGrouping.reduce(org.apache.flink.api.common.functions.ReduceFunction) to apply a Reduce transformation.
- UnsortedGrouping.reduceGroup(org.apache.flink.api.common.functions.GroupReduceFunction) to apply a GroupReduce transformation.
fields - One or more field positions on which the DataSet will be grouped.
See Also: Tuple, UnsortedGrouping, AggregateOperator, ReduceOperator, GroupReduceOperator, DataSet
public UnsortedGrouping<T> groupBy(String... fields)
Groups a DataSet using field expressions. A field expression is either the name of a public field or a getter method with parentheses of the DataSet's underlying type. A dot can be used to drill down into objects, as in "field1.getInnerField2()".
This method returns an UnsortedGrouping on which one of the following grouping transformations can be applied:
- UnsortedGrouping.sortGroup(int, org.apache.flink.api.common.operators.Order) to get a SortedGrouping.
- UnsortedGrouping.aggregate(Aggregations, int) to apply an Aggregate transformation.
- UnsortedGrouping.reduce(org.apache.flink.api.common.functions.ReduceFunction) to apply a Reduce transformation.
- UnsortedGrouping.reduceGroup(org.apache.flink.api.common.functions.GroupReduceFunction) to apply a GroupReduce transformation.
fields - One or more field expressions on which the DataSet will be grouped.
See Also: Tuple, UnsortedGrouping, AggregateOperator, ReduceOperator, GroupReduceOperator, DataSet
public <R> JoinOperator.JoinOperatorSets<T,R> join(DataSet<R> other)
Initiates a Join transformation.
A Join transformation joins the elements of two DataSets on key equality and provides multiple ways to combine joining elements into one DataSet.
This method returns a JoinOperator.JoinOperatorSets on which one of the where methods can be called to define the join key of the first joining (i.e., this) DataSet.
other - The other DataSet with which this DataSet is joined.
See Also: JoinOperator.JoinOperatorSets, DataSet
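Joining on key equality can be sketched as an in-memory hash join: index one input by key, then probe it with each element of the other. This is one possible strategy shown for illustration, not a description of what Flink's optimizer does; `JoinSketch` is a hypothetical class and `Map.Entry` stands in for a pair of joined elements.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class JoinSketch {
    // Emits one pair for every combination of a left and a right element with equal keys.
    static <L, R, K> List<Map.Entry<L, R>> join(
            List<L> left, List<R> right,
            Function<L, K> leftKey, Function<R, K> rightKey) {
        // Build phase: index the right input by its join key.
        Map<K, List<R>> table = new HashMap<>();
        for (R r : right) {
            table.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
        }
        // Probe phase: look up each left element's key in the index.
        List<Map.Entry<L, R>> joined = new ArrayList<>();
        for (L l : left) {
            List<R> matches = table.getOrDefault(leftKey.apply(l), Collections.emptyList());
            for (R r : matches) {
                joined.add(new AbstractMap.SimpleEntry<>(l, r));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3);
        List<String> rows = Arrays.asList("1:x", "3:y", "3:z");
        // Id 2 has no partner and is dropped; id 3 matches two rows.
        System.out.println(join(ids, rows, id -> id, row -> Integer.parseInt(row.substring(0, 1))));
    }
}
```

Building the index on the smaller input keeps it cheap to hold in memory, which is the kind of decision a size hint about one input can inform.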
public <R> JoinOperator.JoinOperatorSets<T,R> join(DataSet<R> other, JoinOperatorBase.JoinHint strategy)
Initiates a Join transformation.
A Join transformation joins the elements of two DataSets on key equality and provides multiple ways to combine joining elements into one DataSet.
This method returns a JoinOperator.JoinOperatorSets on which one of the where methods can be called to define the join key of the first joining (i.e., this) DataSet.
other - The other DataSet with which this DataSet is joined.
strategy - The strategy that should be used to execute the join. If null is given, the optimizer will pick the join strategy.
See Also: JoinOperator.JoinOperatorSets, DataSet
public <R> JoinOperator.JoinOperatorSets<T,R> joinWithTiny(DataSet<R> other)
Initiates a Join transformation.
A Join transformation joins the elements of two DataSets on key equality and provides multiple ways to combine joining elements into one DataSet.
This method also gives the hint to the optimizer that the second DataSet to join is much smaller than the first one.
This method returns a JoinOperator.JoinOperatorSets on which JoinOperator.JoinOperatorSets.where(String...) needs to be called to define the join key of the first joining (i.e., this) DataSet.
other - The other DataSet with which this DataSet is joined.
See Also: JoinOperator.JoinOperatorSets, DataSet
public <R> JoinOperator.JoinOperatorSets<T,R> joinWithHuge(DataSet<R> other)
Initiates a Join transformation.
A Join transformation joins the elements of two DataSets on key equality and provides multiple ways to combine joining elements into one DataSet.
This method also gives the hint to the optimizer that the second DataSet to join is much larger than the first one.
This method returns a JoinOperator.JoinOperatorSets on which one of the where methods can be called to define the join key of the first joining (i.e., this) DataSet.
other - The other DataSet with which this DataSet is joined.
See Also: JoinOperator.JoinOperatorSets, DataSet
public <R> CoGroupOperator.CoGroupOperatorSets<T,R> coGroup(DataSet<R> other)
Initiates a CoGroup transformation.
A CoGroup transformation combines the elements of two DataSets into one DataSet. It groups each DataSet individually on a key and passes the groups of both DataSets with equal keys together to a RichCoGroupFunction.
If a DataSet has a group with no matching key in the other DataSet, the CoGroupFunction is called with an empty group for the non-existing group.
The CoGroupFunction can iterate over the elements of both groups and return any number of elements, including none.
This method returns a CoGroupOperator.CoGroupOperatorSets on which one of the where methods can be called to define the join key of the first joining (i.e., this) DataSet.
other - The other DataSet of the CoGroup transformation.
See Also: CoGroupOperator.CoGroupOperatorSets, CoGroupOperator, DataSet
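The difference from a join is that a key present in only one input still triggers a call, with an empty group for the missing side. The sketch below models this on in-memory lists. `CoGroupSketch` is a hypothetical class, and as a simplification the function returns exactly one result per key instead of emitting any number of elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.Function;

class CoGroupSketch {
    static <A, B, K, O> List<O> coGroup(
            List<A> first, List<B> second,
            Function<A, K> firstKey, Function<B, K> secondKey,
            BiFunction<List<A>, List<B>, O> coGrouper) {
        // Group each input individually on its key.
        Map<K, List<A>> firstGroups = new LinkedHashMap<>();
        for (A a : first) {
            firstGroups.computeIfAbsent(firstKey.apply(a), k -> new ArrayList<>()).add(a);
        }
        Map<K, List<B>> secondGroups = new LinkedHashMap<>();
        for (B b : second) {
            secondGroups.computeIfAbsent(secondKey.apply(b), k -> new ArrayList<>()).add(b);
        }
        // Visit every key that occurs in either input.
        Set<K> keys = new LinkedHashSet<>(firstGroups.keySet());
        keys.addAll(secondGroups.keySet());

        List<O> results = new ArrayList<>();
        for (K key : keys) {
            // A side without this key contributes an empty group.
            List<A> firstGroup = firstGroups.getOrDefault(key, Collections.emptyList());
            List<B> secondGroup = secondGroups.getOrDefault(key, Collections.emptyList());
            results.add(coGrouper.apply(firstGroup, secondGroup));
        }
        return results;
    }

    // Keys are the first letters "a", "b", "c"; the result reports both group sizes per key.
    static List<String> demo() {
        return coGroup(
                Arrays.asList("a1", "a2", "b1"),
                Arrays.asList("b9", "c9"),
                s -> s.substring(0, 1),
                s -> s.substring(0, 1),
                (firstGroup, secondGroup) -> firstGroup.size() + "/" + secondGroup.size());
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```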
public <R> CrossOperator.DefaultCross<T,R> cross(DataSet<R> other)
Initiates a Cross transformation.
A Cross transformation combines the elements of two DataSets into one DataSet. It builds all pair combinations of elements of both DataSets, i.e., it builds a Cartesian product.
The resulting CrossOperator.DefaultCross wraps each pair of crossed elements into a Tuple2, with the element of the first input being the first field of the tuple and the element of the second input being the second field of the tuple.
Call CrossOperator.DefaultCross.with(org.apache.flink.api.common.functions.CrossFunction) to define a CrossFunction which is called for each pair of crossed elements. The CrossFunction returns exactly one element for each pair of input elements.
other - The other DataSet with which this DataSet is crossed.
See Also: CrossOperator.DefaultCross, CrossFunction, DataSet, Tuple2
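The Cartesian product built by a cross can be modelled with two nested loops. In this sketch `Map.Entry` stands in for Tuple2 and `CrossSketch` is a hypothetical class; the output size is the product of the input sizes, which is why hints about a tiny or huge input matter.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class CrossSketch {
    // Builds every (first, second) pair: first-input element as key, second-input element as value.
    static <A, B> List<Map.Entry<A, B>> cross(List<A> first, List<B> second) {
        List<Map.Entry<A, B>> pairs = new ArrayList<>();
        for (A a : first) {
            for (B b : second) {
                pairs.add(new AbstractMap.SimpleEntry<>(a, b));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // 2 x 3 inputs yield 6 pairs.
        System.out.println(cross(Arrays.asList(1, 2), Arrays.asList("x", "y", "z")));
    }
}
```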
public <R> CrossOperator.DefaultCross<T,R> crossWithTiny(DataSet<R> other)
Initiates a Cross transformation.
A Cross transformation combines the elements of two DataSets into one DataSet. It builds all pair combinations of elements of both DataSets, i.e., it builds a Cartesian product.
This method also gives the hint to the optimizer that the second DataSet to cross is much smaller than the first one.
The resulting CrossOperator.DefaultCross wraps each pair of crossed elements into a Tuple2, with the element of the first input being the first field of the tuple and the element of the second input being the second field of the tuple.
Call CrossOperator.DefaultCross.with(org.apache.flink.api.common.functions.CrossFunction) to define a CrossFunction which is called for each pair of crossed elements. The CrossFunction returns exactly one element for each pair of input elements.
other - The other DataSet with which this DataSet is crossed.
See Also: CrossOperator.DefaultCross, CrossFunction, DataSet, Tuple2
public <R> CrossOperator.DefaultCross<T,R> crossWithHuge(DataSet<R> other)
Initiates a Cross transformation.
A Cross transformation combines the elements of two DataSets into one DataSet. It builds all pair combinations of elements of both DataSets, i.e., it builds a Cartesian product.
This method also gives the hint to the optimizer that the second DataSet to cross is much larger than the first one.
The resulting CrossOperator.DefaultCross wraps each pair of crossed elements into a Tuple2, with the element of the first input being the first field of the tuple and the element of the second input being the second field of the tuple.
Call CrossOperator.DefaultCross.with(org.apache.flink.api.common.functions.CrossFunction) to define a CrossFunction which is called for each pair of crossed elements. The CrossFunction returns exactly one element for each pair of input elements.
other - The other DataSet with which this DataSet is crossed.
See Also: CrossOperator.DefaultCross, CrossFunction, DataSet, Tuple2
public IterativeDataSet<T> iterate(int maxIterations)
Initiates an iterative part of the program that executes multiple times and feeds back data sets.
The iterative part needs to be closed by calling IterativeDataSet.closeWith(DataSet). The data set given to the closeWith(DataSet) method is the data set that will be fed back and used as the input to the next iteration. The return value of the closeWith(DataSet) method is the resulting data set after the iteration has terminated.
An example of an iterative computation is as follows:
DataSet<Double> input = ...;
IterativeDataSet<Double> startOfIteration = input.iterate(10);
DataSet<Double> toBeFedBack = startOfIteration
    .map(new MyMapper())
    .groupBy(...).reduceGroup(new MyReducer());
DataSet<Double> result = startOfIteration.closeWith(toBeFedBack);
The iteration has a maximum number of times that it executes. A dynamic termination can be realized by using a termination criterion (see IterativeDataSet.closeWith(DataSet, DataSet)).
maxIterations - The maximum number of times that the iteration is executed.
See Also: IterativeDataSet.closeWith(DataSet), IterativeDataSet
public <R> DeltaIteration<T,R> iterateDelta(DataSet<R> workset, int maxIterations, int... keyPositions)
Initiates a delta iteration.
A delta iteration is similar to a regular iteration (as started by iterate(int)), but maintains state across the individual iteration steps. The solution set, which represents the current state at the beginning of each iteration, can be obtained via DeltaIteration.getSolutionSet(). It can be accessed by joining (or CoGrouping) with it. The DataSet that represents the workset of an iteration can be obtained via DeltaIteration.getWorkset().
The solution set is updated by producing a delta for it, which is merged into the solution set at the end of each iteration step.
The delta iteration must be closed by calling DeltaIteration.closeWith(DataSet, DataSet). The two parameters are the delta for the solution set and the new workset (the data set that will be fed back). The return value of the closeWith(DataSet, DataSet) method is the resulting data set after the iteration has terminated. Delta iterations terminate when the fed-back data set (the workset) is empty. In addition, a maximum number of steps is given as a fallback termination guard.
Elements in the solution set are uniquely identified by a key. When the solution set delta is merged, elements already contained in the solution set are replaced by delta elements with the same key.
NOTE: Delta iterations currently support only tuple-valued data types. This restriction will be removed in the future. The key is specified by the tuple position.
A code example for a delta iteration is as follows:
DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
    initialState.iterateDelta(initialFeedbackSet, 100, 0);
DataSet<Tuple2<Long, Long>> delta = iteration.getWorkset().groupBy(0).aggregate(Aggregations.AVG, 1)
    .join(iteration.getSolutionSet()).where(0).equalTo(0)
    .flatMap(new ProjectAndFilter());
DataSet<Tuple2<Long, Long>> feedBack = delta.join(someOtherSet).where(...).equalTo(...).with(...);
// close the delta iteration with the solution set delta and the new workset
DataSet<Tuple2<Long, Long>> result = iteration.closeWith(delta, feedBack);
workset - The initial version of the data set that is fed back to the next iteration step (the workset).
maxIterations - The maximum number of iteration steps, as a fallback safeguard.
keyPositions - The positions of the tuple fields that are used as the key of the solution set.
See Also: DeltaIteration
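The interplay of solution set, workset, and delta can be modelled in plain Java. The sketch below is an in-memory illustration, not Flink code: it computes connected components by propagating the smallest vertex id, with the solution set as a keyed map, the workset as the set of vertices changed in the previous step, and termination on an empty workset or after `maxIterations` steps. The class `DeltaIterationSketch`, the sample graph, and the choice of algorithm are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class DeltaIterationSketch {
    // Assumes every vertex, including every neighbor, appears as a key of the adjacency map.
    static Map<Integer, Integer> components(Map<Integer, List<Integer>> neighbors, int maxIterations) {
        // Solution set: vertex id (the key) -> current component id.
        Map<Integer, Integer> solution = new TreeMap<>();
        for (int vertex : neighbors.keySet()) {
            solution.put(vertex, vertex);
        }
        // Initial workset: every vertex.
        Set<Integer> workset = new TreeSet<>(neighbors.keySet());

        for (int step = 0; step < maxIterations && !workset.isEmpty(); step++) {
            // The delta is computed against the solution set as of the start of this step.
            Map<Integer, Integer> delta = new TreeMap<>();
            for (int vertex : workset) {
                int label = solution.get(vertex);
                for (int neighbor : neighbors.get(vertex)) {
                    int current = delta.getOrDefault(neighbor, solution.get(neighbor));
                    if (label < current) {
                        delta.put(neighbor, label);
                    }
                }
            }
            // Merge: delta entries replace solution-set entries with the same key.
            solution.putAll(delta);
            // Only the changed vertices are fed back as the next workset.
            workset = new TreeSet<>(delta.keySet());
        }
        return solution;
    }

    // Two components: {1, 2, 3} connected as a path, and {4, 5}.
    static Map<Integer, List<Integer>> sampleGraph() {
        Map<Integer, List<Integer>> graph = new TreeMap<>();
        graph.put(1, Arrays.asList(2));
        graph.put(2, Arrays.asList(1, 3));
        graph.put(3, Arrays.asList(2));
        graph.put(4, Arrays.asList(5));
        graph.put(5, Arrays.asList(4));
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(components(sampleGraph(), 100));
    }
}
```

On the sample graph the workset shrinks each step and becomes empty after three steps, so the limit of 100 is never reached; with a limit of 1 the loop stops early and vertex 3 is still labelled 2.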
public <X> DataSet<X> runOperation(CustomUnaryOperation<T,X> operation)
Runs a CustomUnaryOperation on the data set. Custom operations are typically complex operators that are composed of multiple steps.
operation - The operation to run.
public UnionOperator<T> union(DataSet<T> other)
Creates a union of this DataSet with another DataSet.
other - The other DataSet which is unioned with the current DataSet.
public PartitionOperator<T> partitionByHash(int... fields)
Hash-partitions a DataSet on the specified key fields.
Important: This operation shuffles the whole DataSet over the network and can take a significant amount of time.
fields - The field indexes on which the DataSet is hash-partitioned.
public PartitionOperator<T> partitionByHash(String... fields)
Hash-partitions a DataSet on the specified key fields.
Important: This operation shuffles the whole DataSet over the network and can take a significant amount of time.
fields - The field expressions on which the DataSet is hash-partitioned.
public <K extends Comparable<K>> PartitionOperator<T> partitionByHash(KeySelector<T,K> keyExtractor)
Partitions a DataSet using the specified KeySelector.
Important: This operation shuffles the whole DataSet over the network and can take a significant amount of time.
keyExtractor - The KeySelector with which the DataSet is hash-partitioned.
See Also: KeySelector
public PartitionOperator<T> rebalance()
Enforces a rebalancing of the DataSet, i.e., the DataSet is evenly distributed over all parallel instances of the following task.
Important: This operation shuffles the whole DataSet over the network and can take a significant amount of time.
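The two partitioning styles differ in what they guarantee: hash partitioning keeps equal keys together, while rebalancing only evens out partition sizes. The sketch below illustrates both on in-memory lists. `PartitionSketch` is a hypothetical class, and the use of `hashCode()` modulo the parallelism and of round-robin assignment are assumptions of this example, not a statement about Flink's actual partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class PartitionSketch {
    private static <T> List<List<T>> emptyPartitions(int parallelism) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new ArrayList<>());
        }
        return partitions;
    }

    // Hash partitioning: elements with equal keys always land in the same partition.
    static <T, K> List<List<T>> partitionByHash(
            List<T> elements, Function<T, K> keySelector, int parallelism) {
        List<List<T>> partitions = emptyPartitions(parallelism);
        for (T element : elements) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int target = Math.floorMod(keySelector.apply(element).hashCode(), parallelism);
            partitions.get(target).add(element);
        }
        return partitions;
    }

    // Rebalancing: round-robin assignment, so partition sizes differ by at most one.
    static <T> List<List<T>> rebalance(List<T> elements, int parallelism) {
        List<List<T>> partitions = emptyPartitions(parallelism);
        for (int i = 0; i < elements.size(); i++) {
            partitions.get(i % parallelism).add(elements.get(i));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(partitionByHash(values, v -> v % 2, 2)); // keyed by parity
        System.out.println(rebalance(values, 4));
    }
}
```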
public DataSink<T> writeAsText(String filePath)
Writes a DataSet as a text file to the specified location.
For each element of the DataSet the result of Object.toString() is written.
filePath - The path pointing to the location the text file is written to.
See Also: TextOutputFormat
public DataSink<T> writeAsText(String filePath, FileSystem.WriteMode writeMode)
Writes a DataSet as a text file to the specified location.
For each element of the DataSet the result of Object.toString() is written.
filePath - The path pointing to the location the text file is written to.
writeMode - Controls the behavior for existing files. Options are NO_OVERWRITE and OVERWRITE.
See Also: TextOutputFormat
public DataSink<String> writeAsFormattedText(String filePath, TextOutputFormat.TextFormatter<T> formatter)
Writes a DataSet as a text file to the specified location.
For each element of the DataSet the result of TextOutputFormat.TextFormatter.format(Object) is written.
filePath - The path pointing to the location the text file is written to.
formatter - The formatter that is applied on every element of the DataSet.
See Also: TextOutputFormat
public DataSink<String> writeAsFormattedText(String filePath, FileSystem.WriteMode writeMode, TextOutputFormat.TextFormatter<T> formatter)
Writes a DataSet as a text file to the specified location.
For each element of the DataSet the result of TextOutputFormat.TextFormatter.format(Object) is written.
filePath - The path pointing to the location the text file is written to.
writeMode - Controls the behavior for existing files. Options are NO_OVERWRITE and OVERWRITE.
formatter - The formatter that is applied on every element of the DataSet.
See Also: TextOutputFormat
public DataSink<T> writeAsCsv(String filePath)
Writes a Tuple DataSet as a CSV file to the specified location.
Note: Only a Tuple DataSet can be written as a CSV file.
For each Tuple field the result of Object.toString() is written. Tuple fields are separated by the default field delimiter "comma" (,). Tuples are separated by the newline character (\n).
filePath - The path pointing to the location the CSV file is written to.
See Also: Tuple, CsvOutputFormat
public DataSink<T> writeAsCsv(String filePath, FileSystem.WriteMode writeMode)
Writes a Tuple DataSet as a CSV file to the specified location.
Note: Only a Tuple DataSet can be written as a CSV file.
For each Tuple field the result of Object.toString() is written. Tuple fields are separated by the default field delimiter "comma" (,). Tuples are separated by the newline character (\n).
filePath - The path pointing to the location the CSV file is written to.
writeMode - The behavior regarding existing files. Options are NO_OVERWRITE and OVERWRITE.
See Also: Tuple, CsvOutputFormat
public DataSink<T> writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter)
Writes a Tuple DataSet as a CSV file to the specified location with the specified field and line delimiters.
Note: Only a Tuple DataSet can be written as a CSV file.
For each Tuple field the result of Object.toString() is written.
filePath - The path pointing to the location the CSV file is written to.
rowDelimiter - The row delimiter to separate Tuples.
fieldDelimiter - The field delimiter to separate Tuple fields.
See Also: Tuple, CsvOutputFormat
public DataSink<T> writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter, FileSystem.WriteMode writeMode)
Writes a Tuple DataSet as a CSV file to the specified location with the specified field and line delimiters.
Note: Only a Tuple DataSet can be written as a CSV file.
For each Tuple field the result of Object.toString() is written.
filePath - The path pointing to the location the CSV file is written to.
rowDelimiter - The row delimiter to separate Tuples.
fieldDelimiter - The field delimiter to separate Tuple fields.
writeMode - The behavior regarding existing files. Options are NO_OVERWRITE and OVERWRITE.
See Also: Tuple, CsvOutputFormat
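How the row and field delimiters shape the output can be sketched with a small string builder. In this illustration an `Object[]` stands in for a Tuple, and `CsvSketch.toCsv` is a hypothetical helper. It assumes that every row, including the last, is followed by the row delimiter, and it does no quoting, escaping, or null handling.

```java
import java.util.Arrays;
import java.util.List;

class CsvSketch {
    // Writes toString() of each field, separating fields and terminating rows with the delimiters.
    static String toCsv(List<Object[]> tuples, String rowDelimiter, String fieldDelimiter) {
        StringBuilder csv = new StringBuilder();
        for (Object[] tuple : tuples) {
            for (int i = 0; i < tuple.length; i++) {
                if (i > 0) {
                    csv.append(fieldDelimiter);
                }
                csv.append(tuple[i].toString());
            }
            csv.append(rowDelimiter);
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        List<Object[]> tuples = Arrays.asList(new Object[]{"a", 1}, new Object[]{"b", 2});
        System.out.print(toCsv(tuples, "\n", ","));  // default-style delimiters
        System.out.print(toCsv(tuples, ";\n", "|")); // custom delimiters
    }
}
```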
public DataSink<T> print()
Writes a DataSet to the standard output stream (stdout).
For each element of the DataSet the result of Object.toString() is written.
public DataSink<T> printToErr()
Writes a DataSet to the standard error stream (stderr).
For each element of the DataSet the result of Object.toString() is written.
public DataSink<T> write(FileOutputFormat<T> outputFormat, String filePath)
Writes a DataSet using a FileOutputFormat to a specified location.
This method adds a data sink to the program.
outputFormat - The FileOutputFormat to write the DataSet.
filePath - The path to the location where the DataSet is written.
See Also: FileOutputFormat
public DataSink<T> write(FileOutputFormat<T> outputFormat, String filePath, FileSystem.WriteMode writeMode)
Writes a DataSet using a FileOutputFormat to a specified location.
This method adds a data sink to the program.
outputFormat - The FileOutputFormat to write the DataSet.
filePath - The path to the location where the DataSet is written.
writeMode - The mode of writing, indicating whether to overwrite existing files.
See Also: FileOutputFormat
public DataSink<T> output(OutputFormat<T> outputFormat)
Emits a DataSet using an OutputFormat. This method adds a data sink to the program.
Programs may have multiple data sinks. A DataSet may also have multiple consumers (data sinks or transformations) at the same time.
outputFormat - The OutputFormat to process the DataSet.
See Also: OutputFormat, DataSink
Copyright © 2014 The Apache Software Foundation. All rights reserved.