public abstract class ExecutionEnvironment extends Object

The ExecutionEnvironment is the context in which a program is executed. A LocalEnvironment will cause execution in the current JVM, a RemoteEnvironment will cause execution on a remote setup.
The environment provides methods to control the job execution (such as setting the parallelism) and to interact with the outside world (data access).
Please note that the execution environment needs strong type information for the input and return types of all operations that are executed. This means that the environment needs to know that the return value of an operation is, for example, a Tuple of String and Integer. Because the Java compiler throws away much of the generic type information, most methods attempt to re-obtain that information using reflection. In certain cases, it may be necessary to manually supply that information to some of the methods.
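The sketch below shows how a program is typically assembled around this class. It is a minimal illustration rather than code from the documentation: the input and output paths and the job name are placeholders, and the imports assume the usual package locations of the Java DataSet API.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class LineLengths {
    public static void main(String[] args) throws Exception {
        // Picks a LocalEnvironment when run standalone, or the cluster's
        // environment when submitted via the command line client.
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Describe the data flow: read lines, map each line to its length.
        DataSet<String> lines = env.readTextFile("file:///tmp/input.txt");
        DataSet<Integer> lengths = lines.map(new MapFunction<String, Integer>() {
            @Override
            public Integer map(String line) {
                return line.length();
            }
        });

        // Define a sink and trigger execution of the whole data flow.
        lengths.writeAsText("file:///tmp/lengths");
        env.execute("Line lengths example");
    }
}
```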
See Also:
LocalEnvironment, RemoteEnvironment
Modifier | Constructor and Description
---|---
protected | ExecutionEnvironment() - Creates a new Execution Environment.
Modifier and Type | Method and Description
---|---
<X> DataSource<X> | createInput(InputFormat<X,?> inputFormat) - Generic method to create an input DataSet from an InputFormat.
<X> DataSource<X> | createInput(InputFormat<X,?> inputFormat, TypeInformation<X> producedType) - Generic method to create an input DataSet from an InputFormat.
static LocalEnvironment | createLocalEnvironment() - Creates a LocalEnvironment.
static LocalEnvironment | createLocalEnvironment(int degreeOfParallelism) - Creates a LocalEnvironment.
JavaPlan | createProgramPlan() - Creates the program's Plan.
JavaPlan | createProgramPlan(String jobName) - Creates the program's Plan.
JavaPlan | createProgramPlan(String jobName, boolean clearSinks) - Creates the program's Plan.
static ExecutionEnvironment | createRemoteEnvironment(String host, int port, int degreeOfParallelism, String... jarFiles) - Creates a RemoteEnvironment.
static ExecutionEnvironment | createRemoteEnvironment(String host, int port, String... jarFiles) - Creates a RemoteEnvironment.
protected static void | enableLocalExecution(boolean enabled)
JobExecutionResult | execute() - Triggers the program execution.
abstract JobExecutionResult | execute(String jobName) - Triggers the program execution.
<X> DataSource<X> | fromCollection(Collection<X> data) - Creates a DataSet from the given non-empty collection.
<X> DataSource<X> | fromCollection(Collection<X> data, TypeInformation<X> type) - Creates a DataSet from the given non-empty collection.
<X> DataSource<X> | fromCollection(Iterator<X> data, Class<X> type) - Creates a DataSet from the given iterator.
<X> DataSource<X> | fromCollection(Iterator<X> data, TypeInformation<X> type) - Creates a DataSet from the given iterator.
<X> DataSource<X> | fromElements(X... data) - Creates a new data set that contains the given elements.
<X> DataSource<X> | fromParallelCollection(SplittableIterator<X> iterator, Class<X> type) - Creates a new data set that contains the elements in the iterator.
<X> DataSource<X> | fromParallelCollection(SplittableIterator<X> iterator, TypeInformation<X> type) - Creates a new data set that contains the elements in the iterator.
DataSource<Long> | generateSequence(long from, long to) - Creates a new data set that contains a sequence of numbers.
ExecutionConfig | getConfig() - Gets the config object.
int | getDegreeOfParallelism() - Gets the degree of parallelism with which operations are executed by default.
static ExecutionEnvironment | getExecutionEnvironment() - Creates an execution environment that represents the context in which the program is currently executed.
abstract String | getExecutionPlan() - Creates the plan with which the system will execute the program, and returns it as a String using a JSON representation of the execution data flow graph.
UUID | getId() - Gets the UUID by which this environment is identified.
String | getIdString() - Gets the UUID by which this environment is identified, as a string.
int | getNumberOfExecutionRetries() - Gets the number of times the system will try to re-execute failed tasks.
protected static void | initializeContextEnvironment(ExecutionEnvironmentFactory ctx)
protected static boolean | isContextEnvironmentSet()
static boolean | localExecutionIsAllowed()
CsvReader | readCsvFile(String filePath) - Creates a CSV reader to read a comma separated value (CSV) file.
<X> DataSource<X> | readFile(FileInputFormat<X> inputFormat, String filePath)
<X> DataSource<X> | readFileOfPrimitives(String filePath, Class<X> typeClass) - Creates a DataSet that represents the primitive type produced by reading the given file line-wise.
<X> DataSource<X> | readFileOfPrimitives(String filePath, String delimiter, Class<X> typeClass) - Creates a DataSet that represents the primitive type produced by reading the given file in a delimited way.
DataSource<String> | readTextFile(String filePath) - Creates a DataSet that represents the Strings produced by reading the given file line-wise.
DataSource<String> | readTextFile(String filePath, String charsetName) - Creates a DataSet that represents the Strings produced by reading the given file line-wise.
DataSource<StringValue> | readTextFileWithValue(String filePath) - Creates a DataSet that represents the Strings produced by reading the given file line-wise.
DataSource<StringValue> | readTextFileWithValue(String filePath, String charsetName, boolean skipInvalidLines) - Creates a DataSet that represents the Strings produced by reading the given file line-wise.
void | registerCachedFile(String filePath, String name) - Registers a file at the distributed cache under the given name.
void | registerCachedFile(String filePath, String name, boolean executable) - Registers a file at the distributed cache under the given name.
protected void | registerCachedFilesWithPlan(Plan p) - Registers all files that were registered at this execution environment's cache registry with the given plan's cache registry.
void | setConfig(ExecutionConfig config) - Sets the config object.
static void | setDefaultLocalParallelism(int degreeOfParallelism) - Sets the default parallelism that will be used for the local execution environment created by createLocalEnvironment().
void | setDegreeOfParallelism(int degreeOfParallelism) - Sets the degree of parallelism (DOP) for operations executed through this environment.
void | setNumberOfExecutionRetries(int numberOfExecutionRetries) - Sets the number of times that failed tasks are re-executed.
protected ExecutionEnvironment()
Creates a new Execution Environment.

public void setConfig(ExecutionConfig config)
Sets the config object.

public ExecutionConfig getConfig()
Gets the config object.
public int getDegreeOfParallelism()
Gets the degree of parallelism with which operations are executed by default. Individual operations can override this value via Operator.setParallelism(int). Other operations may need to run with a different degree of parallelism - for example, calling DataSet.reduce(org.apache.flink.api.common.functions.ReduceFunction) over the entire set will eventually insert an operation that runs non-parallel (degree of parallelism of one).
Returns:
The default degree of parallelism, or -1 if the environment's default parallelism should be used.

public void setDegreeOfParallelism(int degreeOfParallelism)
Sets the degree of parallelism (DOP) for operations executed through this environment. This method overrides the default parallelism for this environment. The LocalEnvironment uses by default a value equal to the number of hardware contexts (CPU cores / threads). When executing the program via the command line client from a JAR file, the default degree of parallelism is the one configured for that setup.
Parameters:
degreeOfParallelism - The degree of parallelism.
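As a hedged illustration (the parallelism value and paths are placeholders, and env is an ExecutionEnvironment obtained as in the class overview above):

```java
// Override the environment's default parallelism; operators now run
// with 8 parallel instances unless they set their own parallelism.
env.setDegreeOfParallelism(8);

DataSet<String> lines = env.readTextFile("hdfs://host:port/input");
lines.writeAsText("hdfs://host:port/output");
env.execute("Job with explicit degree of parallelism");
```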
public void setNumberOfExecutionRetries(int numberOfExecutionRetries)
Sets the number of times that failed tasks are re-executed. A value of -1 indicates that the system default value (as defined in the configuration) should be used.
Parameters:
numberOfExecutionRetries - The number of times the system will try to re-execute failed tasks.

public int getNumberOfExecutionRetries()
Gets the number of times the system will try to re-execute failed tasks. A value of -1 indicates that the system default value (as defined in the configuration) should be used.

public UUID getId()
Gets the UUID by which this environment is identified.
See Also:
getIdString()

public String getIdString()
Gets the UUID by which this environment is identified, as a string.
See Also:
getId()
public DataSource<String> readTextFile(String filePath)
Creates a DataSet that represents the Strings produced by reading the given file line-wise. The file will be read with the system's default character set.
Parameters:
filePath - The path of the file, as a URI (e.g., "file:///some/local/file" or "hdfs://host:port/file/path").
public DataSource<String> readTextFile(String filePath, String charsetName)
Creates a DataSet that represents the Strings produced by reading the given file line-wise. The Charset with the given name will be used to read the files.
Parameters:
filePath - The path of the file, as a URI (e.g., "file:///some/local/file" or "hdfs://host:port/file/path").
charsetName - The name of the character set used to read the file.
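A small sketch of both overloads, assuming an existing environment env; the paths and the charset name are placeholders:

```java
// Read with the system's default character set.
DataSet<String> localLines = env.readTextFile("file:///some/local/file");

// Read with an explicitly named character set.
DataSet<String> utf8Lines = env.readTextFile("hdfs://host:port/file/path", "UTF-8");
```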
public DataSource<StringValue> readTextFileWithValue(String filePath)
Creates a DataSet that represents the Strings produced by reading the given file line-wise. This method is similar to readTextFile(String), but it produces a DataSet with mutable StringValue objects, rather than Java Strings. StringValues can be used to tune implementations to be less object and garbage collection heavy.
The file will be read with the system's default character set.
Parameters:
filePath - The path of the file, as a URI (e.g., "file:///some/local/file" or "hdfs://host:port/file/path").
public DataSource<StringValue> readTextFileWithValue(String filePath, String charsetName, boolean skipInvalidLines)
Creates a DataSet that represents the Strings produced by reading the given file line-wise. This method is similar to readTextFile(String, String), but it produces a DataSet with mutable StringValue objects, rather than Java Strings. StringValues can be used to tune implementations to be less object and garbage collection heavy.
The Charset with the given name will be used to read the files.
Parameters:
filePath - The path of the file, as a URI (e.g., "file:///some/local/file" or "hdfs://host:port/file/path").
charsetName - The name of the character set used to read the file.
skipInvalidLines - A flag to indicate whether to skip lines that cannot be read with the given character set.
public <X> DataSource<X> readFileOfPrimitives(String filePath, Class<X> typeClass)
Creates a DataSet that represents the primitive type produced by reading the given file line-wise. This method is similar to readCsvFile(String) with a single field, but it produces the values directly rather than wrapped in Tuple1.
Parameters:
filePath - The path of the file, as a URI (e.g., "file:///some/local/file" or "hdfs://host:port/file/path").
typeClass - The primitive type class to be read.
public <X> DataSource<X> readFileOfPrimitives(String filePath, String delimiter, Class<X> typeClass)
Creates a DataSet that represents the primitive type produced by reading the given file in a delimited way. This method is similar to readCsvFile(String) with a single field, but it produces the values directly rather than wrapped in Tuple1.
Parameters:
filePath - The path of the file, as a URI (e.g., "file:///some/local/file" or "hdfs://host:port/file/path").
delimiter - The delimiter of the given file.
typeClass - The primitive type class to be read.
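A minimal sketch of both overloads, assuming an existing environment env; the paths, delimiter, and element types are placeholders:

```java
// One value per line, parsed as Long.
DataSet<Long> ids = env.readFileOfPrimitives("hdfs://host:port/ids.txt", Long.class);

// Values separated by a custom delimiter, parsed as Double.
DataSet<Double> samples = env.readFileOfPrimitives("file:///tmp/samples.txt", ",", Double.class);
```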
public CsvReader readCsvFile(String filePath)
Creates a CSV reader to read a comma separated value (CSV) file.
Parameters:
filePath - The path of the CSV file.
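A sketch of the usual CsvReader flow, assuming an existing environment env; the path, delimiter, and field types are placeholders:

```java
// Configure the reader, then fix the record type to obtain the DataSet.
DataSet<Tuple2<String, Integer>> wordCounts = env
        .readCsvFile("hdfs://host:port/word-counts.csv")
        .fieldDelimiter(',')
        .types(String.class, Integer.class);
```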
public <X> DataSource<X> readFile(FileInputFormat<X> inputFormat, String filePath)

public <X> DataSource<X> createInput(InputFormat<X,?> inputFormat)
Generic method to create an input DataSet from an InputFormat. The DataSet will not be immediately created - instead, this method returns a DataSet that will be lazily created from the input format once the program is executed.
Since all data sets need specific information about their types, this method needs to determine the type of the data produced by the input format. It will attempt to determine the data type by reflection, unless the input format implements the ResultTypeQueryable interface. In the latter case, this method will invoke the ResultTypeQueryable.getProducedType() method to determine the data type produced by the input format.
Parameters:
inputFormat - The input format used to create the data set.
See Also:
createInput(InputFormat, TypeInformation)
public <X> DataSource<X> createInput(InputFormat<X,?> inputFormat, TypeInformation<X> producedType)
InputFormat
. The DataSet will not be
immediately created - instead, this method returns a DataSet that will be lazily created from
the input format once the program is executed.
The data set is typed to the given TypeInformation. This method is intended for input formats that
where the return type cannot be determined by reflection analysis, and that do not implement the
ResultTypeQueryable
interface.
inputFormat
- The input format used to create the data set.createInput(InputFormat)
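A minimal sketch, assuming an existing environment env and the TextInputFormat shipped with the Java API (package locations may differ between versions); the path is a placeholder:

```java
// The data set is created lazily from the input format when the program runs.
TextInputFormat format = new TextInputFormat(new Path("hdfs://host:port/input"));
DataSet<String> text = env.createInput(format);
```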
public <X> DataSource<X> fromCollection(Collection<X> data)
Creates a DataSet from the given non-empty collection. The elements must be serializable (as defined in Serializable), because the framework may move the elements into the cluster if needed.
The framework will try and determine the exact type from the collection elements. In case of generic elements, it may be necessary to manually supply the type information via fromCollection(Collection, TypeInformation).
Note that this operation will result in a non-parallel data source, i.e. a data source with a degree of parallelism of one.
Parameters:
data - The collection of elements to create the data set from.
See Also:
fromCollection(Collection, TypeInformation)
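A minimal sketch, assuming an existing environment env; BasicTypeInfo.STRING_TYPE_INFO is shown as one way to supply explicit type information (its package location varies between Flink versions):

```java
List<String> words = Arrays.asList("flink", "hadoop", "storm");

// Element type determined by reflection from the collection elements.
DataSet<String> wordSet = env.fromCollection(words);

// Element type supplied explicitly, e.g. for generic element types.
DataSet<String> typedWordSet = env.fromCollection(words, BasicTypeInfo.STRING_TYPE_INFO);
```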
public <X> DataSource<X> fromCollection(Collection<X> data, TypeInformation<X> type)
Creates a DataSet from the given non-empty collection. The elements must be serializable (as defined in Serializable), because the framework may move the elements into the cluster if needed.
Note that this operation will result in a non-parallel data source, i.e. a data source with a degree of parallelism of one.
The returned DataSet is typed to the given TypeInformation.
Parameters:
data - The collection of elements to create the data set from.
type - The TypeInformation for the produced data set.
See Also:
fromCollection(Collection)
public <X> DataSource<X> fromCollection(Iterator<X> data, Class<X> type)
Creates a DataSet from the given iterator. The iterator must be serializable (as defined in Serializable), because the framework may move it to a remote environment, if needed.
Note that this operation will result in a non-parallel data source, i.e. a data source with a degree of parallelism of one.
Parameters:
data - The collection of elements to create the data set from.
type - The class of the data produced by the iterator. Must not be a generic class.
See Also:
fromCollection(Iterator, TypeInformation)
public <X> DataSource<X> fromCollection(Iterator<X> data, TypeInformation<X> type)
Creates a DataSet from the given iterator. This method is useful for cases where the type is generic; in that case, the type class (as given in fromCollection(Iterator, Class)) does not supply all type information.
The iterator must be serializable (as defined in Serializable), because the framework may move it to a remote environment, if needed.
Note that this operation will result in a non-parallel data source, i.e. a data source with a degree of parallelism of one.
Parameters:
data - The collection of elements to create the data set from.
type - The TypeInformation for the produced data set.
See Also:
fromCollection(Iterator, Class)
public <X> DataSource<X> fromElements(X... data)
Creates a new data set that contains the given elements. The elements must all be of the same type, for example all String or all Integer. The sequence of elements must not be empty. Furthermore, the elements must be serializable (as defined in Serializable), because the execution environment may ship the elements into the cluster.
The framework will try and determine the exact type from the collection elements. In case of generic elements, it may be necessary to manually supply the type information via fromCollection(Collection, TypeInformation).
Note that this operation will result in a non-parallel data source, i.e. a data source with a degree of parallelism of one.
Parameters:
data - The elements to make up the data set.
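A minimal sketch, assuming an existing environment env:

```java
// A small inline data set, convenient for tests and examples.
DataSet<Integer> numbers = env.fromElements(1, 2, 3, 5, 8, 13);
```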
public <X> DataSource<X> fromParallelCollection(SplittableIterator<X> iterator, Class<X> type)
Creates a new data set that contains the elements in the iterator. Because the iterator is splittable, the framework can create a parallel data source from it. The iterator must be serializable (as defined in Serializable), because the execution environment may ship the elements into the cluster.
Because the iterator will remain unmodified until the actual execution happens, the type of data returned by the iterator must be given explicitly in the form of the type class (this is due to the fact that the Java compiler erases the generic type information).
Parameters:
iterator - The iterator that produces the elements of the data set.
type - The class of the data produced by the iterator. Must not be a generic class.
See Also:
fromParallelCollection(SplittableIterator, TypeInformation)
public <X> DataSource<X> fromParallelCollection(SplittableIterator<X> iterator, TypeInformation<X> type)
Creates a new data set that contains the elements in the iterator. Because the iterator is splittable, the framework can create a parallel data source from it. The iterator must be serializable (as defined in Serializable), because the execution environment may ship the elements into the cluster.
Because the iterator will remain unmodified until the actual execution happens, the type of data returned by the iterator must be given explicitly in the form of the type information.
This method is useful for cases where the type is generic. In that case, the type class (as given in fromParallelCollection(SplittableIterator, Class)) does not supply all type information.
Parameters:
iterator - The iterator that produces the elements of the data set.
type - The TypeInformation for the produced data set.
See Also:
fromParallelCollection(SplittableIterator, Class)
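A hedged sketch, assuming an existing environment env and the NumberSequenceIterator from Flink's util package (a SplittableIterator over longs) as the example iterator:

```java
// A parallel source: the splittable iterator is divided among the parallel instances.
DataSet<Long> numbers =
        env.fromParallelCollection(new NumberSequenceIterator(1L, 1000000L), Long.class);
```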
public DataSource<Long> generateSequence(long from, long to)
Creates a new data set that contains a sequence of numbers.
Parameters:
from - The number to start at (inclusive).
to - The number to stop at (inclusive).
Returns:
A DataSet containing the numbers in the [from, to] interval.
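A one-line sketch, assuming an existing environment env:

```java
// The numbers 1 to 1000, both ends inclusive.
DataSet<Long> sequence = env.generateSequence(1, 1000);
```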
public JobExecutionResult execute() throws Exception
Triggers the program execution. The environment executes all parts of the program that have resulted in a "sink" operation. Sink operations are, for example, printing results (DataSet.print()), writing results (e.g. DataSet.writeAsText(String), DataSet.write(org.apache.flink.api.common.io.FileOutputFormat, String)), or other generic data sinks created with DataSet.output(org.apache.flink.api.common.io.OutputFormat).
The program execution will be logged and displayed with a generated default name.
Throws:
Exception - Thrown, if the program execution fails.

public abstract JobExecutionResult execute(String jobName) throws Exception
Triggers the program execution. The environment executes all parts of the program that have resulted in a "sink" operation. Sink operations are, for example, printing results (DataSet.print()), writing results (e.g. DataSet.writeAsText(String), DataSet.write(org.apache.flink.api.common.io.FileOutputFormat, String)), or other generic data sinks created with DataSet.output(org.apache.flink.api.common.io.OutputFormat).
The program execution will be logged and displayed with the given job name.
Throws:
Exception - Thrown, if the program execution fails.
public abstract String getExecutionPlan() throws Exception
Creates the plan with which the system will execute the program, and returns it as a String using a JSON representation of the execution data flow graph.
Throws:
Exception - Thrown, if the compiler could not be instantiated, or the master could not be contacted to retrieve information relevant to the execution planning.
public void registerCachedFile(String filePath, String name)
Registers a file at the distributed cache under the given name. The RuntimeContext can be obtained inside UDFs via RichFunction.getRuntimeContext() and provides access to the DistributedCache via RuntimeContext.getDistributedCache().
Parameters:
filePath - The path of the file, as a URI (e.g. "file:///some/path" or "hdfs://host:port/and/path").
name - The name under which the file is registered.

public void registerCachedFile(String filePath, String name, boolean executable)
Registers a file at the distributed cache under the given name. The RuntimeContext can be obtained inside UDFs via RichFunction.getRuntimeContext() and provides access to the DistributedCache via RuntimeContext.getDistributedCache().
Parameters:
filePath - The path of the file, as a URI (e.g. "file:///some/path" or "hdfs://host:port/and/path").
name - The name under which the file is registered.
executable - Flag indicating whether the file should be executable.
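A hedged sketch, assuming an existing environment env and a DataSet<String> named input; the path and the registration name are placeholders:

```java
// Register the file once on the environment...
env.registerCachedFile("hdfs://host:port/path/to/dictionary.txt", "dictionary");

// ...and retrieve it inside a rich user-defined function at runtime.
DataSet<String> result = input.map(new RichMapFunction<String, String>() {
    @Override
    public String map(String value) throws Exception {
        // The registered file is available as a local file on the worker.
        File dictionary = getRuntimeContext().getDistributedCache().getFile("dictionary");
        // ... use the local copy of the file here ...
        return value;
    }
});
```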
protected void registerCachedFilesWithPlan(Plan p) throws IOException
Registers all files that were registered at this execution environment's cache registry with the given plan's cache registry.
Parameters:
p - The plan to register files at.
Throws:
IOException - Thrown if checks for existence and sanity fail.

public JavaPlan createProgramPlan()
Creates the program's Plan. The plan is a description of all data sources, data sinks, and operations and how they interact, as an isolated unit that can be executed with a PlanExecutor. Obtaining a plan and starting it with an executor is an alternative way to run a program and is only possible if the program consists only of distributed operations.
This automatically starts a new stage of execution.

public JavaPlan createProgramPlan(String jobName)
Creates the program's Plan. The plan is a description of all data sources, data sinks, and operations and how they interact, as an isolated unit that can be executed with a PlanExecutor. Obtaining a plan and starting it with an executor is an alternative way to run a program and is only possible if the program consists only of distributed operations.
This automatically starts a new stage of execution.
Parameters:
jobName - The name attached to the plan (displayed in logs and monitoring).

public JavaPlan createProgramPlan(String jobName, boolean clearSinks)
Creates the program's Plan. The plan is a description of all data sources, data sinks, and operations and how they interact, as an isolated unit that can be executed with a PlanExecutor. Obtaining a plan and starting it with an executor is an alternative way to run a program and is only possible if the program consists only of distributed operations.
Parameters:
jobName - The name attached to the plan (displayed in logs and monitoring).
clearSinks - Whether or not to start a new stage of execution.

public static ExecutionEnvironment getExecutionEnvironment()
Creates an execution environment that represents the context in which the program is currently executed. If the program is invoked standalone, this method returns a local execution environment, as returned by createLocalEnvironment(). If the program is invoked from within the command line client to be submitted to a cluster, this method returns the execution environment of this cluster.

public static LocalEnvironment createLocalEnvironment()
Creates a LocalEnvironment. The local execution environment will run the program in a multi-threaded fashion in the same JVM as the environment was created in. The default degree of parallelism of the local environment is the number of hardware contexts (CPU cores / threads), unless it was specified differently by setDefaultLocalParallelism(int).

public static LocalEnvironment createLocalEnvironment(int degreeOfParallelism)
Creates a LocalEnvironment. The local execution environment will run the program in a multi-threaded fashion in the same JVM as the environment was created in. It will use the degree of parallelism specified in the parameter.
Parameters:
degreeOfParallelism - The degree of parallelism for the local environment.

public static ExecutionEnvironment createRemoteEnvironment(String host, int port, String... jarFiles)
Creates a RemoteEnvironment. The remote environment sends (parts of) the program to a cluster for execution. Note that all file paths used in the program must be accessible from the cluster. The execution will use the cluster's default degree of parallelism, unless the parallelism is set explicitly via setDegreeOfParallelism(int).
Parameters:
host - The host name or address of the master (JobManager), where the program should be executed.
port - The port of the master (JobManager), where the program should be executed.
jarFiles - The JAR files with code that needs to be shipped to the cluster. If the program uses user-defined functions, user-defined input formats, or any libraries, those must be provided in the JAR files.

public static ExecutionEnvironment createRemoteEnvironment(String host, int port, int degreeOfParallelism, String... jarFiles)
Creates a RemoteEnvironment. The remote environment sends (parts of) the program to a cluster for execution. Note that all file paths used in the program must be accessible from the cluster. The execution will use the specified degree of parallelism.
Parameters:
host - The host name or address of the master (JobManager), where the program should be executed.
port - The port of the master (JobManager), where the program should be executed.
degreeOfParallelism - The degree of parallelism to use during the execution.
jarFiles - The JAR files with code that needs to be shipped to the cluster. If the program uses user-defined functions, user-defined input formats, or any libraries, those must be provided in the JAR files.
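A short sketch of the three factory methods; the host, port, parallelism, and JAR path are placeholders:

```java
// Adapts to the context: local when run standalone, cluster when submitted.
ExecutionEnvironment contextEnv = ExecutionEnvironment.getExecutionEnvironment();

// Explicit local environment with a fixed degree of parallelism.
LocalEnvironment localEnv = ExecutionEnvironment.createLocalEnvironment(4);

// Remote environment pointing at a JobManager, shipping the program's JAR file.
ExecutionEnvironment remoteEnv = ExecutionEnvironment.createRemoteEnvironment(
        "jobmanager-host", 6123, 16, "/path/to/program.jar");
```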
public static void setDefaultLocalParallelism(int degreeOfParallelism)
Sets the default parallelism that will be used for the local execution environment created by createLocalEnvironment().
Parameters:
degreeOfParallelism - The degree of parallelism to use as the default local parallelism.

protected static void initializeContextEnvironment(ExecutionEnvironmentFactory ctx)
protected static boolean isContextEnvironmentSet()
protected static void enableLocalExecution(boolean enabled)
public static boolean localExecutionIsAllowed()
Copyright © 2015 The Apache Software Foundation. All rights reserved.