Class Execution
- java.lang.Object
- org.apache.flink.runtime.executiongraph.Execution
- All Implemented Interfaces: org.apache.flink.api.common.Archiveable<ArchivedExecution>, AccessExecution, LogicalSlot.Payload
public class Execution extends Object implements AccessExecution, org.apache.flink.api.common.Archiveable<ArchivedExecution>, LogicalSlot.Payload
A single execution of a vertex. While an ExecutionVertex can be executed multiple times (for recovery, re-computation, re-configuration), this class tracks the state of a single execution of that vertex and the resources.
Lock free state transitions
At several points in the code, we need to deal with possible concurrent state changes and actions. For example, while the call to deploy a task (send it to the TaskManager) happens, the task gets cancelled.
We could lock the entire portion of the code (decision to deploy, deploy, set state to running) such that it is guaranteed that any "cancel command" will only pick up after deployment is done and that the "cancel command" call will never overtake the deploying call.
This would block the threads for a long time, because the remote calls may take long. Depending on their locking behavior, it may even result in distributed deadlocks (unless carefully avoided). We therefore use atomic state updates and occasional double-checking to ensure that the state after a completed call is as expected, and trigger correcting actions if it is not. Many actions are also idempotent (like canceling).
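The lock-free pattern described above can be illustrated with a minimal sketch. It is not Flink's implementation: the class `LockFreeTaskSketch`, its reduced `State` enum, and the `Runnable` standing in for the remote deploy call are all hypothetical names chosen for illustration. The sketch shows an atomic compare-and-set transition, a double-check after the long-running call, a correcting action when a concurrent cancel won the race, and an idempotent cancel.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for illustration; not Flink's Execution class. */
final class LockFreeTaskSketch {

    /** Simplified state set, assumed for this sketch only. */
    enum State { CREATED, DEPLOYING, RUNNING, CANCELING, CANCELED }

    private final AtomicReference<State> state = new AtomicReference<>(State.CREATED);

    State getState() {
        return state.get();
    }

    /** Atomically moves from expected to target; false if the state changed in between. */
    boolean transition(State expected, State target) {
        return state.compareAndSet(expected, target);
    }

    /** Deploys without holding a lock across the (simulated) remote call. */
    boolean deploy(Runnable remoteDeployCall) {
        if (!transition(State.CREATED, State.DEPLOYING)) {
            return false; // another action, such as a cancel, got there first
        }
        remoteDeployCall.run(); // may take long; no lock is held meanwhile
        // Double-check: the state may have changed while the call was in flight.
        if (transition(State.DEPLOYING, State.RUNNING)) {
            return true;
        }
        // Correcting action: a concurrent cancel overtook us, so complete the cancellation.
        state.compareAndSet(State.CANCELING, State.CANCELED);
        return false;
    }

    /** Idempotent cancel: calling it again leaves the state unchanged. */
    void cancel() {
        while (true) {
            State current = state.get();
            if (current == State.CANCELING || current == State.CANCELED) {
                return; // already canceling or canceled
            }
            State next = (current == State.CREATED) ? State.CANCELED : State.CANCELING;
            if (transition(current, next)) {
                return;
            }
            // The compare-and-set lost a race; re-read the state and retry.
        }
    }
}
```

In this sketch a cancel that arrives during the deploy call does not wait for it. The deploying side notices the changed state afterwards and cleans up, which is the trade-off the description above makes in place of locking.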
-
-
Constructor Summary
Execution(Executor executor, ExecutionVertex vertex, int attemptNumber, long startTimestamp, Duration rpcTimeout)
Creates a new Execution attempt.
-
Method Summary
- ArchivedExecution archive()
- void cancel()
- static ResultPartitionDeploymentDescriptor createResultPartitionDeploymentDescriptor(IntermediateResultPartition partition, ShuffleDescriptor shuffleDescriptor)
- void deploy()
  Deploys the execution to the previously assigned resource.
- void fail(Throwable t)
  This method fails the vertex due to an external condition.
- AllocationID getAssignedAllocationID()
- LogicalSlot getAssignedResource()
- TaskManagerLocation getAssignedResourceLocation()
  Returns the TaskManagerLocation for this execution.
- ExecutionAttemptID getAttemptId()
  Returns the ExecutionAttemptID for this Execution.
- int getAttemptNumber()
  Returns the attempt number for this execution.
- Optional<ErrorInfo> getFailureInfo()
  Returns the exception that caused the job to fail.
- CompletableFuture<?> getInitializingOrRunningFuture()
  Gets a future that completes once the task execution reaches one of the states ExecutionState.INITIALIZING or ExecutionState.RUNNING.
- IOMetrics getIOMetrics()
- Optional<org.apache.flink.core.io.InputSplit> getNextInputSplit()
- int getParallelSubtaskIndex()
  Returns the subtask index of this execution.
- CompletableFuture<?> getReleaseFuture()
  Gets the release future which is completed once the execution reaches a terminal state and the assigned resource has been released.
- Optional<ResultPartitionDeploymentDescriptor> getResultPartitionDeploymentDescriptor(IntermediateResultPartitionID id)
- ExecutionState getState()
  Returns the current ExecutionState for this execution.
- long getStateEndTimestamp(ExecutionState state)
  Returns the end timestamp for the given ExecutionState.
- long[] getStateEndTimestamps()
  Returns the end timestamps for every ExecutionState.
- long getStateTimestamp(ExecutionState state)
  Returns the timestamp for the given ExecutionState.
- long[] getStateTimestamps()
  Returns the timestamps for every ExecutionState.
- CompletableFuture<TaskManagerLocation> getTaskManagerLocationFuture()
- JobManagerTaskRestore getTaskRestore()
- CompletableFuture<ExecutionState> getTerminalStateFuture()
  Gets a future that completes once the task execution reaches a terminal state.
- Map<String,org.apache.flink.api.common.accumulators.Accumulator<?,?>> getUserAccumulators()
- StringifiedAccumulatorResult[] getUserAccumulatorsStringified()
  Returns the user-defined accumulators as strings.
- ExecutionVertex getVertex()
- String getVertexWithAttempt()
- boolean isFinished()
- void markFailed(Throwable t)
  This method marks the task as failed, but will make no attempt to remove task execution from the task manager.
- void markFinished()
- void notifyCheckpointAborted(long abortCheckpointId, long latestCompletedCheckpointId, long timestamp)
  Notify the task of this execution about an aborted checkpoint.
- void notifyCheckpointOnComplete(long completedCheckpointId, long completedTimestamp, long lastSubsumedCheckpointId)
  Notify the task of this execution about a completed checkpoint and the last subsumed checkpoint id if possible.
- void recoverExecution(ExecutionAttemptID attemptId, TaskManagerLocation location, Map<String,org.apache.flink.api.common.accumulators.Accumulator<?,?>> userAccumulators, IOMetrics metrics)
  Recover the execution attempt status after JM failover.
- void recoverProducedPartitions(Map<IntermediateResultPartitionID,ResultPartitionDeploymentDescriptor> producedPartitions)
- CompletableFuture<Void> registerProducedPartitions(TaskManagerLocation location)
- CompletableFuture<Acknowledge> sendOperatorEvent(OperatorID operatorId, org.apache.flink.util.SerializedValue<OperatorEvent> event)
  Sends the operator event to the Task on the Task Executor.
- void setAccumulators(Map<String,org.apache.flink.api.common.accumulators.Accumulator<?,?>> userAccumulators)
  Update accumulators (discarded when the Execution has already been terminated).
- void setInitialState(JobManagerTaskRestore taskRestore)
  Sets the initial state for the execution.
- CompletableFuture<?> suspend()
- String toString()
- void transitionState(ExecutionState targetState)
- CompletableFuture<Acknowledge> triggerCheckpoint(long checkpointId, long timestamp, CheckpointOptions checkpointOptions)
  Trigger a new checkpoint on the task of this execution.
- CompletableFuture<Acknowledge> triggerSynchronousSavepoint(long checkpointId, long timestamp, CheckpointOptions checkpointOptions)
  Trigger a synchronous savepoint on the task of this execution.
- boolean tryAssignResource(LogicalSlot logicalSlot)
  Tries to assign the given slot to the execution.
-
-
-
Constructor Detail
-
Execution
public Execution(Executor executor, ExecutionVertex vertex, int attemptNumber, long startTimestamp, Duration rpcTimeout)
Creates a new Execution attempt.
- Parameters:
executor - The executor used to dispatch callbacks from futures and asynchronous RPC calls.
vertex - The execution vertex to which this Execution belongs.
attemptNumber - The execution attempt number.
startTimestamp - The timestamp that marks the creation of this Execution.
rpcTimeout - The timeout for RPC calls like deploy/cancel/stop.
-
-
Method Detail
-
getVertex
public ExecutionVertex getVertex()
-
getAttemptId
public ExecutionAttemptID getAttemptId()
Description copied from interface: AccessExecution
Returns the ExecutionAttemptID for this Execution.
- Specified by: getAttemptId in interface AccessExecution
- Returns: ExecutionAttemptID for this execution
-
getAttemptNumber
public int getAttemptNumber()
Description copied from interface: AccessExecution
Returns the attempt number for this execution.
- Specified by: getAttemptNumber in interface AccessExecution
- Returns: attempt number for this execution.
-
getState
public ExecutionState getState()
Description copied from interface: AccessExecution
Returns the current ExecutionState for this execution.
- Specified by: getState in interface AccessExecution
- Returns: execution state for this execution
-
getAssignedAllocationID
@Nullable public AllocationID getAssignedAllocationID()
-
getTaskManagerLocationFuture
public CompletableFuture<TaskManagerLocation> getTaskManagerLocationFuture()
-
getAssignedResource
public LogicalSlot getAssignedResource()
-
getResultPartitionDeploymentDescriptor
public Optional<ResultPartitionDeploymentDescriptor> getResultPartitionDeploymentDescriptor(IntermediateResultPartitionID id)
-
tryAssignResource
public boolean tryAssignResource(LogicalSlot logicalSlot)
Tries to assign the given slot to the execution. The assignment works only if the Execution is in state SCHEDULED. Returns true if the resource could be assigned.
- Parameters:
logicalSlot - the slot to assign to this execution
- Returns: true if the slot could be assigned to the execution, otherwise false
-
getNextInputSplit
public Optional<org.apache.flink.core.io.InputSplit> getNextInputSplit()
-
getAssignedResourceLocation
public TaskManagerLocation getAssignedResourceLocation()
Description copied from interface: AccessExecution
Returns the TaskManagerLocation for this execution.
- Specified by: getAssignedResourceLocation in interface AccessExecution
- Returns: taskmanager location for this execution.
-
getFailureInfo
public Optional<ErrorInfo> getFailureInfo()
Description copied from interface: AccessExecution
Returns the exception that caused the job to fail. This is the first root exception that was not recoverable and triggered job failure.
- Specified by: getFailureInfo in interface AccessExecution
- Returns: an Optional of ErrorInfo containing the Throwable and the time it was registered if an error occurred. If no error occurred, an empty Optional will be returned.
-
getStateTimestamps
public long[] getStateTimestamps()
Description copied from interface: AccessExecution
Returns the timestamps for every ExecutionState.
- Specified by: getStateTimestamps in interface AccessExecution
- Returns: timestamps for each state
-
getStateEndTimestamps
public long[] getStateEndTimestamps()
Description copied from interface: AccessExecution
Returns the end timestamps for every ExecutionState.
- Specified by: getStateEndTimestamps in interface AccessExecution
- Returns: end timestamps for each state
-
getStateTimestamp
public long getStateTimestamp(ExecutionState state)
Description copied from interface: AccessExecution
Returns the timestamp for the given ExecutionState.
- Specified by: getStateTimestamp in interface AccessExecution
- Parameters:
state - state for which the timestamp should be returned
- Returns: timestamp for the given state
-
getStateEndTimestamp
public long getStateEndTimestamp(ExecutionState state)
Description copied from interface: AccessExecution
Returns the end timestamp for the given ExecutionState.
- Specified by: getStateEndTimestamp in interface AccessExecution
- Parameters:
state - state for which the end timestamp should be returned
- Returns: end timestamp for the given state
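As a rough illustration of how the two timestamp arrays might be combined, the sketch below computes the time spent in one state. It rests on assumptions that this page does not state: that both arrays are indexed by the state's enum ordinal and that a value of 0 means the state was not entered or not left. The class `StateDurationSketch` and its reduced `State` enum are hypothetical.

```java
import java.util.OptionalLong;

/** Hypothetical helper for illustration; not part of Flink. */
final class StateDurationSketch {

    /** Reduced state set, assumed for this sketch only. */
    enum State { CREATED, SCHEDULED, DEPLOYING, RUNNING, FINISHED }

    /**
     * Returns the time spent in the given state, or empty if the state
     * was not entered or not left yet. Assumes ordinal-indexed arrays
     * in which 0 marks a missing timestamp.
     */
    static OptionalLong timeInState(long[] startTimestamps, long[] endTimestamps, State state) {
        long start = startTimestamps[state.ordinal()];
        long end = endTimestamps[state.ordinal()];
        if (start == 0L || end == 0L) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(end - start);
    }
}
```

With real data, the arrays would come from getStateTimestamps() and getStateEndTimestamps(), and the indexing assumption should be checked against the Flink version in use.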
-
isFinished
public boolean isFinished()
-
getTaskRestore
@Nullable public JobManagerTaskRestore getTaskRestore()
-
setInitialState
public void setInitialState(JobManagerTaskRestore taskRestore)
Sets the initial state for the execution. The serialized state is then shipped via the TaskDeploymentDescriptor to the TaskManagers.
- Parameters:
taskRestore - information to restore the state
-
getInitializingOrRunningFuture
public CompletableFuture<?> getInitializingOrRunningFuture()
Gets a future that completes once the task execution reaches one of the states ExecutionState.INITIALIZING or ExecutionState.RUNNING. If this task never reaches these states (for example because the task is cancelled before it was properly deployed and restored), then this future will never complete.
The future is completed already in the ExecutionState.INITIALIZING state, because various running actions are already possible in that state (the task already accepts and sends events and network data for task recovery). (Note that in earlier versions, the INITIALIZING state was not separate but part of the RUNNING state.)
This future is always completed from the job master's main thread.
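Because this future may never complete, a caller that waits on it alone could wait forever. One possible way to guard against that, sketched here with plain `CompletableFuture` objects and no Flink types, is to race it against a second future that completes on termination. The class `StartOrTerminalSketch` and the method `startedOrTerminated` are hypothetical names; both input futures are assumed to complete normally rather than exceptionally.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical helper for illustration; not part of Flink. */
final class StartOrTerminalSketch {

    /**
     * Completes with true if the "initializing or running" future finishes
     * first, and with false if the terminal-state future finishes first.
     * Whichever callback fires first decides the result; the later
     * complete() call has no effect.
     */
    static CompletableFuture<Boolean> startedOrTerminated(
            CompletableFuture<?> initializingOrRunning, CompletableFuture<?> terminal) {
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        initializingOrRunning.thenRun(() -> result.complete(true));
        terminal.thenRun(() -> result.complete(false));
        return result;
    }
}
```

In a real caller, the two arguments would presumably be the results of getInitializingOrRunningFuture() and getTerminalStateFuture(); that pairing is a suggestion, not something this page prescribes.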
-
getTerminalStateFuture
public CompletableFuture<ExecutionState> getTerminalStateFuture()
Gets a future that completes once the task execution reaches a terminal state. The future will be completed with the specific state that the execution reached. This future is always completed from the job master's main thread.
- Specified by: getTerminalStateFuture in interface LogicalSlot.Payload
- Returns: A future which is completed once the execution reaches a terminal state
-
getReleaseFuture
public CompletableFuture<?> getReleaseFuture()
Gets the release future which is completed once the execution reaches a terminal state and the assigned resource has been released. This future is always completed from the job master's main thread.
- Returns: A future which is completed once the assigned resource has been released
-
registerProducedPartitions
public CompletableFuture<Void> registerProducedPartitions(TaskManagerLocation location)
-
recoverExecution
public void recoverExecution(ExecutionAttemptID attemptId, TaskManagerLocation location, Map<String,org.apache.flink.api.common.accumulators.Accumulator<?,?>> userAccumulators, IOMetrics metrics)
Recover the execution attempt status after JM failover.
-
recoverProducedPartitions
public void recoverProducedPartitions(Map<IntermediateResultPartitionID,ResultPartitionDeploymentDescriptor> producedPartitions)
-
createResultPartitionDeploymentDescriptor
public static ResultPartitionDeploymentDescriptor createResultPartitionDeploymentDescriptor(IntermediateResultPartition partition, ShuffleDescriptor shuffleDescriptor)
-
deploy
public void deploy() throws JobException
Deploys the execution to the previously assigned resource.
- Throws:
JobException - if the execution cannot be deployed to the assigned resource
-
cancel
public void cancel()
-
suspend
public CompletableFuture<?> suspend()
-
fail
public void fail(Throwable t)
This method fails the vertex due to an external condition. The task will move to state FAILED. If the task was in state RUNNING or DEPLOYING before, it will send a cancel call to the TaskManager.
- Specified by: fail in interface LogicalSlot.Payload
- Parameters:
t - The exception that caused the task to fail.
-
notifyCheckpointOnComplete
public void notifyCheckpointOnComplete(long completedCheckpointId, long completedTimestamp, long lastSubsumedCheckpointId)
Notify the task of this execution about a completed checkpoint and the last subsumed checkpoint id if possible.
- Parameters:
completedCheckpointId - ID of the completed checkpoint
completedTimestamp - timestamp of the completed checkpoint
lastSubsumedCheckpointId - ID of the last subsumed checkpoint; a value of CheckpointStoreUtil.INVALID_CHECKPOINT_ID means no checkpoint has been subsumed.
-
notifyCheckpointAborted
public void notifyCheckpointAborted(long abortCheckpointId, long latestCompletedCheckpointId, long timestamp)
Notify the task of this execution about an aborted checkpoint.
- Parameters:
abortCheckpointId - ID of the aborted checkpoint
latestCompletedCheckpointId - ID of the latest completed checkpoint
timestamp - timestamp of the aborted checkpoint
-
triggerCheckpoint
public CompletableFuture<Acknowledge> triggerCheckpoint(long checkpointId, long timestamp, CheckpointOptions checkpointOptions)
Trigger a new checkpoint on the task of this execution.
- Parameters:
checkpointId - ID of the checkpoint to trigger
timestamp - timestamp of the checkpoint to trigger
checkpointOptions - options of the checkpoint to trigger
- Returns: Future acknowledge which is returned once the checkpoint has been triggered
-
triggerSynchronousSavepoint
public CompletableFuture<Acknowledge> triggerSynchronousSavepoint(long checkpointId, long timestamp, CheckpointOptions checkpointOptions)
Trigger a synchronous savepoint on the task of this execution.
- Parameters:
checkpointId - ID of the checkpoint to trigger
timestamp - timestamp of the checkpoint to trigger
checkpointOptions - options of the checkpoint to trigger
- Returns: Future acknowledge which is returned once the checkpoint has been triggered
-
sendOperatorEvent
public CompletableFuture<Acknowledge> sendOperatorEvent(OperatorID operatorId, org.apache.flink.util.SerializedValue<OperatorEvent> event)
Sends the operator event to the Task on the Task Executor.
- Returns: A future that is acknowledged once the event has been sent; the send fails if the task is currently not running.
-
markFailed
public void markFailed(Throwable t)
This method marks the task as failed, but will make no attempt to remove task execution from the task manager. It is intended for cases where the task is known not to be running, or when the TaskManager reports failure (in which case it has already removed the task).
- Parameters:
t - The exception that caused the task to fail.
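The difference between fail(Throwable) and markFailed(Throwable) can be sketched with a small model. It is only an illustration of the documented behavior, not Flink's code: the class `FailureHandlingSketch`, the reduced `State` enum, and the counter standing in for cancel calls to the TaskManager are hypothetical, and having fail delegate to markFailed is a choice made for this sketch.

```java
/** Hypothetical model for illustration; not Flink's Execution class. */
final class FailureHandlingSketch {

    /** Reduced state set, assumed for this sketch only. */
    enum State { CREATED, SCHEDULED, DEPLOYING, RUNNING, FAILED }

    private State state;
    private Throwable failureCause;
    private int cancelCallsSent; // counts simulated cancel calls to the TaskManager

    FailureHandlingSketch(State initialState) {
        this.state = initialState;
    }

    State getState() {
        return state;
    }

    Throwable getFailureCause() {
        return failureCause;
    }

    int getCancelCallsSent() {
        return cancelCallsSent;
    }

    /** Fails the task and asks the TaskManager to cancel it if it may be running there. */
    void fail(Throwable cause) {
        State previous = state;
        markFailed(cause);
        if (previous == State.RUNNING || previous == State.DEPLOYING) {
            cancelCallsSent++; // stands in for the cancel call
        }
    }

    /** Only records the failure; the task is assumed to be gone from the TaskManager already. */
    void markFailed(Throwable cause) {
        state = State.FAILED;
        failureCause = cause;
    }
}
```

The sketch leaves out handling of executions that are already in a terminal state, as well as any concurrency concerns.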
-
markFinished
@VisibleForTesting public void markFinished()
-
transitionState
public void transitionState(ExecutionState targetState)
-
getVertexWithAttempt
public String getVertexWithAttempt()
-
setAccumulators
public void setAccumulators(Map<String,org.apache.flink.api.common.accumulators.Accumulator<?,?>> userAccumulators)
Update accumulators (discarded when the Execution has already been terminated).
- Parameters:
userAccumulators - the user accumulators
-
getUserAccumulators
public Map<String,org.apache.flink.api.common.accumulators.Accumulator<?,?>> getUserAccumulators()
-
getUserAccumulatorsStringified
public StringifiedAccumulatorResult[] getUserAccumulatorsStringified()
Description copied from interface: AccessExecution
Returns the user-defined accumulators as strings.
- Specified by: getUserAccumulatorsStringified in interface AccessExecution
- Returns: user-defined accumulators as strings.
-
getParallelSubtaskIndex
public int getParallelSubtaskIndex()
Description copied from interface: AccessExecution
Returns the subtask index of this execution.
- Specified by: getParallelSubtaskIndex in interface AccessExecution
- Returns: subtask index of this execution.
-
getIOMetrics
public IOMetrics getIOMetrics()
- Specified by: getIOMetrics in interface AccessExecution
-
archive
public ArchivedExecution archive()
- Specified by: archive in interface org.apache.flink.api.common.Archiveable<ArchivedExecution>
-
-