Class Execution

  • All Implemented Interfaces:
    org.apache.flink.api.common.Archiveable<ArchivedExecution>, AccessExecution, LogicalSlot.Payload

    public class Execution
    extends Object
    implements AccessExecution, org.apache.flink.api.common.Archiveable<ArchivedExecution>, LogicalSlot.Payload
    A single execution of a vertex. While an ExecutionVertex can be executed multiple times (for recovery, re-computation, re-configuration), this class tracks the state of a single execution of that vertex and the resources.

    Lock free state transitions

    At several points in the code, we need to deal with possible concurrent state changes and actions. For example, the task may be cancelled while the call to deploy it (sending it to the TaskManager) is still in progress.

    We could lock the entire portion of the code (decision to deploy, deploy, set state to running) such that it is guaranteed that any "cancel command" will only pick up after deployment is done and that the "cancel command" call will never overtake the deploying call.

    This would block the threads for a long time, because the remote calls may take long. Depending on their locking behavior, it may even result in distributed deadlocks (unless carefully avoided). We therefore use atomic state updates and occasional double-checking to ensure that the state after a completed call is as expected, and trigger correcting actions if it is not. Many actions are also idempotent (like canceling).
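
    The idea above can be illustrated with a minimal, self-contained sketch. It is a toy model, not Flink's implementation: the class name, the reduced State enum, and the correctiveCancels counter are all hypothetical. It shows an atomic compare-and-set before the remote call, a double-check after it, and an idempotent cancel.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of lock-free state transitions; not Flink's actual implementation. */
final class LockFreeStateSketch {

    /** Hypothetical, reduced state set used only for this sketch. */
    enum State { CREATED, DEPLOYING, RUNNING, CANCELED }

    private final AtomicReference<State> state = new AtomicReference<>(State.CREATED);
    // Counts simulated corrective "cancel" calls; only touched by the deploying thread.
    private int correctiveCancels = 0;

    State current() { return state.get(); }

    int correctiveCancels() { return correctiveCancels; }

    /** Deploys without holding a lock across the (simulated) remote call. */
    boolean deploy(Runnable remoteDeployCall) {
        if (!state.compareAndSet(State.CREATED, State.DEPLOYING)) {
            return false; // another action changed the state first
        }
        remoteDeployCall.run(); // may be slow; a cancel can arrive while this runs
        // Double-check: only move to RUNNING if the state is still DEPLOYING.
        if (state.compareAndSet(State.DEPLOYING, State.RUNNING)) {
            return true;
        }
        correctiveCancels++; // correcting action: undo the deployment that was overtaken
        return false;
    }

    /** Idempotent: repeated calls leave the state at CANCELED. */
    void cancel() {
        State seen;
        do {
            seen = state.get();
        } while (seen != State.CANCELED && !state.compareAndSet(seen, State.CANCELED));
    }

    public static void main(String[] args) {
        LockFreeStateSketch execution = new LockFreeStateSketch();
        // Simulate a cancel that arrives while the deploy call is in flight.
        boolean deployed = execution.deploy(execution::cancel);
        System.out.println(deployed + " " + execution.current());
    }
}
```

    No lock is held while the remote call runs; the second compare-and-set detects that the cancel overtook the deployment and triggers the corrective step instead.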

    • Constructor Detail

      • Execution

        public Execution​(Executor executor,
                         ExecutionVertex vertex,
                         int attemptNumber,
                         long startTimestamp,
                         Duration rpcTimeout)
        Creates a new Execution attempt.
        Parameters:
        executor - The executor used to dispatch callbacks from futures and asynchronous RPC calls.
        vertex - The execution vertex to which this Execution belongs
        attemptNumber - The execution attempt number.
        startTimestamp - The timestamp that marks the creation of this Execution
        rpcTimeout - The timeout for RPC calls like deploy/cancel/stop.
    • Method Detail

      • getAttemptNumber

        public int getAttemptNumber()
        Description copied from interface: AccessExecution
        Returns the attempt number for this execution.
        Specified by:
        getAttemptNumber in interface AccessExecution
        Returns:
        attempt number for this execution.
      • getAssignedAllocationID

        @Nullable
        public AllocationID getAssignedAllocationID()
      • getAssignedResource

        public LogicalSlot getAssignedResource()
      • tryAssignResource

        public boolean tryAssignResource​(LogicalSlot logicalSlot)
        Tries to assign the given slot to the execution. The assignment works only if the Execution is in state SCHEDULED. Returns true, if the resource could be assigned.
        Parameters:
        logicalSlot - to assign to this execution
        Returns:
        true if the slot could be assigned to the execution, otherwise false
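
        The state-gated behavior described for tryAssignResource can be sketched with a toy model. The class, the reduced State enum, and the String slot stand-in are hypothetical; rejecting a second assignment is an assumption of this sketch, not something stated above.

```java
/** Toy model of state-gated slot assignment; not Flink's actual implementation. */
final class SlotAssignmentSketch {

    /** Hypothetical, reduced state set used only for this sketch. */
    enum State { CREATED, SCHEDULED, DEPLOYING }

    private State state = State.CREATED;
    private String assignedSlot; // stand-in for a LogicalSlot

    void setState(State newState) { state = newState; }

    String getAssignedSlot() { return assignedSlot; }

    /** Assigns only in SCHEDULED; refusing a second assignment is an assumption of this sketch. */
    boolean tryAssign(String slot) {
        if (state != State.SCHEDULED || assignedSlot != null) {
            return false;
        }
        assignedSlot = slot;
        return true;
    }

    public static void main(String[] args) {
        SlotAssignmentSketch execution = new SlotAssignmentSketch();
        System.out.println(execution.tryAssign("slot-1")); // state is still CREATED
        execution.setState(State.SCHEDULED);
        System.out.println(execution.tryAssign("slot-1")); // state is now SCHEDULED
    }
}
```
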
      • getNextInputSplit

        public Optional<org.apache.flink.core.io.InputSplit> getNextInputSplit()
      • getFailureInfo

        public Optional<ErrorInfo> getFailureInfo()
        Description copied from interface: AccessExecution
        Returns the exception that caused the job to fail. This is the first root exception that was not recoverable and triggered job failure.
        Specified by:
        getFailureInfo in interface AccessExecution
        Returns:
        an Optional of ErrorInfo containing the Throwable and the time it was registered if an error occurred. If no error occurred an empty Optional will be returned.
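
        A caller would typically branch on the returned Optional rather than null-check it. The sketch below uses a hypothetical ErrorInfoStandIn class in place of Flink's ErrorInfo (only the Throwable and the timestamp mentioned above are modeled), so it runs without Flink on the classpath.

```java
import java.util.Optional;

/** Sketch of consuming an Optional failure description; ErrorInfoStandIn is hypothetical. */
final class FailureInfoSketch {

    /** Stand-in for ErrorInfo: a Throwable plus the time it was registered. */
    static final class ErrorInfoStandIn {
        final Throwable exception;
        final long timestamp;

        ErrorInfoStandIn(Throwable exception, long timestamp) {
            this.exception = exception;
            this.timestamp = timestamp;
        }
    }

    /** Renders the failure, or a fixed message when the Optional is empty. */
    static String describe(Optional<ErrorInfoStandIn> failureInfo) {
        return failureInfo
                .map(info -> "failed at " + info.timestamp + ": " + info.exception.getMessage())
                .orElse("no failure recorded");
    }

    public static void main(String[] args) {
        System.out.println(describe(Optional.empty()));
        System.out.println(
                describe(Optional.of(new ErrorInfoStandIn(new RuntimeException("boom"), 42L))));
    }
}
```
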
      • isFinished

        public boolean isFinished()
      • setInitialState

        public void setInitialState​(JobManagerTaskRestore taskRestore)
        Sets the initial state for the execution. The serialized state is then shipped via the TaskDeploymentDescriptor to the TaskManagers.
        Parameters:
        taskRestore - information to restore the state
      • getInitializingOrRunningFuture

        public CompletableFuture<?> getInitializingOrRunningFuture()
        Gets a future that completes once the task execution reaches one of the states ExecutionState.INITIALIZING or ExecutionState.RUNNING. If this task never reaches these states (for example because the task is cancelled before it was properly deployed and restored), then this future will never complete.

        The future is completed already in the ExecutionState.INITIALIZING state, because various running actions are already possible in that state (the task already accepts and sends events and network data for task recovery). (Note that in earlier versions, the INITIALIZING state was not separate but part of the RUNNING state).

        This future is always completed from the job master's main thread.
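
        Because this future may never complete, a caller that blocks on it should bound the wait. The following toy model is a sketch under that assumption; the class, the reduced State enum, and the reachedWithin helper are hypothetical and do not reproduce Flink's code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model of a future completed on INITIALIZING or RUNNING; not Flink's implementation. */
final class InitializingOrRunningSketch {

    /** Hypothetical, reduced state set used only for this sketch. */
    enum State { CREATED, DEPLOYING, INITIALIZING, RUNNING, CANCELED }

    private final CompletableFuture<Void> initializingOrRunning = new CompletableFuture<>();

    /** Completes the future on the first transition to INITIALIZING or RUNNING. */
    void transitionTo(State target) {
        if (target == State.INITIALIZING || target == State.RUNNING) {
            initializingOrRunning.complete(null); // no-op if already completed
        }
    }

    CompletableFuture<?> getInitializingOrRunningFuture() {
        return initializingOrRunning;
    }

    /** Waits with a bound, since the future may never complete. */
    static boolean reachedWithin(CompletableFuture<?> future, long millis) {
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        InitializingOrRunningSketch cancelledEarly = new InitializingOrRunningSketch();
        cancelledEarly.transitionTo(State.DEPLOYING);
        cancelledEarly.transitionTo(State.CANCELED);
        // The future stays incomplete, so the bounded wait reports false.
        System.out.println(reachedWithin(cancelledEarly.getInitializingOrRunningFuture(), 10));
    }
}
```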

      • getTerminalStateFuture

        public CompletableFuture<ExecutionState> getTerminalStateFuture()
        Gets a future that completes once the task execution reaches a terminal state. The future will be completed with the specific state that the execution reached. This future is always completed from the job master's main thread.
        Specified by:
        getTerminalStateFuture in interface LogicalSlot.Payload
        Returns:
        A future which is completed once the execution reaches a terminal state
      • getReleaseFuture

        public CompletableFuture<?> getReleaseFuture()
        Gets the release future which is completed once the execution reaches a terminal state and the assigned resource has been released. This future is always completed from the job master's main thread.
        Returns:
        A future which is completed once the assigned resource has been released
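
        The relationship between the terminal state future and the release future can be sketched by composing two futures. This is a toy model: the slotReleased future and the way the two are chained are assumptions made for illustration, not Flink's actual wiring.

```java
import java.util.concurrent.CompletableFuture;

/** Toy model: release completes only after a terminal state AND the slot release. */
final class ReleaseFutureSketch {

    /** Hypothetical, reduced set of terminal states used only for this sketch. */
    enum State { FINISHED, CANCELED, FAILED }

    final CompletableFuture<State> terminalStateFuture = new CompletableFuture<>();

    // Hypothetical signal that the assigned resource has been returned.
    final CompletableFuture<Void> slotReleased = new CompletableFuture<>();

    // Completes once both of the futures above have completed.
    final CompletableFuture<Void> releaseFuture =
            terminalStateFuture.thenCompose(terminalState -> slotReleased);

    public static void main(String[] args) {
        ReleaseFutureSketch execution = new ReleaseFutureSketch();
        execution.terminalStateFuture.complete(State.FINISHED);
        System.out.println(execution.releaseFuture.isDone()); // slot not released yet
        execution.slotReleased.complete(null);
        System.out.println(execution.releaseFuture.isDone());
    }
}
```
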
      • recoverExecution

        public void recoverExecution​(ExecutionAttemptID attemptId,
                                     TaskManagerLocation location,
                                     Map<String,​org.apache.flink.api.common.accumulators.Accumulator<?,​?>> userAccumulators,
                                     IOMetrics metrics)
        Recover the execution attempt status after JM failover.
      • deploy

        public void deploy()
                    throws JobException
        Deploys the execution to the previously assigned resource.
        Throws:
        JobException - if the execution cannot be deployed to the assigned resource
      • cancel

        public void cancel()
      • fail

        public void fail​(Throwable t)
        This method fails the vertex due to an external condition. The task will move to state FAILED. If the task was in state RUNNING or DEPLOYING before, it will send a cancel call to the TaskManager.
        Specified by:
        fail in interface LogicalSlot.Payload
        Parameters:
        t - The exception that caused the task to fail.
      • notifyCheckpointOnComplete

        public void notifyCheckpointOnComplete​(long completedCheckpointId,
                                               long completedTimestamp,
                                               long lastSubsumedCheckpointId)
        Notify the task of this execution about a completed checkpoint and the last subsumed checkpoint id if possible.
        Parameters:
        completedCheckpointId - of the completed checkpoint
        completedTimestamp - of the completed checkpoint
        lastSubsumedCheckpointId - of the last subsumed checkpoint, a value of CheckpointStoreUtil.INVALID_CHECKPOINT_ID means no checkpoint has been subsumed.
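
        A receiver of this notification has to treat the sentinel specially. The sketch below converts the raw id into an OptionalLong. The local constant is only a stand-in for CheckpointStoreUtil.INVALID_CHECKPOINT_ID, and its value here is an assumption of the sketch; real code should reference the Flink constant directly.

```java
import java.util.OptionalLong;

/** Sketch of sentinel handling for the last subsumed checkpoint id. */
final class SubsumedCheckpointSketch {

    /** Stand-in for CheckpointStoreUtil.INVALID_CHECKPOINT_ID; the value is assumed here. */
    static final long INVALID_CHECKPOINT_ID = -1L;

    /** Returns empty when no checkpoint has been subsumed, otherwise the id. */
    static OptionalLong subsumedCheckpoint(long lastSubsumedCheckpointId) {
        return lastSubsumedCheckpointId == INVALID_CHECKPOINT_ID
                ? OptionalLong.empty()
                : OptionalLong.of(lastSubsumedCheckpointId);
    }

    public static void main(String[] args) {
        System.out.println(subsumedCheckpoint(INVALID_CHECKPOINT_ID).isPresent());
        System.out.println(subsumedCheckpoint(7L).getAsLong());
    }
}
```
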
      • notifyCheckpointAborted

        public void notifyCheckpointAborted​(long abortCheckpointId,
                                            long latestCompletedCheckpointId,
                                            long timestamp)
        Notify the task of this execution about an aborted checkpoint.
        Parameters:
        abortCheckpointId - of the aborted checkpoint
        latestCompletedCheckpointId - of the latest completed checkpoint
        timestamp - of the aborted checkpoint
      • triggerCheckpoint

        public CompletableFuture<Acknowledge> triggerCheckpoint​(long checkpointId,
                                                                long timestamp,
                                                                CheckpointOptions checkpointOptions)
        Trigger a new checkpoint on the task of this execution.
        Parameters:
        checkpointId - of the checkpoint to trigger
        timestamp - of the checkpoint to trigger
        checkpointOptions - of the checkpoint to trigger
        Returns:
        Future acknowledge which is returned once the checkpoint has been triggered
      • triggerSynchronousSavepoint

        public CompletableFuture<Acknowledge> triggerSynchronousSavepoint​(long checkpointId,
                                                                          long timestamp,
                                                                          CheckpointOptions checkpointOptions)
        Trigger a new synchronous savepoint on the task of this execution.
        Parameters:
        checkpointId - of the checkpoint to trigger
        timestamp - of the checkpoint to trigger
        checkpointOptions - of the checkpoint to trigger
        Returns:
        Future acknowledge which is returned once the checkpoint has been triggered
      • sendOperatorEvent

        public CompletableFuture<Acknowledge> sendOperatorEvent​(OperatorID operatorId,
                                                                org.apache.flink.util.SerializedValue<OperatorEvent> event)
        Sends the operator event to the Task on the Task Executor.
        Returns:
        True if the message was sent, false if the task is currently not running.
      • markFailed

        public void markFailed​(Throwable t)
        This method marks the task as failed, but will make no attempt to remove the task execution from the TaskManager. It is intended for cases where the task is known not to be running, or when the TaskManager reports the failure (in which case it has already removed the task).
        Parameters:
        t - The exception that caused the task to fail.
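
        The difference between fail and markFailed, as described in their documentation, can be summarized in a toy model. The class, the reduced State enum, and the cancelRpcsSent counter are hypothetical; the sketch only mirrors the documented behavior and omits everything else the real methods do.

```java
/** Toy model contrasting fail() and markFailed(); not Flink's actual implementation. */
final class FailVsMarkFailedSketch {

    /** Hypothetical, reduced state set used only for this sketch. */
    enum State { CREATED, DEPLOYING, RUNNING, FAILED }

    private State state;
    private int cancelRpcsSent = 0; // counts simulated cancel calls to the TaskManager

    FailVsMarkFailedSketch(State initialState) {
        this.state = initialState;
    }

    State current() { return state; }

    int cancelRpcsSent() { return cancelRpcsSent; }

    /** External failure: the task may still be running remotely, so cancel it there. */
    void fail(Throwable cause) {
        State previous = state;
        state = State.FAILED;
        if (previous == State.RUNNING || previous == State.DEPLOYING) {
            cancelRpcsSent++;
        }
    }

    /** Failure of a task known not to be running: no call to the TaskManager. */
    void markFailed(Throwable cause) {
        state = State.FAILED;
    }

    public static void main(String[] args) {
        FailVsMarkFailedSketch running = new FailVsMarkFailedSketch(State.RUNNING);
        running.fail(new RuntimeException("external condition"));
        System.out.println(running.current() + " " + running.cancelRpcsSent());
    }
}
```
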
      • markFinished

        @VisibleForTesting
        public void markFinished()
      • transitionState

        public void transitionState​(ExecutionState targetState)
      • getVertexWithAttempt

        public String getVertexWithAttempt()
      • setAccumulators

        public void setAccumulators​(Map<String,​org.apache.flink.api.common.accumulators.Accumulator<?,​?>> userAccumulators)
        Update accumulators (discarded when the Execution has already been terminated).
        Parameters:
        userAccumulators - the user accumulators
      • getUserAccumulators

        public Map<String,​org.apache.flink.api.common.accumulators.Accumulator<?,​?>> getUserAccumulators()
      • getParallelSubtaskIndex

        public int getParallelSubtaskIndex()
        Description copied from interface: AccessExecution
        Returns the subtask index of this execution.
        Specified by:
        getParallelSubtaskIndex in interface AccessExecution
        Returns:
        subtask index of this execution.