public class HAManager extends Object
Handles HA
We compute failover and whether there is a quorum synchronously as we receive nodeAdded and nodeRemoved events from the cluster manager.
It’s vital that this is done synchronously as the cluster manager only guarantees that the set of nodes retrieved from getNodes() is the same for each node in the cluster when processing the exact same nodeAdded/nodeRemoved event.
As HA modules are deployed, if a quorum has been attained they are deployed immediately, otherwise the deployment information is added to a list.
Periodically we check the value of attainedQuorum and if true we deploy any HA deployments waiting for a quorum.
If false, we check if there are any HA deployments current deployed, and if so undeploy them, and add them to the list of deployments waiting for a quorum.
By doing this check periodically we can avoid race conditions resulting in modules being deployed after a quorum has been lost, and without having to resort to exclusive locking which is actually quite tricky here, and prone to deadlock·
We maintain a clustered map where the key is the node id and the value is some stringified JSON which describes the group of the cluster and an array of the HA modules deployed on that node.
There is an entry in the map for each node of the cluster.
When a node joins the cluster or an HA module is deployed or undeployed that entry is updated.
When a node leaves the cluster cleanly, it removes it’s own entry before leaving.
When the cluster manager sends us an event to say a node has left the cluster we check if its entry in the cluster map is there, and if so we infer a clean close has happened and no failover will occur.
If the map entry is there it implies the node died suddenly. In that case each node of the cluster must compute whether it is the failover node for the failed node.
First each node of the cluster determines whether it is in the same group as the failed node, if not then it will not be a candidate for the failover node. Nodes in the cluster only failover to other nodes in the same group.
If the node is in the same group then the node takes the UUID of the failed node, computes the hash-code and chooses a node from the list of nodes in the cluster by taking the hash-code modulo the number of nodes as an index to the list of nodes.
The cluster manager guarantees each node in the cluster sees the same set of nodes for each membership event that is processed. Therefore it is guaranteed that each node in the cluster will compute the same value. It is critical that any cluster manager implementation provides this guarantee!
Once the value has been computed, it is compared to the current node, and if it is the same the current node assumes failover for the failed node.
During failover the failover node deploys all the HA modules from the failed node, as described in the JSON with the same values of config and instances.
Once failover is complete the failover node removes the cluster map entry for the failed node.
If the failover node itself fails while it is processing failover for another node, then this is also checked by other nodes when they detect the failure of the second node.
Constructor and Description |
---|
HAManager(VertxInternal vertx,
DeploymentManager deploymentManager,
ClusterManager clusterManager,
int quorumSize,
String group) |
Modifier and Type | Method and Description |
---|---|
void |
deployVerticle(String verticleName,
DeploymentOptions deploymentOptions,
Handler<AsyncResult<String>> doneHandler) |
void |
failDuringFailover(boolean fail) |
void |
failoverCompleteHandler(Handler<Boolean> failoverCompleteHandler) |
boolean |
isKilled() |
void |
removeFromHA(String depID) |
void |
simulateKill() |
void |
stop() |
public HAManager(VertxInternal vertx, DeploymentManager deploymentManager, ClusterManager clusterManager, int quorumSize, String group)
public void removeFromHA(String depID)
public void deployVerticle(String verticleName, DeploymentOptions deploymentOptions, Handler<AsyncResult<String>> doneHandler)
public void stop()
public void simulateKill()
public void failoverCompleteHandler(Handler<Boolean> failoverCompleteHandler)
public boolean isKilled()
public void failDuringFailover(boolean fail)
Copyright © 2014. All Rights Reserved.