Implements a soft-margin SVM using the communication-efficient distributed dual coordinate
ascent algorithm (CoCoA) with the hinge-loss function.
It can be used for binary classification problems, with the labels set as +1.0 to indicate a
positive example and -1.0 to indicate a negative example.
The algorithm solves the following minimization problem:
min_{w in R^d} lambda/2 ||w||^2 + 1/n sum_{i=1}^n l_i(w^T x_i)
with w being the weight vector, lambda being the regularization constant,
x_i in R^d being the data points and l_i being the convex loss functions, which
can also depend on the labels y_i in R.
In the current implementation the regularizer is the 2-norm and the loss functions are the
hinge-loss functions:
l_i(w^T x_i) = max(0, 1 - y_i * w^T x_i)
With these choices, the problem definition is equivalent to an SVM with soft margin,
so the algorithm trains a soft-margin SVM.
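For concreteness, the objective above can be evaluated directly. The following is a minimal sketch in plain Scala; `hingeLoss` and `objective` are illustrative helper names, not part of the Flink API:

```scala
// Minimal sketch: hinge loss l_i and the regularized primal objective.
def hingeLoss(y: Double, score: Double): Double =
  math.max(0.0, 1.0 - y * score)

def objective(
    w: Array[Double],
    data: Seq[(Array[Double], Double)], // (x_i, y_i) pairs with y_i in {+1, -1}
    lambda: Double): Double = {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum
  val reg = lambda / 2.0 * dot(w, w)              // lambda/2 ||w||^2
  val loss = data.map { case (x, y) =>
    hingeLoss(y, dot(w, x))                       // l_i(w^T x_i)
  }.sum / data.size
  reg + loss
}
```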
The minimization problem is solved by applying stochastic dual coordinate ascent (SDCA).
In order to make the algorithm efficient in a distributed setting, the CoCoA algorithm
calculates several iterations of SDCA locally on a data block before merging the local
updates into a valid global state.
This state is redistributed to the different data partitions where the next round of local
SDCA iterations is then executed.
The number of outer iterations and local SDCA iterations control the overall network costs,
because network communication is only required once per outer iteration.
The local SDCA iterations are embarrassingly parallel once the individual data partitions have
been distributed across the cluster.
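The outer-loop/local-iteration structure can be sketched as a serial simulation in plain Scala. This illustrates only the CoCoA schedule, using a simplified SDCA hinge-loss update as the local solver; it is not the Flink implementation, and all names are assumptions:

```scala
import scala.util.Random

// Toy serial simulation of the CoCoA schedule. Each outer iteration runs
// localIters SDCA steps on every block against a snapshot of the global
// weight vector, then merges the per-block deltas scaled by stepsize/blocks.
case class Point(x: Array[Double], y: Double) // y is +1.0 or -1.0

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum

def cocoa(blocks: IndexedSeq[Array[Point]], dim: Int, lambda: Double,
          outerIters: Int, localIters: Int, stepsize: Double,
          seed: Long): Array[Double] = {
  val n = blocks.map(_.length).sum
  val rng = new Random(seed)
  val w = Array.fill(dim)(0.0)
  val alphas = blocks.map(b => Array.fill(b.length)(0.0)) // dual variables
  for (_ <- 0 until outerIters) {
    // Local phase: embarrassingly parallel in the real algorithm.
    val deltas = blocks.indices.map { b =>
      val block = blocks(b)
      val alpha = alphas(b)
      val wLocal = w.clone()
      val dw = Array.fill(dim)(0.0)
      for (_ <- 0 until localIters) {
        val i = rng.nextInt(block.length)
        val p = block(i)
        val xNorm = dot(p.x, p.x)
        if (xNorm > 0.0) {
          // SDCA hinge-loss update, clipped so that alpha_i stays in [0, 1].
          val raw = (1.0 - p.y * dot(wLocal, p.x)) * lambda * n / xNorm
          val dAlpha = math.max(-alpha(i), math.min(1.0 - alpha(i), raw))
          alpha(i) += dAlpha
          for (j <- 0 until dim) {
            val upd = dAlpha * p.y * p.x(j) / (lambda * n)
            wLocal(j) += upd
            dw(j) += upd
          }
        }
      }
      dw
    }
    // Reduce phase: the only step that needs network communication.
    for (dw <- deltas; j <- 0 until dim)
      w(j) += stepsize / blocks.length * dw(j)
  }
  w
}
```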
Further details of the algorithm can be found in the CoCoA paper,
"Communication-Efficient Distributed Dual Coordinate Ascent" (Jaggi et al., 2014).
Example:
val trainingDS: DataSet[LabeledVector] = env.readLibSVM(pathToTrainingFile)
val svm = SVM()
.setBlocks(10)
svm.fit(trainingDS)
val testingDS: DataSet[Vector] = env.readLibSVM(pathToTestingFile)
.map(lv => lv.vector)
val predictionDS: DataSet[(Vector, Double)] = svm.predict(testingDS)
Parameters
org.apache.flink.ml.classification.SVM.Blocks:
Sets the number of blocks into which the input data will be split. On each block the local
stochastic dual coordinate ascent method is executed. This number should be set to at
least the degree of parallelism. If no value is specified, then the parallelism of the input
DataSet is used as the number of blocks. (Default value: None)
org.apache.flink.ml.classification.SVM.Iterations:
Defines the maximum number of iterations of the outer loop method. In other words, it defines
how often the SDCA method is applied to the blocked data. After each iteration, the locally
computed weight vector updates have to be reduced to update the global weight vector value.
The new weight vector is broadcast to all SDCA tasks at the beginning of each iteration.
(Default value: 10)
org.apache.flink.ml.classification.SVM.LocalIterations:
Defines the maximum number of SDCA iterations. In other words, it defines how many data points
are drawn from each local data block to calculate the stochastic dual coordinate ascent.
(Default value: 10)
org.apache.flink.ml.classification.SVM.Regularization:
Defines the regularization constant of the SVM algorithm. The higher the value, the smaller
the 2-norm of the weight vector will be. For an SVM with hinge loss this means that the
margin will be wider, even though it might contain some misclassified examples.
(Default value: 1.0)
org.apache.flink.ml.classification.SVM.Stepsize:
Defines the initial step size for the updates of the weight vector. The larger the step
size, the larger the contribution of each weight vector update to the next weight vector
value. The effective scaling of the updates is stepsize/blocks. This value has to be tuned
if the algorithm becomes unstable. (Default value: 1.0)
org.apache.flink.ml.classification.SVM.Seed:
Defines the seed to initialize the random number generator. The seed directly controls which
data points are chosen for the SDCA method. (Default value: Random value)
org.apache.flink.ml.classification.SVM.OutputDecisionFunction:
Determines whether the predict and evaluate functions of the SVM should return the distance
to the separating hyperplane, or binary class labels. Setting this to true will return the raw
distance to the hyperplane for each example. Setting it to false will return the binary
class label (+1.0, -1.0). (Default value: false)
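Putting the parameters together, a fully configured instance might look like the following sketch (setter names as listed above; `env` and `trainingDS` are assumed to exist as in the earlier example):

```scala
val svm = SVM()
  .setBlocks(env.getParallelism)
  .setIterations(100)
  .setLocalIterations(100)
  .setRegularization(0.01)
  .setStepsize(0.1)
  .setSeed(42)
  .setOutputDecisionFunction(true)

svm.fit(trainingDS)
```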