public class SIB extends PartitionClustering<double[]>
In analogy to K-Means, SIB's update formulas are essentially the same as those of the EM algorithm for estimating a finite Gaussian mixture model, with the regular Euclidean distance replaced by the Kullback-Leibler divergence, which is a better dissimilarity measure for co-occurrence data. However, the common batch update rule of K-Means (assign all instances to their nearest centroids, then update the centroids) does not work for SIB, which has to proceed sequentially: each instance is reassigned (if the move improves the objective) and the affected centroids are updated immediately. This is probably because the K-L divergence is very sensitive, so under a batch update rule the centroids may change drastically within a single iteration.
Note that this implementation differs slightly from the original paper, in which a weighted Jensen-Shannon divergence is employed as the criterion for assigning a randomly picked sample to a different cluster. In our experience this does not work well in some cases, probably because the weighted JS divergence gives too much weight to the clusters, which are much larger than a single sample. This implementation therefore uses the regular, unweighted Jensen-Shannon divergence instead.
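For illustration, here is a minimal usage sketch against the constructors and methods documented below. It assumes the class lives in the smile.clustering package (as in the Smile library) and uses a tiny made-up data matrix whose rows are already normalized to sum to 1, as the constructors require.

```java
import smile.clustering.SIB;

public class SIBExample {
    public static void main(String[] args) {
        // Toy normalized co-occurrence data: each row sums to 1,
        // as required by the constructors documented below.
        double[][] data = {
            {0.7, 0.2, 0.1},
            {0.6, 0.3, 0.1},
            {0.1, 0.2, 0.7},
            {0.1, 0.3, 0.6}
        };

        // Partition the data into k = 2 clusters with at most 100 iterations.
        SIB sib = new SIB(data, 2, 100);

        // Inspect the result: distortion of the partition and the centroids.
        System.out.println("distortion = " + sib.distortion());
        double[][] centroids = sib.centroids();
        System.out.println("number of centroids = " + centroids.length);

        // Assign a new (normalized) instance to its nearest cluster.
        double[] x = {0.65, 0.25, 0.10};
        System.out.println("cluster of x = " + sib.predict(x));
    }
}
```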
Fields inherited from class PartitionClustering: k, size, y
Fields inherited from interface Clustering: OUTLIER
| Constructor and Description |
|---|
| SIB(double[][] data, int k) Constructor. |
| SIB(double[][] data, int k, int maxIter) Constructor. |
| SIB(double[][] data, int k, int maxIter, int runs) Constructor. |
| SIB(smile.data.SparseDataset data, int k) Constructor. |
| SIB(smile.data.SparseDataset data, int k, int maxIter) Constructor. |
| SIB(smile.data.SparseDataset data, int k, int maxIter, int runs) Constructor. |
| Modifier and Type | Method and Description |
|---|---|
| double[][] | centroids() Returns the centroids. |
| double | distortion() Returns the distortion. |
| int | predict(double[] x) Cluster a new instance. |
| int | predict(smile.math.SparseArray x) Cluster a new instance. |
| java.lang.String | toString() |
Methods inherited from class PartitionClustering: getClusterLabel, getClusterSize, getNumClusters, seed, seed
public SIB(double[][] data, int k)

Parameters:
    data - the normalized co-occurrence input data of which each row is a sample with sum 1.
    k - the number of clusters.

public SIB(double[][] data, int k, int maxIter)

Parameters:
    data - the input data of which each row is a sample.
    k - the number of clusters.
    maxIter - the maximum number of iterations.

public SIB(double[][] data, int k, int maxIter, int runs)

Parameters:
    data - the input data of which each row is a sample.
    k - the number of clusters.
    maxIter - the maximum number of iterations.
    runs - the number of runs of the SIB algorithm.

public SIB(smile.data.SparseDataset data, int k)

Parameters:
    data - the sparse normalized co-occurrence dataset of which each row is a sample with sum 1.
    k - the number of clusters.

public SIB(smile.data.SparseDataset data, int k, int maxIter)

Parameters:
    data - the sparse normalized co-occurrence dataset of which each row is a sample with sum 1.
    k - the number of clusters.
    maxIter - the maximum number of iterations.

public SIB(smile.data.SparseDataset data, int k, int maxIter, int runs)

Parameters:
    data - the sparse normalized co-occurrence dataset of which each row is a sample with sum 1.
    k - the number of clusters.
    maxIter - the maximum number of iterations.
    runs - the number of runs of the SIB algorithm.
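As a second sketch, the constructors that take a runs argument allow multiple random restarts. The following hypothetical comparison of a single run against ten runs uses only the constructors documented above (again assuming the smile.clustering package and rows normalized to sum to 1):

```java
import smile.clustering.SIB;

public class SIBRestartExample {
    public static void main(String[] args) {
        // Illustrative normalized co-occurrence data; each row sums to 1.
        double[][] data = {
            {0.5, 0.4, 0.1},
            {0.4, 0.5, 0.1},
            {0.1, 0.1, 0.8},
            {0.2, 0.1, 0.7}
        };

        // Single run, at most 100 iterations.
        SIB single = new SIB(data, 2, 100);

        // Ten runs with different random initializations; presumably the
        // run with the lowest distortion is the one that is returned.
        SIB multi = new SIB(data, 2, 100, 10);

        System.out.println("single-run distortion = " + single.distortion());
        System.out.println("multi-run  distortion = " + multi.distortion());
    }
}
```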
public int predict(double[] x)

Parameters:
    x - a new instance.

public int predict(smile.math.SparseArray x)

Parameters:
    x - a new instance.

public double distortion()

public double[][] centroids()

public java.lang.String toString()

Overrides:
    toString in class java.lang.Object