public class SIB extends CentroidClustering<double[],smile.util.SparseArray>
In analogy to K-Means, SIB's update formulas are essentially the same as those of the EM algorithm for estimating a finite Gaussian mixture model, with the regular Euclidean distance replaced by Kullback-Leibler divergence, which is a better dissimilarity measure for co-occurrence data. However, the common batch updating rule of K-Means (assign all instances to the nearest centroids, then update the centroids) does not work for SIB, which has to proceed sequentially: each instance is reassigned (if a better cluster is found) and the affected centroids are updated immediately. This is likely because K-L divergence is very sensitive, so under a batch updating rule the centroids may change dramatically in each iteration.
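To make the sequential updating rule concrete, here is a rough sketch under simplifying assumptions: dense arrays, mean-style centroids, and plain KL divergence standing in for the dissimilarity (the note below describes the criterion this implementation actually uses). The method names and the singleton guard are invented for illustration; this is not the library's internal code.

```java
// Illustrative sketch of one sequential pass (not the library's code): each
// sample is removed from its current cluster, reassigned to the closest
// centroid, and the affected centroids are updated at once.
static void sequentialPass(double[][] data, double[][] centroids, int[] size, int[] y) {
    for (int i = 0; i < data.length; i++) {
        int old = y[i];
        if (size[old] <= 1) continue; // keep singleton clusters intact in this sketch

        // downdate the centroid the sample currently belongs to
        for (int j = 0; j < centroids[old].length; j++) {
            centroids[old][j] = (centroids[old][j] * size[old] - data[i][j]) / (size[old] - 1);
        }
        size[old]--;

        // pick the nearest centroid under KL divergence
        int best = 0;
        double min = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = kl(data[i], centroids[c]);
            if (d < min) { min = d; best = c; }
        }

        // immediately fold the sample into the chosen centroid
        for (int j = 0; j < centroids[best].length; j++) {
            centroids[best][j] = (centroids[best][j] * size[best] + data[i][j]) / (size[best] + 1);
        }
        size[best]++;
        y[i] = best;
    }
}

// KL divergence of p from q, skipping zero entries.
static double kl(double[] p, double[] q) {
    double d = 0.0;
    for (int j = 0; j < p.length; j++) {
        if (p[j] > 0 && q[j] > 0) d += p[j] * Math.log(p[j] / q[j]);
    }
    return d;
}
```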
Note that this implementation differs slightly from the original paper, which employs a weighted Jensen-Shannon divergence as the criterion for moving a randomly picked sample to a different cluster. In our experience this does not work well in some cases, probably because the weighted JS divergence gives too much weight to the clusters, which are much larger than a single sample. In this implementation we use the regular, unweighted Jensen-Shannon divergence instead.
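For reference, the unweighted Jensen-Shannon divergence between two distributions p and q can be computed as below. This helper is only illustrative and is not part of the class's API.

```java
// Unweighted Jensen-Shannon divergence JS(p, q) = (KL(p, m) + KL(q, m)) / 2,
// where m is the element-wise average of p and q. Inputs are assumed to be
// probability vectors (non-negative, summing to 1).
static double jensenShannon(double[] p, double[] q) {
    double js = 0.0;
    for (int j = 0; j < p.length; j++) {
        double m = 0.5 * (p[j] + q[j]);
        if (p[j] > 0) js += 0.5 * p[j] * Math.log(p[j] / m);
        if (q[j] > 0) js += 0.5 * q[j] * Math.log(q[j] / m);
    }
    return js;
}
```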
Fields inherited from class smile.clustering.CentroidClustering: centroids, distance, distortion
Fields inherited from class smile.clustering.PartitionClustering: k, OUTLIER, size, y
| Constructor and Description |
|---|
| SIB(double distortion, double[][] centroids, int[] y) Constructor. |
| Modifier and Type | Method and Description |
|---|---|
| static SIB | fit(smile.util.SparseArray[] data, int k) Clustering data into k clusters up to 100 iterations. |
| static SIB | fit(smile.util.SparseArray[] data, int k, int maxIter) Clustering data into k clusters. |
Methods inherited from class smile.clustering.CentroidClustering: compareTo, predict, toString
Methods inherited from class smile.clustering.PartitionClustering: run, seed
public static SIB fit(smile.util.SparseArray[] data, int k)
Parameters:
data - the sparse normalized co-occurrence dataset, where each row is an observation whose entries sum to 1.
k - the number of clusters.

public static SIB fit(smile.util.SparseArray[] data, int k, int maxIter)

Parameters:
data - the sparse normalized co-occurrence dataset, where each row is an observation whose entries sum to 1.
k - the number of clusters.
maxIter - the maximum number of iterations.
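Below is a minimal usage sketch. The toy counts and their conversion to row-normalized sparse vectors are illustrative, and it is assumed that smile.util.SparseArray offers a no-argument constructor and a set(int index, double value) method; fit and the inherited predict method and y field are the members documented above.

```java
import java.util.Arrays;
import smile.clustering.SIB;
import smile.util.SparseArray;

public class SIBExample {
    public static void main(String[] args) {
        // Toy co-occurrence counts: 4 "documents" over a vocabulary of 3 terms.
        double[][] raw = {
            {2, 1, 0},
            {3, 1, 0},
            {0, 1, 2},
            {0, 1, 3}
        };

        // Normalize each row to sum to 1 and store it sparsely, as fit() expects.
        SparseArray[] data = new SparseArray[raw.length];
        for (int i = 0; i < raw.length; i++) {
            double sum = 0;
            for (double v : raw[i]) sum += v;
            data[i] = new SparseArray();                            // assumed no-arg constructor
            for (int j = 0; j < raw[i].length; j++) {
                if (raw[i][j] > 0) data[i].set(j, raw[i][j] / sum); // assumed set(int, double)
            }
        }

        SIB model = SIB.fit(data, 2);                    // 2 clusters, up to 100 iterations
        System.out.println(Arrays.toString(model.y));    // cluster labels of the training data
        System.out.println(model.predict(data[0]));      // inherited predict
    }
}
```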