public class KMeans extends CentroidClustering<double[],double[]>
K-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest-neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model-based clustering, and Lloyd's algorithm as a variation of the EM algorithm.
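However, the k-means algorithm has at least two major theoretical shortcomings. First, its worst-case running time has been shown to be super-polynomial in the input size. Second, the approximation it finds can be arbitrarily bad with respect to the objective function compared to the optimal clustering.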
We also use k-d trees to speed up each k-means step, as described in the filter algorithm by Kanungo et al.
K-means is a hard clustering method, i.e. each observation is assigned to exactly one cluster. In contrast, soft clustering, e.g. the Expectation-Maximization algorithm for Gaussian mixtures, assigns observations to different clusters with different probabilities.
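
The difference between hard and soft assignment can be sketched in plain Java. The names below (AssignmentSketch, hardAssign, softAssign) are hypothetical and not part of this library. The soft variant assumes equally weighted spherical Gaussians with unit variance, which is only one simple instance of a Gaussian mixture.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the library documented here.
class AssignmentSketch {
    /** Squared Euclidean distance between two points of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Hard assignment: returns the index of the single nearest centroid. */
    static int hardAssign(double[] x, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int j = 0; j < centroids.length; j++) {
            double d = squaredDistance(x, centroids[j]);
            if (d < bestDist) {
                bestDist = d;
                best = j;
            }
        }
        return best;
    }

    /**
     * Soft assignment: returns one membership probability per cluster.
     * Assumption: equal mixture weights and unit-variance spherical Gaussians,
     * so the probability is proportional to exp(-0.5 * squared distance).
     */
    static double[] softAssign(double[] x, double[][] centroids) {
        int k = centroids.length;
        double[] logits = new double[k];
        double max = Double.NEGATIVE_INFINITY;
        for (int j = 0; j < k; j++) {
            logits[j] = -0.5 * squaredDistance(x, centroids[j]);
            max = Math.max(max, logits[j]);
        }
        // Subtract the maximum before exponentiating for numerical stability.
        double[] p = new double[k];
        double total = 0.0;
        for (int j = 0; j < k; j++) {
            p[j] = Math.exp(logits[j] - max);
            total += p[j];
        }
        for (int j = 0; j < k; j++) {
            p[j] /= total;
        }
        return p;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0.0, 0.0}, {2.0, 0.0}};
        double[] x = {0.8, 0.0};
        System.out.println("hard: " + hardAssign(x, centroids));
        System.out.println("soft: " + Arrays.toString(softAssign(x, centroids)));
    }
}
```

The hard assignment yields a single label, whereas the soft assignment yields a probability vector that sums to one, with the nearer centroid receiving the larger share.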
Inherited fields: centroids, distance, distortion, k, OUTLIER, size, y

| Constructor and Description |
|---|
| KMeans(double distortion, double[][] centroids, int[] y) Constructor. |
| KMeans(double distortion, double[][] centroids, int[] y, java.util.function.ToDoubleBiFunction<double[],double[]> distance) Constructor. |

| Modifier and Type | Method and Description |
|---|---|
| static KMeans | fit(BBDTree bbd, double[][] data, int k, int maxIter, double tol) Partitions data into k clusters. |
| static KMeans | fit(double[][] data, int k) Partitions data into k clusters, running at most 100 iterations. |
| static KMeans | fit(double[][] data, int k, int maxIter, double tol) Partitions data into k clusters, running at most maxIter iterations. |
| static KMeans | lloyd(double[][] data, int k) An implementation of Lloyd's algorithm, provided as a benchmark. |
| static KMeans | lloyd(double[][] data, int k, int maxIter, double tol) An implementation of Lloyd's algorithm, provided as a benchmark. |
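
To illustrate what the maxIter and tol parameters of the methods above typically control, here is a minimal, self-contained sketch of Lloyd's algorithm. It is not this library's implementation: the class name LloydSketch is hypothetical, the initialization simply takes the first k rows, and the stopping rule (stop once the distortion improves by no more than tol) is an assumed reading of "tolerance of convergence test".

```java
import java.util.Arrays;

// Hypothetical illustration class; not the library's KMeans.
class LloydSketch {
    final double[][] centroids;
    final int[] y;
    final double distortion;

    private LloydSketch(double[][] centroids, int[] y, double distortion) {
        this.centroids = centroids;
        this.y = y;
        this.distortion = distortion;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static LloydSketch lloyd(double[][] data, int k, int maxIter, double tol) {
        int n = data.length;
        int dim = data[0].length;

        // Assumption: naive initialization from the first k observations.
        double[][] centroids = new double[k][];
        for (int j = 0; j < k; j++) {
            centroids[j] = data[j].clone();
        }

        int[] y = new int[n];
        double distortion = Double.MAX_VALUE;

        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: label each observation with its nearest centroid
            // and accumulate the distortion measured against those centroids.
            double newDistortion = 0.0;
            for (int i = 0; i < n; i++) {
                double best = Double.POSITIVE_INFINITY;
                for (int j = 0; j < k; j++) {
                    double dist = squaredDistance(data[i], centroids[j]);
                    if (dist < best) {
                        best = dist;
                        y[i] = j;
                    }
                }
                newDistortion += best;
            }

            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] size = new int[k];
            for (int i = 0; i < n; i++) {
                size[y[i]]++;
                for (int t = 0; t < dim; t++) {
                    sums[y[i]][t] += data[i][t];
                }
            }
            for (int j = 0; j < k; j++) {
                if (size[j] > 0) { // leave an empty cluster's centroid unchanged
                    for (int t = 0; t < dim; t++) {
                        centroids[j][t] = sums[j][t] / size[j];
                    }
                }
            }

            // Assumed convergence test: stop when the improvement is at most tol.
            boolean converged = distortion - newDistortion <= tol;
            distortion = newDistortion;
            if (converged) {
                break;
            }
        }
        return new LloydSketch(centroids, y, distortion);
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        LloydSketch model = lloyd(data, 2, 100, 1E-4);
        System.out.println(Arrays.toString(model.y));
        System.out.println(model.distortion);
    }
}
```

In this sketch a larger tol stops earlier with a rougher solution, while maxIter bounds the work when the distortion keeps improving slowly.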
Inherited methods: compareTo, predict, toString, run, seed

public KMeans(double distortion, double[][] centroids, int[] y)
distortion - the total distortion.
centroids - the centroids of each cluster.
y - the cluster labels.

public KMeans(double distortion, double[][] centroids, int[] y, java.util.function.ToDoubleBiFunction<double[],double[]> distance)
distortion - the total distortion.
centroids - the centroids of each cluster.
y - the cluster labels.
distance - the lambda of distance measure.

public static KMeans fit(double[][] data, int k)
data - the input data of which each row is an observation.
k - the number of clusters.

public static KMeans fit(double[][] data, int k, int maxIter, double tol)
data - the input data of which each row is an observation.
k - the number of clusters.
maxIter - the maximum number of iterations.
tol - the tolerance of convergence test.

public static KMeans fit(BBDTree bbd, double[][] data, int k, int maxIter, double tol)
bbd - the BBD-tree of data for fast clustering.
data - the input data of which each row is an observation.
k - the number of clusters.
maxIter - the maximum number of iterations.
tol - the tolerance of convergence test.

public static KMeans lloyd(double[][] data, int k)
data - the input data of which each row is an observation.
k - the number of clusters.

public static KMeans lloyd(double[][] data, int k, int maxIter, double tol)
data - the input data of which each row is an observation.
k - the number of clusters.
maxIter - the maximum number of iterations.
tol - the tolerance of convergence test.
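
The constructor parameters documented above (distortion, centroids, y, distance) can be related to one another with a short sketch. The class DistortionSketch is hypothetical, and the sketch assumes that "total distortion" means the sum of distances from each observation to the centroid of its assigned cluster, with squared Euclidean distance as the measure. The distance is written as a java.util.function.ToDoubleBiFunction lambda, the same functional type the four-argument constructor accepts.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical illustration class; not part of the library documented here.
class DistortionSketch {
    /** Squared Euclidean distance expressed as a lambda of the documented functional type. */
    static final ToDoubleBiFunction<double[], double[]> SQUARED_EUCLIDEAN = (a, b) -> {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    };

    /**
     * Assumed definition of total distortion: the sum, over all observations,
     * of the distance to the centroid of the cluster given by the label array y.
     */
    static double distortion(double[][] data, double[][] centroids, int[] y,
                             ToDoubleBiFunction<double[], double[]> distance) {
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            total += distance.applyAsDouble(data[i], centroids[y[i]]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centroids = {{0, 0.5}, {10, 10.5}};
        int[] y = {0, 0, 1, 1};
        System.out.println(distortion(data, centroids, y, SQUARED_EUCLIDEAN));
    }
}
```

Passing the distance as a lambda keeps the distortion computation independent of the particular measure, which mirrors why the constructor takes it as a separate argument.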