public class KMeans extends CentroidClustering<double[],double[]>
K-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model-based clustering, and Lloyd's algorithm as a variation of the EM algorithm.
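The link between the Voronoi partition and Lloyd's algorithm can be illustrated with a minimal plain-Java sketch. It is not this class's implementation: the class name `LloydSketch`, the hand-picked initial centroids, the squared Euclidean distance, and the stopping rule (stop when the distortion improves by no more than `tol`) are all assumptions made for illustration.

```java
import java.util.Arrays;

class LloydSketch {
    /** Squared Euclidean distance between two points of equal dimension (assumed metric). */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Runs Lloyd's iterations from the given initial centroids, which are
     * updated in place, and returns the cluster label of each row of data.
     */
    static int[] lloyd(double[][] data, double[][] centroids, int maxIter, double tol) {
        int n = data.length, k = centroids.length, d = data[0].length;
        int[] y = new int[n];
        double previous = Double.MAX_VALUE;

        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: each point joins its nearest centroid (its Voronoi cell).
            double distortion = 0.0;
            for (int i = 0; i < n; i++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = squaredDistance(data[i], centroids[c]);
                    if (dist < best) {
                        best = dist;
                        y[i] = c;
                    }
                }
                distortion += best;
            }

            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][d];
            int[] size = new int[k];
            for (int i = 0; i < n; i++) {
                size[y[i]]++;
                for (int j = 0; j < d; j++) sums[y[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (size[c] == 0) continue; // leave the centroid of an empty cluster unchanged
                for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / size[c];
            }

            // Assumed stopping rule: quit once the distortion improves by at most tol.
            if (previous - distortion <= tol) break;
            previous = distortion;
        }
        return y;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centroids = {{0, 0}, {10, 10}}; // hand-picked start, for illustration only
        int[] y = lloyd(data, centroids, 100, 1E-4);
        System.out.println(Arrays.toString(y));
        System.out.println(Arrays.deepToString(centroids));
    }
}
```

The assignment step plays the role of the E-step in EM and the update step the role of the M-step, which is the sense in which Lloyd's algorithm can be read as an EM variant.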
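However, the k-means algorithm has at least two major theoretical shortcomings: its worst-case running time is super-polynomial in the input size, and the solution it finds can be arbitrarily bad with respect to the objective function compared to the optimal clustering.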
We also use k-d trees to speed up each k-means step as described in the filter algorithm by Kanungo, et al.
K-means is a hard clustering method, i.e. each observation is assigned to exactly one cluster. In contrast, a soft clustering method, e.g. the Expectation-Maximization algorithm for Gaussian mixtures, assigns each observation to several clusters with different probabilities.
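The difference between hard and soft assignment can be sketched in a few lines of plain Java. This is an illustration only: the class `AssignmentSketch`, its method names, and the `variance` parameter are hypothetical, and the soft version assumes equally weighted spherical Gaussians with a shared variance, which is far simpler than a full Gaussian mixture fitted by EM.

```java
import java.util.Arrays;

class AssignmentSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Hard assignment: the index of the single nearest centroid. */
    static int hardAssign(double[] x, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(x, centroids[c]) < squaredDistance(x, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Soft assignment: membership probabilities under equally weighted
     * spherical Gaussians with a shared variance (a simplifying assumption).
     */
    static double[] softAssign(double[] x, double[][] centroids, double variance) {
        int k = centroids.length;
        double[] logits = new double[k];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < k; c++) {
            logits[c] = -squaredDistance(x, centroids[c]) / (2.0 * variance);
            max = Math.max(max, logits[c]);
        }
        // Subtract the maximum before exponentiating for numerical stability.
        double[] p = new double[k];
        double sum = 0.0;
        for (int c = 0; c < k; c++) {
            p[c] = Math.exp(logits[c] - max);
            sum += p[c];
        }
        for (int c = 0; c < k; c++) p[c] /= sum;
        return p;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {2, 0}};
        double[] x = {0.5, 0};
        System.out.println(hardAssign(x, centroids));
        System.out.println(Arrays.toString(softAssign(x, centroids, 1.0)));
    }
}
```

The hard assignment returns a single label, whereas the soft assignment returns a probability for every cluster; as the assumed variance shrinks toward zero, the soft probabilities concentrate on the nearest centroid.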
Inherited fields: centroids, distortion, k, OUTLIER, size, y
Constructor and Description |
---|
KMeans(double distortion, double[][] centroids, int[] y) Constructor. |
Modifier and Type | Method and Description |
---|---|
double | distance(double[] x, double[] y) The distance function. |
static KMeans | fit(BBDTree bbd, double[][] data, int k, int maxIter, double tol) Partitions data into k clusters. |
static KMeans | fit(double[][] data, int k) Partitions data into k clusters, running at most 100 iterations. |
static KMeans | fit(double[][] data, int k, int maxIter, double tol) Partitions data into k clusters, running at most maxIter iterations. |
static KMeans | lloyd(double[][] data, int k) An implementation of Lloyd's algorithm, provided as a benchmark. |
static KMeans | lloyd(double[][] data, int k, int maxIter, double tol) An implementation of Lloyd's algorithm, provided as a benchmark. |
Inherited methods: compareTo, predict, toString, run, seed
public KMeans(double distortion, double[][] centroids, int[] y)
distortion - the total distortion.
centroids - the centroids of each cluster.
y - the cluster labels.
public double distance(double[] x, double[] y)
The distance function.
See distance in class CentroidClustering<double[],double[]>.
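The relationship between the constructor arguments (distortion, centroids, y) and the distance function can be illustrated with a short sketch. It assumes that the distortion is the sum of squared Euclidean distances from each observation to its assigned centroid, which is the usual k-means objective; the page does not state which distance this class uses, and `DistortionSketch` is a hypothetical helper, not part of the library.

```java
class DistortionSketch {
    /** Squared Euclidean distance (an assumed choice of distance function). */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Total distortion: sum of distances from each row to its assigned centroid. */
    static double distortion(double[][] data, double[][] centroids, int[] y) {
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            total += squaredDistance(data[i], centroids[y[i]]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centroids = {{0, 0.5}, {10, 10.5}};
        int[] y = {0, 0, 1, 1}; // cluster label of each row
        System.out.println(distortion(data, centroids, y));
    }
}
```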
public static KMeans fit(double[][] data, int k)
data - the input data, where each row is an observation.
k - the number of clusters.
public static KMeans fit(double[][] data, int k, int maxIter, double tol)
data - the input data, where each row is an observation.
k - the number of clusters.
maxIter - the maximum number of iterations.
tol - the tolerance of the convergence test.
public static KMeans fit(BBDTree bbd, double[][] data, int k, int maxIter, double tol)
bbd - the BBD-tree of the data, used for fast clustering.
data - the input data, where each row is an observation.
k - the number of clusters.
maxIter - the maximum number of iterations.
tol - the tolerance of the convergence test.
public static KMeans lloyd(double[][] data, int k)
data - the input data, where each row is an observation.
k - the number of clusters.
public static KMeans lloyd(double[][] data, int k, int maxIter, double tol)
data - the input data, where each row is an observation.
k - the number of clusters.
maxIter - the maximum number of iterations.
tol - the tolerance of the convergence test.