KMeans

java.lang.Object
- smile.clustering.PartitionClustering
- - smile.clustering.CentroidClustering<double[],double[]>
  - - smile.clustering.KMeans

All Implemented Interfaces:

java.io.Serializable, java.lang.Comparable<CentroidClustering<double[],double[]>>
```
public class KMeans
extends CentroidClustering<double[],double[]>
```
K-Means clustering. The algorithm partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Although finding an exact solution to the k-means problem for arbitrary input is NP-hard, the standard approach to finding an approximate solution (often called Lloyd's algorithm or the k-means algorithm) is used widely and frequently finds reasonable solutions quickly.
K-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model based clustering, and Lloyd's algorithm as a variation of the EM algorithm.
However, the k-means algorithm has at least two major theoretic shortcomings:
- First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.
- Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal learn. Therefore, it is common to run multiple times with different random initializations.
In this implementation, we use k-means++ which addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive to the optimal k-means solution.
We also use k-d trees to speed up each k-means step as described in the filter algorithm by Kanungo, et al.
K-means is a hard clustering method, i.e. each observation is assigned to a specific cluster. In contrast, soft clustering, e.g. the Expectation-Maximization algorithm for Gaussian mixtures, assign observations to different clusters with different probabilities.
References
1. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE TRANS. PAMI, 2002.
2. D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
3. Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. A systematic evaluation of different methods for initializing the K-means clustering algorithm. 2010.
See Also:

XMeans, GMeans, CLARANS, SIB, SOM, NeuralGas, BBDTree, Serialized Form

Field Summary
- Fields inherited from class smile.clustering.CentroidClustering
  centroids, distortion
- Fields inherited from class smile.clustering.PartitionClustering
  k, OUTLIER, size, y

Constructor Summary

Constructors
Constructor and Description

KMeans(double distortion, double[][] centroids, int[] y)
Constructor.

Constructors
Constructor and Description
`KMeans(double distortion, double[][] centroids, int[] y)` Constructor.

Method Summary

All Methods Static Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`double`	`distance(double[] x, double[] y)` The distance function.
`static KMeans`	`fit(BBDTree bbd, double[][] data, int k, int maxIter, double tol)` Partitions data into k clusters.
`static KMeans`	`fit(double[][] data, int k)` Partitions data into k clusters up to 100 iterations.
`static KMeans`	`fit(double[][] data, int k, int maxIter, double tol)` Partitions data into k clusters up to 100 iterations.
`static KMeans`	`lloyd(double[][] data, int k)` The implementation of Lloyd algorithm as a benchmark.
`static KMeans`	`lloyd(double[][] data, int k, int maxIter, double tol)` The implementation of Lloyd algorithm as a benchmark.

Methods inherited from class smile.clustering.CentroidClustering
compareTo, predict, toString

Methods inherited from class smile.clustering.PartitionClustering
run, seed

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - KMeans
```
public KMeans(double distortion,
              double[][] centroids,
              int[] y)
```
    Constructor.
    
    Parameters:
    
    distortion - the total distortion.
    
    centroids - the centroids of each cluster.
    
    y - the cluster labels.
- Method Detail
  - distance
```
public double distance(double[] x,
                       double[] y)
```
    Description copied from class: CentroidClustering
    
    The distance function.
    
    Specified by:
    
    distance in class CentroidClustering<double[],double[]>
  - fit
```
public static KMeans fit(double[][] data,
                         int k)
```
    Partitions data into k clusters up to 100 iterations.
    
    Parameters:
    
    data - the input data of which each row is an observation.
    
    k - the number of clusters.
  - fit
```
public static KMeans fit(double[][] data,
                         int k,
                         int maxIter,
                         double tol)
```
    Partitions data into k clusters up to 100 iterations.
    
    Parameters:
    
    data - the input data of which each row is an observation.
    
    k - the number of clusters.
    
    maxIter - the maximum number of iterations.
    
    tol - the tolerance of convergence test.
  - fit
```
public static KMeans fit(BBDTree bbd,
                         double[][] data,
                         int k,
                         int maxIter,
                         double tol)
```
    Partitions data into k clusters.
    
    Parameters:
    
    bbd - the BBD-tree of data for fast clustering.
    
    data - the input data of which each row is an observation.
    
    k - the number of clusters.
    
    maxIter - the maximum number of iterations.
    
    tol - the tolerance of convergence test.
  - lloyd
```
public static KMeans lloyd(double[][] data,
                           int k)
```
    The implementation of Lloyd algorithm as a benchmark. The data may contain missing values (i.e. Double.NaN). The algorithm runs up to 100 iterations.
    
    Parameters:
    
    data - the input data of which each row is an observation.
    
    k - the number of clusters.
  - lloyd
```
public static KMeans lloyd(double[][] data,
                           int k,
                           int maxIter,
                           double tol)
```
    The implementation of Lloyd algorithm as a benchmark. The data may contain missing values (i.e. Double.NaN).
    
    Parameters:
    
    data - the input data of which each row is an observation.
    
    k - the number of clusters.
    
    maxIter - the maximum number of iterations.
    
    tol - the tolerance of convergence test.

Class KMeans

References

Field Summary

Fields inherited from class smile.clustering.CentroidClustering

Fields inherited from class smile.clustering.PartitionClustering

Constructor Summary

Method Summary

Methods inherited from class smile.clustering.CentroidClustering

Methods inherited from class smile.clustering.PartitionClustering

Methods inherited from class java.lang.Object

Constructor Detail

KMeans

Method Detail

distance

fit

fit

fit

lloyd

lloyd