DBSCAN

java.lang.Object
- smile.clustering.PartitionClustering
- - smile.clustering.DBSCAN<T>

Type Parameters:

T - the type of input object.

All Implemented Interfaces:

java.io.Serializable
```
public class DBSCAN<T>
extends PartitionClustering
```
Density-Based Spatial Clustering of Applications with Noise. DBSCAN finds a number of clusters starting from the estimated density distribution of corresponding nodes.
DBSCAN requires two parameters: radius (i.e. neighborhood radius) and the number of minimum points required to form a cluster (minPts). It starts with an arbitrary starting point that has not been visited. This point's neighborhood is retrieved, and if it contains sufficient number of points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized radius-environment of a different point and hence be made part of a cluster.
If a point is found to be part of a cluster, its neighborhood is also part of that cluster. Hence, all points that are found within the neighborhood are added, as is their own neighborhood. This process continues until the cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster of noise.
DBSCAN visits each point of the database, possibly multiple times (e.g., as candidates to different clusters). For practical considerations, however, the time complexity is mostly governed by the number of nearest neighbor queries. DBSCAN executes exactly one such query for each point, and if an indexing structure is used that executes such a neighborhood query in O(log n), an overall runtime complexity of O(n log n) is obtained.
DBSCAN has many advantages such as
- DBSCAN does not need to know the number of clusters in the data a priori, as opposed to k-means.
- DBSCAN can find arbitrarily shaped clusters. It can even find clusters completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.
- DBSCAN has a notion of noise. Outliers are labeled as Clustering.OUTLIER, which is Integer.MAX_VALUE.
- DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (Only points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.)
On the other hand, DBSCAN has the disadvantages of
- In high dimensional space, the data are sparse everywhere because of the curse of dimensionality. Therefore, DBSCAN doesn't work well on high-dimensional data in general.
- DBSCAN does not respond well to data sets with varying densities.
References
1. Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu (1996-). A density-based algorithm for discovering clusters in large spatial databases with noise". KDD, 1996.
2. Jorg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu. (1998). Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. 1998.
See Also:

Serialized Form

Field Summary

Fields
Modifier and Type Field and Description

double minPts
The minimum number of points required to form a cluster

double radius
The neighborhood radius.
- Fields inherited from class smile.clustering.PartitionClustering
  k, OUTLIER, size, y

Fields
Modifier and Type	Field and Description
`double`	`minPts` The minimum number of points required to form a cluster
`double`	`radius` The neighborhood radius.

Constructor Summary

Constructors
Constructor and Description

DBSCAN(int minPts, double radius, RNNSearch<T,T> nns, int k, int[] y)
Constructor.

Constructors
Constructor and Description
`DBSCAN(int minPts, double radius, RNNSearch<T,T> nns, int k, int[] y)` Constructor.

Method Summary

All Methods Static Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`static DBSCAN<double[]>`	`fit(double[][] data, int minPts, double radius)` Clustering the data with KD-tree.
`static <T> DBSCAN<T>`	`fit(T[] data, smile.math.distance.Distance<T> distance, int minPts, double radius)` Clustering the data.
`static <T> DBSCAN<T>`	`fit(T[] data, RNNSearch<T,T> nns, int minPts, double radius)` Clustering the data.
`int`	`predict(T x)` Classifies a new observation.

Methods inherited from class smile.clustering.PartitionClustering
run, seed, toString

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Field Detail
  - minPts
```
public final double minPts
```
    The minimum number of points required to form a cluster
  - radius
```
public final double radius
```
    The neighborhood radius.
- Constructor Detail
  - DBSCAN
```
public DBSCAN(int minPts,
              double radius,
              RNNSearch<T,T> nns,
              int k,
              int[] y)
```
    Constructor.
    
    Parameters:
    
    minPts - the minimum number of neighbors for a core data point.
    
    radius - the neighborhood radius.
    
    nns - the data structure for neighborhood search.
    
    k - the number of clusters.
    
    y - the cluster labels.
- Method Detail
  - fit
```
public static DBSCAN<double[]> fit(double[][] data,
                                   int minPts,
                                   double radius)
```
    Clustering the data with KD-tree. DBSCAN is generally applied on low-dimensional data. Therefore, KD-tree can speed up the nearest neighbor search a lot.
    
    Parameters:
    
    data - the observations.
    
    minPts - the minimum number of neighbors for a core data point.
    
    radius - the neighborhood radius.
  - fit
```
public static <T> DBSCAN<T> fit(T[] data,
                                smile.math.distance.Distance<T> distance,
                                int minPts,
                                double radius)
```
    Clustering the data.
    
    Parameters:
    
    data - the observations.
    
    distance - the distance measure for neighborhood search.
    
    minPts - the minimum number of neighbors for a core data point.
    
    radius - the neighborhood radius.
  - fit
```
public static <T> DBSCAN<T> fit(T[] data,
                                RNNSearch<T,T> nns,
                                int minPts,
                                double radius)
```
    Clustering the data.
    
    Parameters:
    
    data - the observations.
    
    nns - the data structure for neighborhood search.
    
    minPts - the minimum number of neighbors for a core data point.
    
    radius - the neighborhood radius.
  - predict
```
public int predict(T x)
```
    Classifies a new observation.
    
    Parameters:
    
    x - a new observation.
    
    Returns:
    
    the cluster label. Note that it may be PartitionClustering.OUTLIER.

Class DBSCAN<T>

References

Field Summary

Fields inherited from class smile.clustering.PartitionClustering

Constructor Summary

Method Summary

Methods inherited from class smile.clustering.PartitionClustering

Methods inherited from class java.lang.Object

Field Detail

minPts

radius

Constructor Detail

DBSCAN

Method Detail

fit

fit

fit

predict