public class RandomForest extends java.lang.Object implements SoftClassifier<double[]>
Random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. Each tree is constructed using the following algorithm:

1. If the number of cases in the training set is N, randomly sample N cases with replacement from the original data. This sample is the training set for growing the tree.
2. If there are M input variables, a number mtry << M is specified such that at each node, mtry variables are selected at random out of the M, and the best split on these mtry variables is used to split the node. The value of mtry is held constant while the forest is grown.
3. Each tree is grown to the largest extent possible. There is no pruning.
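A minimal end-to-end sketch of this API. The package name smile.classification and the toy data are assumptions for illustration, not part of this page:

```java
import smile.classification.RandomForest;

public class RandomForestExample {
    public static void main(String[] args) {
        // Toy training set: four 2-dimensional instances with binary labels.
        double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] y = {0, 0, 1, 1};

        // Grow a forest of 100 trees with the simplest constructor.
        RandomForest forest = new RandomForest(x, y, 100);

        // The out-of-bag samples give a free estimate of the error rate.
        System.out.println("OOB error: " + forest.error());

        // Classify a new instance.
        int label = forest.predict(new double[]{0.9, 0.1});
        System.out.println("predicted label: " + label);
    }
}
```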
Nested Class Summary

Modifier and Type | Class and Description
---|---
static class | RandomForest.Trainer: Trainer for random forest classifiers.
Constructor Summary

Constructor and Description
---
RandomForest(smile.data.Attribute[] attributes, double[][] x, int[] y, int ntrees)
RandomForest(smile.data.Attribute[] attributes, double[][] x, int[] y, int ntrees, int mtry)
RandomForest(smile.data.Attribute[] attributes, double[][] x, int[] y, int ntrees, int maxNodes, int nodeSize, int mtry, double subsample)
RandomForest(smile.data.Attribute[] attributes, double[][] x, int[] y, int ntrees, int maxNodes, int nodeSize, int mtry, double subsample, DecisionTree.SplitRule rule)
RandomForest(smile.data.Attribute[] attributes, double[][] x, int[] y, int ntrees, int maxNodes, int nodeSize, int mtry, double subsample, DecisionTree.SplitRule rule, int[] classWeight)
RandomForest(smile.data.AttributeDataset data, int ntrees)
RandomForest(smile.data.AttributeDataset data, int ntrees, int mtry)
RandomForest(smile.data.AttributeDataset data, int ntrees, int maxNodes, int nodeSize, int mtry, double subsample, DecisionTree.SplitRule rule)
RandomForest(smile.data.AttributeDataset data, int ntrees, int maxNodes, int nodeSize, int mtry, double subsample, DecisionTree.SplitRule rule, int[] classWeight)
RandomForest(double[][] x, int[] y, int ntrees)
RandomForest(double[][] x, int[] y, int ntrees, int mtry)

Parameters for each constructor are documented under Constructor Detail below.
Method Summary

Modifier and Type | Method and Description
---|---
double | error(): Returns the out-of-bag (OOB) estimate of the error rate.
DecisionTree[] | getTrees(): Returns the decision trees.
double[] | importance(): Returns the variable importance.
int | predict(double[] x): Predicts the class label of an instance.
int | predict(double[] x, double[] posteriori): Predicts the class label of an instance and also calculates a posteriori probabilities.
int | size(): Returns the number of trees in the model.
double[] | test(double[][] x, int[] y): Tests the model on a validation dataset.
double[][] | test(double[][] x, int[] y, ClassificationMeasure[] measures): Tests the model on a validation dataset.
void | trim(int ntrees): Trims the tree model set to a smaller size in case of over-fitting.
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface Classifier: predict
Constructor Detail

public RandomForest(double[][] x, int[] y, int ntrees)
Constructor.
Parameters:
x - the training instances.
y - the response variable.
ntrees - the number of trees.

public RandomForest(double[][] x, int[] y, int ntrees, int mtry)
Constructor.
Parameters:
x - the training instances.
y - the response variable.
ntrees - the number of trees.
mtry - the number of randomly selected features used to determine the decision at a node of the tree. floor(sqrt(dim)) seems to give generally good performance, where dim is the number of variables.

public RandomForest(smile.data.Attribute[] attributes, double[][] x, int[] y, int ntrees)
Constructor.
Parameters:
attributes - the attribute properties.
x - the training instances.
y - the response variable.
ntrees - the number of trees.

public RandomForest(smile.data.AttributeDataset data, int ntrees)
Constructor.
Parameters:
data - the dataset.
ntrees - the number of trees.
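A sketch of the mtry heuristic described in the parameter docs above, with x and y as in the example after the class description:

```java
int dim = x[0].length;                        // number of input variables
int mtry = (int) Math.floor(Math.sqrt(dim));  // floor(sqrt(dim)) heuristic
RandomForest forest = new RandomForest(x, y, 500, mtry);
```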
public RandomForest(smile.data.Attribute[] attributes, double[][] x, int[] y, int ntrees, int mtry)
Constructor.
Parameters:
attributes - the attribute properties.
x - the training instances.
y - the response variable.
ntrees - the number of trees.
mtry - the number of randomly selected features used to determine the decision at a node of the tree. floor(sqrt(dim)) seems to give generally good performance, where dim is the number of variables.

public RandomForest(smile.data.AttributeDataset data, int ntrees, int mtry)
Constructor.
Parameters:
data - the dataset.
ntrees - the number of trees.
mtry - the number of randomly selected features used to determine the decision at a node of the tree. floor(sqrt(dim)) seems to give generally good performance, where dim is the number of variables.

public RandomForest(smile.data.Attribute[] attributes, double[][] x, int[] y, int ntrees, int maxNodes, int nodeSize, int mtry, double subsample)
Constructor.
Parameters:
attributes - the attribute properties.
x - the training instances.
y - the response variable.
ntrees - the number of trees.
maxNodes - the maximum number of leaf nodes in the tree.
nodeSize - the minimum size of leaf nodes.
mtry - the number of randomly selected features used to determine the decision at a node of the tree. floor(sqrt(dim)) seems to give generally good performance, where dim is the number of variables.
subsample - the sampling rate for training each tree. 1.0 means sampling with replacement; < 1.0 means sampling without replacement.

public RandomForest(smile.data.AttributeDataset data, int ntrees, int maxNodes, int nodeSize, int mtry, double subsample, DecisionTree.SplitRule rule)
Constructor.
Parameters:
data - the dataset.
ntrees - the number of trees.
maxNodes - the maximum number of leaf nodes in the tree.
nodeSize - the minimum size of leaf nodes.
mtry - the number of randomly selected features used to determine the decision at a node of the tree. floor(sqrt(dim)) seems to give generally good performance, where dim is the number of variables.
subsample - the sampling rate for training each tree. 1.0 means sampling with replacement; < 1.0 means sampling without replacement.
rule - the decision tree split rule.

public RandomForest(smile.data.Attribute[] attributes, double[][] x, int[] y, int ntrees, int maxNodes, int nodeSize, int mtry, double subsample, DecisionTree.SplitRule rule)
Constructor.
Parameters:
attributes - the attribute properties.
x - the training instances.
y - the response variable.
ntrees - the number of trees.
maxNodes - the maximum number of leaf nodes in the tree.
nodeSize - the minimum size of leaf nodes.
mtry - the number of randomly selected features used to determine the decision at a node of the tree. floor(sqrt(dim)) seems to give generally good performance, where dim is the number of variables.
subsample - the sampling rate for training each tree. 1.0 means sampling with replacement; < 1.0 means sampling without replacement.
rule - the decision tree split rule.

public RandomForest(smile.data.AttributeDataset data, int ntrees, int maxNodes, int nodeSize, int mtry, double subsample, DecisionTree.SplitRule rule, int[] classWeight)
Constructor.
Parameters:
data - the dataset.
ntrees - the number of trees.
maxNodes - the maximum number of leaf nodes in the tree.
nodeSize - the minimum size of leaf nodes.
mtry - the number of randomly selected features used to determine the decision at a node of the tree. floor(sqrt(dim)) seems to give generally good performance, where dim is the number of variables.
subsample - the sampling rate for training each tree. 1.0 means sampling with replacement; < 1.0 means sampling without replacement.
rule - the decision tree split rule.
classWeight - priors of the classes. The weight of each class is roughly the ratio of samples in each class. For example, if there are 400 positive samples and 100 negative samples, classWeight should be [1, 4] (assuming label 0 is negative and label 1 is positive).
public RandomForest(smile.data.Attribute[] attributes, double[][] x, int[] y, int ntrees, int maxNodes, int nodeSize, int mtry, double subsample, DecisionTree.SplitRule rule, int[] classWeight)
Constructor.
Parameters:
attributes - the attribute properties.
x - the training instances.
y - the response variable.
ntrees - the number of trees.
maxNodes - the maximum number of leaf nodes in the tree.
nodeSize - the minimum size of leaf nodes.
mtry - the number of randomly selected features used to determine the decision at a node of the tree. floor(sqrt(dim)) seems to give generally good performance, where dim is the number of variables.
subsample - the sampling rate for training each tree. 1.0 means sampling with replacement; < 1.0 means sampling without replacement.
rule - the decision tree split rule.
classWeight - priors of the classes. The weight of each class is roughly the ratio of samples in each class. For example, if there are 400 positive samples and 100 negative samples, classWeight should be [1, 4] (assuming label 0 is negative and label 1 is positive).
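A sketch of the fully parameterized constructor. Here attributes, x, and y are assumed to be defined already, and DecisionTree.SplitRule.GINI is assumed to be one of the enum's constants (not confirmed on this page):

```java
// 400 positives (label 1) vs. 100 negatives (label 0), so classWeight = {1, 4}
// per the classWeight documentation above.
int[] classWeight = {1, 4};

RandomForest forest = new RandomForest(
        attributes,                     // attribute properties (assumed defined)
        x, y,                           // training instances and labels
        500,                            // ntrees
        100,                            // maxNodes: cap on leaf nodes per tree
        5,                              // nodeSize: minimum instances per leaf
        (int) Math.sqrt(x[0].length),   // mtry heuristic from the docs above
        1.0,                            // subsample = 1.0: bootstrap with replacement
        DecisionTree.SplitRule.GINI,    // split rule (assumed enum constant)
        classWeight);
```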
Method Detail

public double error()
Returns the out-of-bag (OOB) estimate of the error rate.

public double[] importance()
Returns the variable importance.

public int size()
Returns the number of trees in the model.

public void trim(int ntrees)
Trims the tree model set to a smaller size in case of over-fitting.
Parameters:
ntrees - the new (smaller) size of the tree model set.
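Continuing the sketch above, the bookkeeping methods can be used to inspect and shrink a trained forest:

```java
// Inspect the forest as a whole.
System.out.println("trees: " + forest.size());
System.out.println("OOB error: " + forest.error());

// Per-variable importance scores, one entry per input variable.
double[] imp = forest.importance();
for (int i = 0; i < imp.length; i++) {
    System.out.printf("variable %d importance: %.4f%n", i, imp[i]);
}

// Keep only the first 50 trees if the full ensemble over-fits.
forest.trim(50);
```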
public int predict(double[] x)
Predicts the class label of an instance.
Specified by:
predict in interface Classifier<double[]>
Parameters:
x - the instance to be classified.
public int predict(double[] x, double[] posteriori)
Predicts the class label of an instance and also calculates a posteriori probabilities.
Specified by:
predict in interface SoftClassifier<double[]>
Parameters:
x - the instance to be classified.
posteriori - the array to store a posteriori probabilities on output.
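A sketch of the soft-classification call; the class count of 2 is an assumption for illustration:

```java
// posteriori must have one slot per class (2 here, hypothetically); the
// method fills it with class probabilities and returns the winning label.
double[] posteriori = new double[2];
int label = forest.predict(new double[]{0.9, 0.1}, posteriori);
System.out.println("label " + label + ", P = " + posteriori[label]);
```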
public double[] test(double[][] x, int[] y)
Tests the model on a validation dataset.
Parameters:
x - the test data set.
y - the test data response values.
public double[][] test(double[][] x, int[] y, ClassificationMeasure[] measures)
Tests the model on a validation dataset.
Parameters:
x - the test data set.
y - the test data labels.
measures - the performance measures of classification.

public DecisionTree[] getTrees()
Returns the decision trees.