RandomForest

java.lang.Object
- smile.regression.RandomForest

All Implemented Interfaces:

java.io.Serializable, java.util.function.ToDoubleFunction<smile.data.Tuple>, DataFrameRegression, Regression<smile.data.Tuple>
```
public class RandomForest
extends java.lang.Object
implements Regression<smile.data.Tuple>, DataFrameRegression
```
Random forest for regression. A random forest is an ensemble method that consists of many regression trees and outputs the average of the individual trees' predictions. The method combines bagging with the random selection of features.
Each tree is constructed using the following algorithm:
1. If the number of cases in the training set is N, randomly sample N cases with replacement from the original data. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
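The two random choices in steps 1 and 2 can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the class `TreeSamplingSketch` and its method names are hypothetical, and the partial Fisher-Yates shuffle is just one common way to draw m distinct variables.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative only: the two sources of randomness used when growing one tree. */
final class TreeSamplingSketch {
    /** Step 1: draw n row indices with replacement (a bootstrap sample). */
    static int[] bootstrap(int n, Random rng) {
        int[] rows = new int[n];
        for (int i = 0; i < n; i++) {
            rows[i] = rng.nextInt(n);
        }
        return rows;
    }

    /** Step 2: pick m distinct variable indices out of total, via a partial Fisher-Yates shuffle. */
    static int[] randomFeatures(int total, int m, Random rng) {
        if (m < 1 || m > total) {
            throw new IllegalArgumentException("m must be in [1, total]: " + m);
        }
        int[] vars = new int[total];
        for (int i = 0; i < total; i++) vars[i] = i;
        for (int i = 0; i < m; i++) {
            int j = i + rng.nextInt(total - i);
            int tmp = vars[i];
            vars[i] = vars[j];
            vars[j] = tmp;
        }
        return Arrays.copyOf(vars, m);
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        System.out.println(Arrays.toString(bootstrap(10, rng)));
        System.out.println(Arrays.toString(randomFeatures(9, 3, rng)));
    }
}
```

In a real forest, `randomFeatures` would be called again at every node, while `bootstrap` is called once per tree.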
The advantages of random forest are:
- For many data sets, it produces a highly accurate model.
- It runs efficiently on large data sets.
- It can handle thousands of input variables without variable deletion.
- It gives estimates of which variables are important in the regression.
- It generates an internal unbiased estimate of the generalization error as the forest building progresses.
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
The disadvantages are:
- Random forests are prone to over-fitting for some datasets. This is even more pronounced in noisy classification/regression tasks.
- For data that include categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels. Therefore, the variable importance scores from a random forest are not reliable for this type of data.
See Also:

Serialized Form

Constructor Summary

Constructors
Constructor and Description
`RandomForest(smile.data.formula.Formula formula, RegressionTree[] trees, double error, double[] importance)` Constructor.

Method Summary

Modifier and Type	Method and Description
`double`	`error()` Returns the out-of-bag estimation of RMSE.
`static RandomForest`	`fit(smile.data.formula.Formula formula, smile.data.DataFrame data)` Learns a random forest for regression.
`static RandomForest`	`fit(smile.data.formula.Formula formula, smile.data.DataFrame data, int ntrees, int mtry, int maxDepth, int maxNodes, int nodeSize, double subsample)` Learns a random forest for regression.
`static RandomForest`	`fit(smile.data.formula.Formula formula, smile.data.DataFrame data, int ntrees, int mtry, int maxDepth, int maxNodes, int nodeSize, double subsample, java.util.function.LongSupplier seedGenerator)` Learns a random forest for regression.
`static RandomForest`	`fit(smile.data.formula.Formula formula, smile.data.DataFrame data, int ntrees, int mtry, int maxDepth, int maxNodes, int nodeSize, double subsample, java.util.Optional<java.util.stream.LongStream> seeds)` Learns a random forest for regression.
`static RandomForest`	`fit(smile.data.formula.Formula formula, smile.data.DataFrame data, java.util.Properties prop)` Learns a random forest for regression.
`smile.data.formula.Formula`	`formula()` Returns the formula associated with the model.
`double[]`	`importance()` Returns the variable importance.
`RandomForest`	`merge(RandomForest other)` Merges together two random forests and returns a new forest consisting of trees from both input forests.
`double`	`predict(smile.data.Tuple x)` Predicts the dependent variable of an instance.
`smile.data.type.StructType`	`schema()` Returns the design matrix schema.
`int`	`size()` Returns the number of trees in the model.
`double[][]`	`test(smile.data.DataFrame data)` Tests the model on a validation dataset.
`RegressionTree[]`	`trees()` Returns the regression trees.
`void`	`trim(int ntrees)` Trims the tree model set to a smaller size in case of over-fitting.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface smile.regression.Regression
applyAsDouble, predict

Methods inherited from interface smile.regression.DataFrameRegression
predict

- Constructor Detail
  - RandomForest
```
public RandomForest(smile.data.formula.Formula formula,
                    RegressionTree[] trees,
                    double error,
                    double[] importance)
```
    Constructor.
    
    Parameters:
    
    formula - a symbolic description of the model to be fitted.
    
    trees - forest of regression trees.
    
    error - out-of-bag estimation of RMSE
    
    importance - variable importance
- Method Detail
  - fit
```
public static RandomForest fit(smile.data.formula.Formula formula,
                               smile.data.DataFrame data)
```
    Learns a random forest for regression.
    
    Parameters:
    
    formula - a symbolic description of the model to be fitted.
    
    data - the data frame of the explanatory and response variables.
  - fit
```
public static RandomForest fit(smile.data.formula.Formula formula,
                               smile.data.DataFrame data,
                               java.util.Properties prop)
```
    Learns a random forest for regression.
    
    Parameters:
    
    formula - a symbolic description of the model to be fitted.
    
    data - the data frame of the explanatory and response variables.
    
    prop - the hyperparameters of the model.
  - fit
```
public static RandomForest fit(smile.data.formula.Formula formula,
                               smile.data.DataFrame data,
                               int ntrees,
                               int mtry,
                               int maxDepth,
                               int maxNodes,
                               int nodeSize,
                               double subsample)
```
    Learns a random forest for regression.
    
    Parameters:
    
    formula - a symbolic description of the model to be fitted.
    
    data - the data frame of the explanatory and response variables.
    
    ntrees - the number of trees.
    
    mtry - the number of input variables to be used to determine the decision at a node of the tree. p/3 seems to give generally good performance, where p is the number of variables.
    
    maxDepth - the maximum depth of the tree.
    
    maxNodes - the maximum number of leaf nodes in the tree.
    
    nodeSize - the number of instances in a node below which the tree will not split. Setting nodeSize = 5 generally gives good results.
    
    subsample - the sampling rate for training each tree. A value of 1.0 means sampling with replacement; a value below 1.0 means sampling without replacement.
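    The two sampling modes selected by `subsample` can be sketched as follows. This is a hedged illustration only: the class `SubsampleSketch` is hypothetical, and the sample size of round(n * subsample) for the without-replacement case is an assumption that this page does not state.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative only: how a sampling rate could select the rows for one tree. */
final class SubsampleSketch {
    /**
     * Returns the row indices used to train one tree.
     * Assumption: for subsample < 1.0 the sample holds round(n * subsample) distinct rows.
     */
    static int[] sample(int n, double subsample, Random rng) {
        if (subsample <= 0.0 || subsample > 1.0) {
            throw new IllegalArgumentException("subsample must be in (0, 1]: " + subsample);
        }
        if (subsample == 1.0) {
            // Sampling with replacement: n draws, duplicates allowed.
            int[] rows = new int[n];
            for (int i = 0; i < n; i++) rows[i] = rng.nextInt(n);
            return rows;
        }
        // Sampling without replacement: shuffle a prefix of the identity permutation.
        int size = (int) Math.round(n * subsample);
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        for (int i = 0; i < size; i++) {
            int j = i + rng.nextInt(n - i);
            int tmp = perm[i];
            perm[i] = perm[j];
            perm[j] = tmp;
        }
        return Arrays.copyOf(perm, size);
    }
}
```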
  - fit
```
public static RandomForest fit(smile.data.formula.Formula formula,
                               smile.data.DataFrame data,
                               int ntrees,
                               int mtry,
                               int maxDepth,
                               int maxNodes,
                               int nodeSize,
                               double subsample,
                               java.util.function.LongSupplier seedGenerator)
```
    Learns a random forest for regression.
    
    Parameters:
    
    formula - a symbolic description of the model to be fitted.
    
    data - the data frame of the explanatory and response variables.
    
    ntrees - the number of trees.
    
    mtry - the number of input variables to be used to determine the decision at a node of the tree. p/3 seems to give generally good performance, where p is the number of variables.
    
    maxDepth - the maximum depth of the tree.
    
    maxNodes - the maximum number of leaf nodes in the tree.
    
    nodeSize - the number of instances in a node below which the tree will not split. Setting nodeSize = 5 generally gives good results.
    
    subsample - the sampling rate for training each tree. A value of 1.0 means sampling with replacement; a value below 1.0 means sampling without replacement.
    
    seedGenerator - RNG seed generator.
  - fit
```
public static RandomForest fit(smile.data.formula.Formula formula,
                               smile.data.DataFrame data,
                               int ntrees,
                               int mtry,
                               int maxDepth,
                               int maxNodes,
                               int nodeSize,
                               double subsample,
                               java.util.Optional<java.util.stream.LongStream> seeds)
```
    Learns a random forest for regression.
    
    Parameters:
    
    formula - a symbolic description of the model to be fitted.
    
    data - the data frame of the explanatory and response variables.
    
    ntrees - the number of trees.
    
    mtry - the number of input variables to be used to determine the decision at a node of the tree. p/3 seems to give generally good performance, where p is the number of variables.
    
    maxDepth - the maximum depth of the tree.
    
    maxNodes - the maximum number of leaf nodes in the tree.
    
    nodeSize - the number of instances in a node below which the tree will not split. Setting nodeSize = 5 generally gives good results.
    
    subsample - the sampling rate for training each tree. A value of 1.0 means sampling with replacement; a value below 1.0 means sampling without replacement.
    
    seeds - optional RNG seeds for each regression tree.
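    Values of the two seed argument types, `java.util.function.LongSupplier` and `java.util.Optional<java.util.stream.LongStream>`, can be built from the standard library. The sketch below shows one way to make them reproducible from a single master seed; the class `SeedSketch` and its method names are hypothetical, and how the library consumes the seeds is not described on this page.

```java
import java.util.Optional;
import java.util.Random;
import java.util.function.LongSupplier;
import java.util.stream.LongStream;

/** Illustrative only: reproducible seed sources derived from one master seed. */
final class SeedSketch {
    /** A generator that yields the same sequence of seeds for the same master seed. */
    static LongSupplier seedGenerator(long masterSeed) {
        Random master = new Random(masterSeed);
        return master::nextLong;
    }

    /** An explicit, finite stream of seeds, one per tree. */
    static Optional<LongStream> seeds(long masterSeed, int ntrees) {
        return Optional.of(new Random(masterSeed).longs(ntrees));
    }
}
```

    Passing `Optional.empty()` for `seeds` is also type-correct; what the library does in that case is not stated here.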
  - merge
```
public RandomForest merge(RandomForest other)
```
    Merges together two random forests and returns a new forest consisting of trees from both input forests.
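    Conceptually, merging concatenates the two tree arrays. Because a forest outputs the mean over its trees, the merged forest's prediction is the size-weighted mean of the two input forests' predictions. The sketch below uses the hypothetical class `MergeSketch` with a generic array in place of `RegressionTree[]`; how the merged out-of-bag error and importance values are combined is not described on this page and is left out.

```java
import java.util.Arrays;

/** Illustrative only: merging two ensembles by concatenating their members. */
final class MergeSketch {
    /** Returns a new array holding the members of a followed by the members of b. */
    static <T> T[] concat(T[] a, T[] b) {
        T[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    /** Mean prediction of the merged ensemble, given each input's mean prediction and size. */
    static double mergedPrediction(double meanA, int sizeA, double meanB, int sizeB) {
        return (meanA * sizeA + meanB * sizeB) / (sizeA + sizeB);
    }
}
```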
  - formula
```
public smile.data.formula.Formula formula()
```
    Description copied from interface: DataFrameRegression
    
    Returns the formula associated with the model.
    
    Specified by:
    
    formula in interface DataFrameRegression
  - schema
```
public smile.data.type.StructType schema()
```
    Description copied from interface: DataFrameRegression
    
    Returns the design matrix schema.
    
    Specified by:
    
    schema in interface DataFrameRegression
  - error
```
public double error()
```
    Returns the out-of-bag estimation of RMSE. The OOB estimate is quite accurate provided that enough trees have been grown; otherwise it can be biased upward.
    
    Returns:
    
    the out-of-bag estimation of RMSE
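    An out-of-bag RMSE can be computed by predicting each training sample only with the trees that did not see it during training. The sketch below is a generic illustration under that definition, not the library's code; the class `OobSketch` and the array layouts `pred[t][i]` and `inBag[t][i]` are assumptions made for the example.

```java
/** Illustrative only: out-of-bag RMSE from per-tree predictions and in-bag flags. */
final class OobSketch {
    /**
     * @param y     observed responses, one per sample
     * @param pred  pred[t][i] is tree t's prediction for sample i
     * @param inBag inBag[t][i] is true if sample i was in tree t's training sample
     * @return RMSE over the samples that are out-of-bag for at least one tree,
     *         or NaN if there is no such sample
     */
    static double oobRmse(double[] y, double[][] pred, boolean[][] inBag) {
        double sse = 0.0;
        int counted = 0;
        for (int i = 0; i < y.length; i++) {
            double sum = 0.0;
            int votes = 0;
            for (int t = 0; t < pred.length; t++) {
                if (!inBag[t][i]) {
                    sum += pred[t][i];
                    votes++;
                }
            }
            if (votes > 0) {
                double residual = y[i] - sum / votes;
                sse += residual * residual;
                counted++;
            }
        }
        return counted == 0 ? Double.NaN : Math.sqrt(sse / counted);
    }
}
```

    With few trees, each sample is averaged over very few out-of-bag predictions, which is consistent with the note above that the estimate needs enough trees to be accurate.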
  - importance
```
public double[] importance()
```
    Returns the variable importance. Every time a node is split on a variable, the impurity criterion of the two descendant nodes is less than that of the parent node. Adding up the decreases for each individual variable over all trees in the forest gives a fast measure of variable importance that is often very consistent with the permutation importance measure.
    
    Returns:
    
    the variable importance
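    The accumulation described above can be sketched as follows. The impurity criterion is assumed here to be the sum of squared deviations from the node mean, which is a common choice for regression trees but is not confirmed by this page; the class `ImportanceSketch` and its methods are hypothetical.

```java
/** Illustrative only: impurity-decrease importance with squared error as the impurity. */
final class ImportanceSketch {
    /** Sum of squared deviations of y from its mean (assumed impurity of a node). */
    static double sse(double[] y) {
        if (y.length == 0) return 0.0;
        double mean = 0.0;
        for (double v : y) mean += v;
        mean /= y.length;
        double sum = 0.0;
        for (double v : y) sum += (v - mean) * (v - mean);
        return sum;
    }

    /** Impurity decrease when a parent node is split into the given left and right children. */
    static double decrease(double[] left, double[] right) {
        double[] parent = new double[left.length + right.length];
        System.arraycopy(left, 0, parent, 0, left.length);
        System.arraycopy(right, 0, parent, left.length, right.length);
        return sse(parent) - sse(left) - sse(right);
    }

    /** Adds the decrease of one split to the running total of the variable it split on. */
    static void record(double[] importance, int variable, double[] left, double[] right) {
        importance[variable] += decrease(left, right);
    }
}
```

    Calling `record` for every split in every tree yields one total per variable, which corresponds to the array shape returned by `importance()`.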
  - size
```
public int size()
```
    Returns the number of trees in the model.
    
    Returns:
    
    the number of trees in the model
  - trees
```
public RegressionTree[] trees()
```
    Returns the regression trees.
  - trim
```
public void trim(int ntrees)
```
    Trims the tree model set to a smaller size in case of over-fitting. Alternatively, if extra decision trees in the model do not improve performance, they can be removed to reduce the model size and speed up prediction.
    
    Parameters:
    
    ntrees - the new (smaller) size of tree model set.
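    A minimal sketch of trimming, assuming that the first `ntrees` members are the ones kept. This page does not say which trees survive a trim, so that choice, the argument checks, and the class `TrimSketch` are all assumptions made for illustration. Unlike `trim(int)`, which returns `void` and modifies the model, the sketch returns a new array.

```java
import java.util.Arrays;

/** Illustrative only: shrinking an ensemble to its first ntrees members. */
final class TrimSketch {
    /** Returns a copy holding the first ntrees members (assumed selection rule). */
    static <T> T[] trim(T[] trees, int ntrees) {
        if (ntrees <= 0 || ntrees > trees.length) {
            throw new IllegalArgumentException(
                "ntrees must be in [1, " + trees.length + "]: " + ntrees);
        }
        return Arrays.copyOf(trees, ntrees);
    }
}
```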
  - predict
```
public double predict(smile.data.Tuple x)
```
    Description copied from interface: Regression
    
    Predicts the dependent variable of an instance.
    
    Specified by:
    
    predict in interface DataFrameRegression
    
    Specified by:
    
    predict in interface Regression<smile.data.Tuple>
    
    Parameters:
    
    x - an instance.
    
    Returns:
    
    the predicted value of dependent variable.
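    Since a random forest for regression outputs the average of its trees, prediction reduces to a mean over per-tree outputs. The sketch below shows this with `ToDoubleFunction<double[]>` standing in for a regression tree; the class `AveragePredictSketch` is hypothetical and does not use the library's `Tuple` or `RegressionTree` types.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Illustrative only: an ensemble prediction as the mean of its members' predictions. */
final class AveragePredictSketch {
    static double predict(List<ToDoubleFunction<double[]>> trees, double[] x) {
        if (trees.isEmpty()) {
            throw new IllegalArgumentException("at least one tree is required");
        }
        double sum = 0.0;
        for (ToDoubleFunction<double[]> tree : trees) {
            sum += tree.applyAsDouble(x);
        }
        return sum / trees.size();
    }
}
```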
  - test
```
public double[][] test(smile.data.DataFrame data)
```
    Tests the model on a validation dataset.
    
    Parameters:
    
    data - the test data set.
    
    Returns:
    
    the predictions made with the first 1, 2, ..., regression trees.
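    The staged predictions can be read as running averages: entry k holds the forest's output when only the first k + 1 trees are used. The sketch below computes them from per-tree predictions. The class `StagedPredictSketch` is hypothetical, and the `[tree][sample]` layout of both arrays is an assumption, since this page does not specify the orientation of the returned `double[][]`.

```java
/** Illustrative only: predictions of the first 1, 2, ..., ntrees members as running means. */
final class StagedPredictSketch {
    /**
     * @param treePred treePred[t][i] is tree t's prediction for sample i
     * @return out[t][i], the mean prediction of trees 0..t for sample i
     */
    static double[][] staged(double[][] treePred) {
        int ntrees = treePred.length;
        int n = ntrees == 0 ? 0 : treePred[0].length;
        double[][] out = new double[ntrees][n];
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int t = 0; t < ntrees; t++) {
                sum += treePred[t][i];
                out[t][i] = sum / (t + 1);
            }
        }
        return out;
    }
}
```

    Comparing each row against the observed responses shows how validation error changes with the number of trees, which can help pick a value for `trim(int ntrees)`.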
