public class DecisionTree extends java.lang.Object implements SoftClassifier<double[]>, java.io.Serializable
The algorithms used for constructing decision trees usually work top-down, choosing at each step the variable that best splits the set of items. "Best" is defined by how well the variable splits the set into homogeneous subsets that share the same value of the target variable. Different algorithms use different formulae for measuring "best". Used by the CART algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset. Gini impurity can be computed by summing, over all items, the probability of each item being chosen times the probability of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category. Information gain is another popular measure, used by the ID3, C4.5 and C5.0 algorithms. Information gain is based on the concept of entropy from information theory. For categorical variables with different numbers of levels, however, information gain is biased in favor of attributes with more levels. Instead, one may employ the information gain ratio, which corrects this bias.
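As a rough illustration (not part of the SMILE API; the helper class and method names here are made up), both impurity measures can be computed directly from the class counts observed at a node:

```java
// Minimal sketch of the two impurity measures discussed above.
public final class ImpurityExample {

    /** Gini impurity: 1 - sum over classes of p^2; zero when one class dominates completely. */
    static double gini(int[] counts) {
        double n = 0;
        for (int c : counts) n += c;
        double sumSq = 0;
        for (int c : counts) {
            double p = c / n;
            sumSq += p * p;
        }
        return 1.0 - sumSq;
    }

    /** Entropy in bits: -sum(p * log2(p)); the quantity behind information gain. */
    static double entropy(int[] counts) {
        double n = 0;
        for (int c : counts) n += c;
        double h = 0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = c / n;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        int[] counts = {40, 10}; // 40 instances of class 0, 10 of class 1 at this node
        System.out.println("Gini    = " + gini(counts));    // 0.32
        System.out.println("Entropy = " + entropy(counts)); // ~0.72 bits
    }
}
```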
Classification and Regression Tree techniques have a number of advantages over many alternative techniques.
Some techniques such as bagging, boosting, and random forest use more than one decision tree for their analysis.
See Also: AdaBoost, GradientTreeBoost, RandomForest, Serialized Form

Modifier and Type | Class and Description |
---|---|
static class | DecisionTree.SplitRule - The criterion to choose variable to split instances. |
static class | DecisionTree.Trainer - Trainer for decision tree classifiers. |
Constructor and Description |
---|
DecisionTree(smile.data.Attribute[] attributes, double[][] x, int[] y, int maxNodes) - Constructor. |
DecisionTree(smile.data.Attribute[] attributes, double[][] x, int[] y, int maxNodes, DecisionTree.SplitRule rule) - Constructor. |
DecisionTree(smile.data.Attribute[] attributes, double[][] x, int[] y, int maxNodes, int nodeSize, DecisionTree.SplitRule rule) - Constructor. |
DecisionTree(smile.data.Attribute[] attributes, double[][] x, int[] y, int maxNodes, int nodeSize, int mtry, DecisionTree.SplitRule rule, int[] samples, int[][] order) - Constructor. |
DecisionTree(double[][] x, int[] y, int maxNodes) - Constructor. |
DecisionTree(double[][] x, int[] y, int maxNodes, DecisionTree.SplitRule rule) - Constructor. |
DecisionTree(double[][] x, int[] y, int maxNodes, int nodeSize, DecisionTree.SplitRule rule) - Constructor. |
Modifier and Type | Method and Description |
---|---|
double[] | importance() - Returns the variable importance. |
int | maxDepth() - Returns the maximum depth of the tree, i.e. the number of nodes along the longest path from the root node down to the farthest leaf node. |
int | predict(double[] x) - Predicts the class label of an instance. |
int | predict(double[] x, double[] posteriori) - Predicts the class label of an instance and also calculates a posteriori probabilities. |
public DecisionTree(double[][] x, int[] y, int maxNodes)
Constructor.
Parameters:
x - the training instances.
y - the response variable.
maxNodes - the maximum number of leaf nodes in the tree.

public DecisionTree(double[][] x, int[] y, int maxNodes, DecisionTree.SplitRule rule)
Constructor.
Parameters:
x - the training instances.
y - the response variable.
maxNodes - the maximum number of leaf nodes in the tree.
rule - the splitting rule.

public DecisionTree(double[][] x, int[] y, int maxNodes, int nodeSize, DecisionTree.SplitRule rule)
Constructor.
Parameters:
x - the training instances.
y - the response variable.
maxNodes - the maximum number of leaf nodes in the tree.
nodeSize - the minimum size of leaf nodes.
rule - the splitting rule.

public DecisionTree(smile.data.Attribute[] attributes, double[][] x, int[] y, int maxNodes)
Constructor.
Parameters:
attributes - the attribute properties.
x - the training instances.
y - the response variable.
maxNodes - the maximum number of leaf nodes in the tree.

public DecisionTree(smile.data.Attribute[] attributes, double[][] x, int[] y, int maxNodes, DecisionTree.SplitRule rule)
Constructor.
Parameters:
attributes - the attribute properties.
x - the training instances.
y - the response variable.
maxNodes - the maximum number of leaf nodes in the tree.
rule - the splitting rule.

public DecisionTree(smile.data.Attribute[] attributes, double[][] x, int[] y, int maxNodes, int nodeSize, DecisionTree.SplitRule rule)
Constructor.
Parameters:
attributes - the attribute properties.
x - the training instances.
y - the response variable.
maxNodes - the maximum number of leaf nodes in the tree.
nodeSize - the minimum size of leaf nodes.
rule - the splitting rule.

public DecisionTree(smile.data.Attribute[] attributes, double[][] x, int[] y, int maxNodes, int nodeSize, int mtry, DecisionTree.SplitRule rule, int[] samples, int[][] order)
Constructor.
Parameters:
attributes - the attribute properties.
x - the training instances.
y - the response variable.
maxNodes - the maximum number of leaf nodes in the tree.
nodeSize - the minimum size of leaf nodes.
mtry - the number of input variables to pick to split on at each node. It seems that sqrt(p) gives generally good performance, where p is the number of variables.
rule - the splitting rule.
samples - the sample set of instances for stochastic learning. samples[i] is the number of samplings of instance i.
order - the index of training values in ascending order. Note that only numeric attributes need be sorted.

public double[] importance()
Returns the variable importance.
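As a rough usage sketch (assuming the class lives in the smile.classification package, as in Smile 1.x; the toy data and names are purely illustrative), a tree can be trained with the simple three-argument constructor and then queried:

```java
import smile.classification.DecisionTree;

public class DecisionTreeExample {
    public static void main(String[] args) {
        // Toy training set: two numeric features, two class labels (0 and 1).
        double[][] x = {
            {1.0, 2.1}, {1.2, 1.9}, {0.8, 2.3},   // class 0
            {4.9, 7.0}, {5.1, 6.8}, {5.3, 7.2}    // class 1
        };
        int[] y = {0, 0, 0, 1, 1, 1};

        // Grow a tree with at most 4 leaf nodes.
        DecisionTree tree = new DecisionTree(x, y, 4);

        // Predict the class label of an unseen instance.
        int label = tree.predict(new double[]{5.0, 7.1});
        System.out.println("predicted class: " + label);
    }
}
```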
public int predict(double[] x)
Predicts the class label of an instance.
Specified by:
predict in interface Classifier<double[]>
Parameters:
x - the instance to be classified.

public int predict(double[] x, double[] posteriori)
Predicts the class label of an instance and also calculates a posteriori probabilities.
Specified by:
predict in interface SoftClassifier<double[]>
Parameters:
x - the instance to be classified.
posteriori - the array to store a posteriori probabilities on output.

public int maxDepth()
Returns the maximum depth of the tree, i.e. the number of nodes along the longest path from the root node down to the farthest leaf node.
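The soft-classification and introspection methods can be exercised in the same way; the following standalone sketch is again illustrative only (toy data, assumed smile.classification package):

```java
import smile.classification.DecisionTree;

public class DecisionTreePosterioriExample {
    public static void main(String[] args) {
        double[][] x = {{1.0, 2.1}, {1.2, 1.9}, {4.9, 7.0}, {5.1, 6.8}};
        int[] y = {0, 0, 1, 1};
        DecisionTree tree = new DecisionTree(x, y, 4);

        // predict(x, posteriori): the caller supplies an array with one
        // slot per class; the method fills it with posterior probabilities.
        double[] posteriori = new double[2];
        int label = tree.predict(new double[]{1.1, 2.0}, posteriori);
        System.out.printf("class %d, P(0)=%.3f, P(1)=%.3f%n",
                label, posteriori[0], posteriori[1]);

        // importance(): one score per input variable; larger values mean the
        // variable was used in more influential splits.
        double[] imp = tree.importance();
        for (int i = 0; i < imp.length; i++) {
            System.out.println("feature " + i + " importance = " + imp[i]);
        }

        // maxDepth(): number of nodes on the longest root-to-leaf path.
        System.out.println("max depth = " + tree.maxDepth());
    }
}
```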