public class NaiveBayes extends Object implements OnlineClassifier<double[]>
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations and are very popular in Natural Language Processing (NLP).
For a general-purpose naive Bayes classifier without any assumptions about the underlying distribution of each variable, we don't provide a learning method to infer the variable distributions from the training data. Instead, users can fit any appropriate distributions to the data themselves with the various Distribution classes. Although the predict(double[]) method takes an array of double values as a general form of independent variables, users are free to use any discrete distributions to model categorical or ordinal random variables.
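As a concrete illustration, here is a minimal sketch of constructing such a general-purpose classifier by hand. It assumes the class lives in the smile.classification package and uses GaussianDistribution from smile.stat.distribution to model each continuous variable; the priors and distribution parameters are purely illustrative values, not a recipe.

```java
import smile.classification.NaiveBayes;              // assumed package
import smile.stat.distribution.Distribution;
import smile.stat.distribution.GaussianDistribution;

public class GeneralNaiveBayesExample {
    public static void main(String[] args) {
        // Two classes, two continuous variables. The priors and the per-class
        // Gaussian parameters below are illustrative; in practice you would
        // fit them to your own training data.
        double[] priori = {0.6, 0.4};

        Distribution[][] condprob = {
            // class 0: P(x0 | class 0), P(x1 | class 0)
            { new GaussianDistribution(0.0, 1.0), new GaussianDistribution(5.0, 2.0) },
            // class 1: P(x0 | class 1), P(x1 | class 1)
            { new GaussianDistribution(3.0, 1.0), new GaussianDistribution(1.0, 2.0) }
        };

        NaiveBayes bayes = new NaiveBayes(priori, condprob);

        double[] x = {0.5, 4.2};
        int label = bayes.predict(x);
        System.out.println("predicted class = " + label);
    }
}
```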
For document classification in NLP, there are two different ways to set up a naive Bayes classifier: the multinomial model and the Bernoulli model. The multinomial model generates one term from the vocabulary in each position of the document. The multivariate Bernoulli model, or Bernoulli model, generates an indicator for each term of the vocabulary, indicating either the presence or the absence of that term in the document. Of the two models, the Bernoulli model is particularly sensitive to noise features. A Bernoulli naive Bayes classifier requires some form of feature selection or else its accuracy will be low.
The different generation models imply different estimation strategies and different classification rules. The Bernoulli model estimates the conditional probability P(t | c) as the fraction of documents of class c that contain term t. In contrast, the multinomial model estimates P(t | c) as the fraction of tokens or positions in documents of class c that contain term t. When classifying a test document, the Bernoulli model uses binary occurrence information, ignoring the number of occurrences, whereas the multinomial model keeps track of multiple occurrences. As a result, the Bernoulli model typically makes many mistakes when classifying long documents. However, it has been reported that the Bernoulli model works better in sentiment analysis.
The models also differ in how non-occurring terms are used in classification. Non-occurring terms do not affect the classification decision in the multinomial model, but in the Bernoulli model the probability of non-occurrence is factored in when computing the posterior P(c | d). This is because only the Bernoulli model models the absence of terms explicitly.
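The sketch below shows the document-classification setup under these assumptions: it assumes the smile.classification package location and that NaiveBayes.Model exposes a MULTINOMIAL constant (with BERNOULLI taking 0/1 presence indicators instead of counts); the vocabulary size and term-count vectors are toy values.

```java
import smile.classification.NaiveBayes;              // assumed package

public class DocumentNaiveBayesExample {
    public static void main(String[] args) {
        int k = 2;    // number of classes (e.g. spam vs. ham)
        int p = 5;    // vocabulary size, i.e. dimensionality of the input space

        // MULTINOMIAL is assumed to be one of the NaiveBayes.Model constants
        // described above; BERNOULLI would take presence indicators instead of counts.
        NaiveBayes bayes = new NaiveBayes(NaiveBayes.Model.MULTINOMIAL, k, p);

        // Toy term-frequency vectors over a 5-word vocabulary.
        double[][] x = {
            {3, 0, 1, 0, 0},   // class 0 document
            {2, 1, 0, 0, 0},   // class 0 document
            {0, 0, 1, 4, 2},   // class 1 document
            {0, 1, 0, 3, 1}    // class 1 document
        };
        int[] y = {0, 0, 1, 1};

        // Batch form of online learning; learn(double[], int) updates one document at a time.
        bayes.learn(x, y);

        int label = bayes.predict(new double[] {1, 0, 0, 2, 1});
        System.out.println("predicted class = " + label);
    }
}
```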
Modifier and Type | Class and Description |
---|---|
static class | NaiveBayes.Model The generation models of naive Bayes classifier. |
static class | NaiveBayes.Trainer Trainer for naive Bayes classifier for document classification. |
Constructor and Description |
---|
NaiveBayes(double[] priori, Distribution[][] condprob) Constructor of general naive Bayes classifier. |
NaiveBayes(NaiveBayes.Model model, double[] priori, int p) Constructor of naive Bayes classifier for document classification. |
NaiveBayes(NaiveBayes.Model model, double[] priori, int p, double sigma) Constructor of naive Bayes classifier for document classification. |
NaiveBayes(NaiveBayes.Model model, int k, int p) Constructor of naive Bayes classifier for document classification. |
NaiveBayes(NaiveBayes.Model model, int k, int p, double sigma) Constructor of naive Bayes classifier for document classification. |
Modifier and Type | Method and Description |
---|---|
double[] | getPriori() Returns a priori probabilities. |
void | learn(double[][] x, int[] y) Online learning of naive Bayes classifier on sequences, which are modeled as a bag of words. |
void | learn(double[] x, int y) Online learning of naive Bayes classifier on a sequence, which is modeled as a bag of words. |
int | predict(double[] x) Predict the class of an instance. |
int | predict(double[] x, double[] posteriori) Predict the class of an instance. |
public NaiveBayes(double[] priori, Distribution[][] condprob)

Parameters:
    priori - the priori probability of each class.
    condprob - the conditional distribution of each variable in each class. In particular, condprob[i][j] is the conditional distribution P(xj | class i).

public NaiveBayes(NaiveBayes.Model model, int k, int p)

Parameters:
    model - the generation model of naive Bayes classifier.
    k - the number of classes.
    p - the dimensionality of input space.

public NaiveBayes(NaiveBayes.Model model, int k, int p, double sigma)

Parameters:
    model - the generation model of naive Bayes classifier.
    k - the number of classes.
    p - the dimensionality of input space.
    sigma - the prior count of add-k smoothing of evidence.

public NaiveBayes(NaiveBayes.Model model, double[] priori, int p)

Parameters:
    model - the generation model of naive Bayes classifier.
    priori - the priori probability of each class.
    p - the dimensionality of input space.

public NaiveBayes(NaiveBayes.Model model, double[] priori, int p, double sigma)

Parameters:
    model - the generation model of naive Bayes classifier.
    priori - the priori probability of each class.
    p - the dimensionality of input space.
    sigma - the prior count of add-k smoothing of evidence.

public double[] getPriori()

Returns a priori probabilities.
public void learn(double[] x, int y)

Specified by:
    learn in interface OnlineClassifier<double[]>

Parameters:
    x - training instance.
    y - training label in [0, k), where k is the number of classes.

public void learn(double[][] x, int[] y)

Parameters:
    x - training instances.
    y - training labels in [0, k), where k is the number of classes.

public int predict(double[] x)

Specified by:
    predict in interface Classifier<double[]>

Parameters:
    x - the instance to be classified.

public int predict(double[] x, double[] posteriori)

Specified by:
    predict in interface Classifier<double[]>

Parameters:
    x - the instance to be classified.
    posteriori - the array to store a posteriori probabilities on output.
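Finally, a minimal sketch of predict(double[], double[]), which fills a caller-supplied array with a posteriori probabilities and returns the predicted label; the class count, dimensionality, and training vectors are illustrative, and the MULTINOMIAL constant and package location are assumptions.

```java
import java.util.Arrays;
import smile.classification.NaiveBayes;              // assumed package

public class PosterioriExample {
    public static void main(String[] args) {
        int k = 2, p = 4;
        NaiveBayes bayes = new NaiveBayes(NaiveBayes.Model.MULTINOMIAL, k, p);

        // Tiny training set of term-count vectors (illustrative values only).
        bayes.learn(new double[][] {{2, 1, 0, 0}, {0, 0, 3, 1}}, new int[] {0, 1});

        // predict(x, posteriori) stores a posteriori probabilities, one entry
        // per class, in the supplied array and returns the predicted label.
        double[] posteriori = new double[k];
        int label = bayes.predict(new double[] {1, 0, 0, 2}, posteriori);

        System.out.println("label = " + label + ", posteriori = " + Arrays.toString(posteriori));
    }
}
```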