package smile.nlp

Natural language processing.

Value Members

  1. object $dummy

    A workaround for scaladoc issue 8124. Users should ignore this object.

  2. def bigram(p: Double, minFreq: Int, text: String*): Array[nlp.collocation.Bigram]

    Identify bigram collocations whose p-value is less than the given threshold.

    p

    the p-value threshold

    minFreq

    the minimum frequency of collocation.

    text

    input text.

    returns

    significant bigram collocations in descending order of likelihood ratio.

  3. def bigram(k: Int, minFreq: Int, text: String*): Array[nlp.collocation.Bigram]

    Identify bigram collocations (words that often appear consecutively) within corpora. They may also be used to find other associations between word occurrences.

    Finding collocations requires first calculating the frequencies of words and of their appearance in the context of other words. The collection of words often then requires filtering to retain only useful content terms. Each n-gram of words may then be scored according to some association measure to determine the relative likelihood of each n-gram being a collocation.

    k

    the number of top-ranked bigrams to find.

    minFreq

    the minimum frequency of collocation.

    text

    input text.

    returns

    significant bigram collocations in descending order of likelihood ratio.
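
    The steps described for this member (count frequencies, filter by minimum frequency, score each bigram with an association measure, rank) can be sketched in plain Scala. This is only an illustration, not the library's implementation: the object `BigramSketch`, the case class `ScoredBigram`, the method `topBigrams`, and the lowercase letter-only tokenizer are all hypothetical, and the score is Dunning's log-likelihood ratio computed over a 2x2 contingency table of adjacent word pairs.

```scala
object BigramSketch {
  // Hypothetical result type; not the library's nlp.collocation.Bigram.
  final case class ScoredBigram(w1: String, w2: String, count: Int, score: Double)

  // observed * ln(observed / expected), with the convention 0 * ln(0) = 0.
  private def term(observed: Double, expected: Double): Double =
    if (observed == 0.0) 0.0 else observed * math.log(observed / expected)

  /** Log-likelihood ratio (G^2) for a pair seen c12 times, where c1 pairs start
    * with the first word, c2 pairs end with the second word, and n is the
    * total number of pairs. */
  def llr(c12: Int, c1: Int, c2: Int, n: Int): Double = {
    val o11 = c12.toDouble        // w1 followed by w2
    val o12 = (c1 - c12).toDouble // w1 followed by another word
    val o21 = (c2 - c12).toDouble // another word followed by w2
    val o22 = n - o11 - o12 - o21 // neither
    val (row1, row2) = (o11 + o12, o21 + o22)
    val (col1, col2) = (o11 + o21, o12 + o22)
    2.0 * (term(o11, row1 * col1 / n) + term(o12, row1 * col2 / n) +
           term(o21, row2 * col1 / n) + term(o22, row2 * col2 / n))
  }

  /** Top k bigrams occurring at least minFreq times, by descending score. */
  def topBigrams(k: Int, minFreq: Int, text: String*): Seq[ScoredBigram] = {
    // Assumed tokenizer: lowercase, split on anything that is not a-z.
    val pairs = text.flatMap { doc =>
      val words = doc.toLowerCase.split("[^a-z]+").filter(_.nonEmpty)
      words.zip(words.drop(1)).toSeq // adjacent pairs within one text
    }
    val n = pairs.size
    val pairCount = pairs.groupBy(identity).map { case (p, occ) => p -> occ.size }
    val firstCount = pairs.groupBy(_._1).map { case (w, occ) => w -> occ.size }
    val secondCount = pairs.groupBy(_._2).map { case (w, occ) => w -> occ.size }
    pairCount.toSeq
      .filter { case (_, c) => c >= minFreq }
      .map { case ((w1, w2), c) =>
        ScoredBigram(w1, w2, c, llr(c, firstCount(w1), secondCount(w2), n))
      }
      .sortBy(b => -b.score)
      .take(k)
  }
}
```

    Note that the log-likelihood ratio measures strength of association in either direction, so a production scorer would typically also check that the pair occurs more often than expected.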

  4. def corpus(text: Seq[String]): SimpleCorpus

    Creates an in-memory text corpus.

    text

    a collection of text documents.

  5. def df(terms: Array[String], corpus: Array[Map[String, Int]]): Array[Int]

    Returns the document frequencies, i.e. for each term, the number of documents that contain it.

    terms

    the token list used as features.

    corpus

    the training corpus.

    returns

    the array of document frequencies.
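
    As an illustration of what a document frequency is, the following plain-Scala sketch counts, for each term, the documents whose bag of words contains it. The object name `DocumentFrequencySketch` is hypothetical, and the code mirrors only the signature shown above, not necessarily the library's implementation.

```scala
object DocumentFrequencySketch {
  /** For each term, counts the documents whose bag contains that term. */
  def df(terms: Array[String], corpus: Array[Map[String, Int]]): Array[Int] =
    terms.map(term => corpus.count(doc => doc.contains(term)))

  def main(args: Array[String]): Unit = {
    val corpus = Array(
      Map("cat" -> 2, "sat" -> 1),
      Map("cat" -> 1, "mat" -> 3),
      Map("dog" -> 1)
    )
    val terms = Array("cat", "dog", "mat", "bird")
    println(df(terms, corpus).mkString(", ")) // prints "2, 1, 1, 0"
  }
}
```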

  6. val lancaster: LancasterStemmer { def apply(word: String): String }

    The Paice/Husk Lancaster stemming algorithm. The stemmer is a conflation-based iterative stemmer. Although efficient and easy to implement, it is known to be very strong and aggressive. It uses a single table of rules, each of which may specify the removal or replacement of an ending.

  7. def ngram(maxNGramSize: Int, minFreq: Int, text: String*): Array[Array[nlp.collocation.NGram]]

    An Apriori-like algorithm to extract n-gram phrases.

    maxNGramSize

    The maximum length of n-gram

    minFreq

    The minimum frequency of n-gram in the sentences.

    text

    input text.

    returns

    An array of sets of n-grams. The i-th entry is the set of i-grams.
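
    To make the "Apriori-like" idea concrete, here is a minimal level-wise sketch in plain Scala. It is an illustration under stated assumptions, not the library's algorithm: the object `NGramSketch` and method `frequentNGrams` are hypothetical, input sentences are assumed to be already tokenized, and the pruning rule used is that an n-gram is only counted if both of its (n-1)-gram sub-phrases were already found frequent.

```scala
object NGramSketch {
  /** Returns an array whose i-th entry maps each frequent i-gram to its count.
    * Entry 0 is always empty. */
  def frequentNGrams(
      maxNGramSize: Int,
      minFreq: Int,
      sentences: Seq[Seq[String]]
  ): Array[Map[List[String], Int]] = {
    val result = Array.fill(maxNGramSize + 1)(Map.empty[List[String], Int])
    for (n <- 1 to maxNGramSize) {
      val candidates = sentences
        // sliding(n) yields one short window for sentences shorter than n,
        // so keep only windows of exactly n words.
        .flatMap(_.sliding(n).filter(_.length == n).map(_.toList))
        // Apriori pruning: both (n-1)-gram sub-phrases must be frequent.
        .filter { gram =>
          n == 1 ||
          (result(n - 1).contains(gram.init) && result(n - 1).contains(gram.tail))
        }
      result(n) = candidates
        .groupBy(identity)
        .map { case (gram, occurrences) => gram -> occurrences.size }
        .filter { case (_, count) => count >= minFreq }
    }
    result
  }
}
```

    The pruning step is what keeps the search tractable: longer phrases are only counted when their shorter parts have already passed the frequency threshold.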

  8. implicit def pimpString(string: String): PimpedString

    Implicitly converts a String to a PimpedString.

  9. val porter: PorterStemmer { def apply(word: String): String }

    Porter's stemming algorithm. The stemmer is based on the idea that suffixes in the English language are mostly made up of combinations of smaller and simpler suffixes. It is a linear step stemmer with five steps, each of which applies a set of rules. Within each step, if a suffix rule matches a word, the conditions attached to that rule are tested on the stem that would result from removing the suffix in the way the rule defines. Once a rule passes its conditions, it fires: the suffix is removed and control moves to the next step. If the rule is not accepted, the next rule in the step is tested, until either a rule from that step fires or there are no more rules in that step, at which point control moves to the next step.

  10. def postag(sentence: Array[String]): Array[PennTreebankPOS]

    Tags each word of a sentence with its part of speech.

    sentence

    a sentence that is already segmented into words.

    returns

    the POS tags.

  11. def tfidf(bag: Array[Double], n: Int, df: Array[Int]): Array[Double]

    Converts a bag of words to a feature vector by TF-IDF, which is normalized to L2 norm 1.

    bag

    the bag-of-words feature vector of a document.

    n

    the number of documents in training corpus.

    df

    the document frequencies, i.e. for each term, the number of documents in the corpus that contain it.

    returns

    TF-IDF feature vector
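
    The following plain-Scala sketch shows one way such a conversion can work. The weighting is an assumption for illustration (term frequency scaled by the maximum term frequency in the document, multiplied by a smoothed inverse document frequency `log((1 + n) / (1 + df))`), and may differ from what the library actually computes; the object name `TfIdfSketch` is hypothetical. The final step rescales the vector to unit L2 norm, as the description above states.

```scala
object TfIdfSketch {
  /** Assumed weighting: (tf / maxTf) * log((1 + n) / (1 + df)), then L2-normalized. */
  def tfidf(bag: Array[Double], n: Int, df: Array[Int]): Array[Double] = {
    require(bag.length == df.length, "bag and df must have the same length")
    val maxTf = if (bag.isEmpty) 0.0 else bag.max
    if (maxTf <= 0.0) {
      // Empty document: nothing to weight.
      Array.fill(bag.length)(0.0)
    } else {
      val weights = bag.indices.map { i =>
        (bag(i) / maxTf) * math.log((1.0 + n) / (1.0 + df(i)))
      }.toArray
      val norm = math.sqrt(weights.map(w => w * w).sum)
      // Guard against an all-zero vector (e.g. every term occurs in every document).
      if (norm == 0.0) weights else weights.map(_ / norm)
    }
  }
}
```

    For example, `TfIdfSketch.tfidf(Array(3.0, 0.0, 1.0), 4, Array(1, 2, 3))` gives a vector whose first component dominates, because the first term is both frequent in the document and rare in the corpus.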

  12. def tfidf(corpus: Array[Array[Double]]): Array[Array[Double]]

    Converts a corpus to TF-IDF feature vectors, which are normalized to L2 norm 1.

    corpus

    the corpus of documents in bag-of-words representation.

    returns

    a matrix in which each row is the TF-IDF feature vector of a document.

  13. def vectorize(terms: Array[String], bag: Set[String]): Array[Int]

    Converts a binary bag of words to a sparse feature vector.

    terms

    the token list used as features.

    bag

    the bag of words.

    returns

    an integer vector whose elements are the indices of the feature tokens present in the bag, in ascending order.
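
    A minimal plain-Scala sketch of this sparse encoding, assuming the index of a feature is simply its position in `terms`. The object name `SparseVectorizeSketch` is hypothetical and the code is an illustration of the described behavior, not the library's source.

```scala
object SparseVectorizeSketch {
  /** Indices (positions in `terms`) of the feature tokens present in the bag,
    * in ascending order. */
  def vectorize(terms: Array[String], bag: Set[String]): Array[Int] =
    terms.indices.filter(i => bag.contains(terms(i))).toArray

  def main(args: Array[String]): Unit = {
    val terms = Array("cat", "dog", "mat", "sat")
    // "bird" is not a feature, so it is ignored.
    println(vectorize(terms, Set("sat", "cat", "bird")).mkString(", ")) // prints "0, 3"
  }
}
```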

  14. def vectorize(terms: Array[String], bag: Map[String, Int]): Array[Double]

    Converts a bag of words to a feature vector.

    terms

    the token list used as features.

    bag

    the bag of words.

    returns

    a vector of the frequencies of the feature tokens in the bag.
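
    A minimal plain-Scala sketch of this dense encoding: each position holds the count of the corresponding feature token, or zero when the token is absent from the bag. The object name `DenseVectorizeSketch` is hypothetical and the code illustrates the described behavior rather than reproducing the library's source.

```scala
object DenseVectorizeSketch {
  /** Frequency of each feature token in the bag; 0.0 for tokens not in the bag. */
  def vectorize(terms: Array[String], bag: Map[String, Int]): Array[Double] =
    terms.map(term => bag.getOrElse(term, 0).toDouble)

  def main(args: Array[String]): Unit = {
    val terms = Array("cat", "dog", "mat", "sat")
    val bag = Map("cat" -> 2, "sat" -> 1, "bird" -> 5) // "bird" is not a feature
    println(vectorize(terms, bag).mkString(", ")) // prints "2.0, 0.0, 0.0, 1.0"
  }
}
```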
