Transformer that uses CoreNLP to (in order):
- Tokenize document
- Lemmatize tokens
- Replace entities w/ their type (e.g. "Jon" => "NAME", "Paris" => "PLACE")
- Return n-grams for the above (respecting sentence boundaries)
Note: Much slower than just using Tokenizer followed by NGramsFeaturizer.
The size of the n-grams to output
Converts a sequence of terms to a sparse vector representing their frequencies, using the hashing trick: https://en.wikipedia.org/wiki/Feature_hashing
Terms are hashed using Scala's .## method. We may want to convert to MurmurHash3 for strings, as discussed for Spark's ML Pipelines in https://issues.apache.org/jira/browse/SPARK-10574.
The desired feature space to convert to using the hashing trick.
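As a minimal, Spark-free sketch of the hashing trick described above (names here are assumptions, not the library's API): each term is hashed with Scala's .## and bucketed into a fixed feature space, with collisions simply adding up.

```scala
// Hypothetical sketch of the hashing trick: hash each term with .## and
// bucket it into numFeatures slots; colliding terms add their counts.
object HashingTFSketch {
  def hashTerms(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
    terms.foldLeft(Map.empty[Int, Double]) { (vec, term) =>
      // force a non-negative bucket index even for negative hash codes
      val idx = ((term.## % numFeatures) + numFeatures) % numFeatures
      vec.updated(idx, vec.getOrElse(idx, 0.0) + 1.0)
    }
}
```

The returned map plays the role of a sparse frequency vector over the hashed feature space.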
Partitions each ngram by hashing on its first two words (first as in farthest away from the current word), then mod by numPartitions.
Useful for grouping ngrams that share the first two words in context. An example usage is the StupidBackoffEstimator.
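A hypothetical sketch of that partitioning rule (the object and method names are assumptions): hash only the two context words farthest from the current word, so every ngram sharing that two-word context lands in the same partition.

```scala
// Hypothetical sketch: partition an ngram by hashing its first two words
// (the farthest context words), then mod by numPartitions.
object NGramPartitionerSketch {
  def partition[W](ngram: Seq[W], numPartitions: Int): Int = {
    require(ngram.length >= 2, "need at least two context words")
    val h = (ngram(0), ngram(1)).##   // hash of the first two words only
    ((h % numPartitions) + numPartitions) % numPartitions
  }
}
```

Because only the leading two words enter the hash, "a b c" and "a b d" always co-locate, which is what a backoff estimator needs.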
type of each word (e.g. Int or String)
Transformer that converts a String to lower case
The locale to use. Defaults to Locale.getDefault
An NGram representation that is a thin wrapper over Array[String]. The underlying tokens can be accessed via words. Its hashCode and equals implementations are sane so that it can be used as a key in RDDs or hash tables.
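A minimal sketch of why such a wrapper is needed (the class name is an assumption; only the equals/hashCode idea mirrors the text): Array[String] uses reference equality, so a raw array cannot serve as a hash-table key.

```scala
// Sketch of a thin ngram wrapper with content-based equality, so it can be
// used as a key in hash tables (and, analogously, in RDDs).
class NGramSketch(val words: Array[String]) {
  private def refs = words.asInstanceOf[Array[AnyRef]]
  override def hashCode: Int = java.util.Arrays.hashCode(refs)
  override def equals(other: Any): Boolean = other match {
    case that: NGramSketch => java.util.Arrays.equals(refs, that.refs)
    case _                 => false
  }
  override def toString: String = words.mkString("[", ",", "]")
}
```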
A simple transformer that represents each ngram as an NGram and counts their occurrences. Returns an RDD[(NGram, Int)] that is sorted by frequency in descending order.
This implementation may not be space-efficient, but should handle commonly-sized workloads well.
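A hedged, Spark-free sketch of the counting step (the real transformer does this over an RDD; names here are assumptions): group identical ngrams, count occurrences, and sort by descending frequency.

```scala
// Sketch of ngram counting: group identical ngrams, count them, and sort
// by frequency in descending order.
object NGramsCountsSketch {
  def countNGrams(ngrams: Seq[Seq[String]]): Seq[(Seq[String], Int)] =
    ngrams.groupBy(identity)
      .map { case (ngram, occurrences) => (ngram, occurrences.size) }
      .toSeq
      .sortBy(-_._2)   // descending frequency
}
```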
a control flag defined in NGramsCountsMode
An ngram featurizer.
valid ngram orders, must be consecutive positive integers
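A hypothetical sketch of ngram extraction over a consecutive range of orders (names are assumptions): orders 1 to 3, for example, yield all unigrams, bigrams, and trigrams of the input.

```scala
// Sketch of an ngram featurizer for a consecutive range of positive orders.
object NGramsFeaturizerSketch {
  def ngrams(tokens: Seq[String], minOrder: Int, maxOrder: Int): Seq[Seq[String]] = {
    require(minOrder >= 1 && maxOrder >= minOrder, "orders must be consecutive positive integers")
    (minOrder to maxOrder).flatMap { n =>
      tokens.sliding(n).filter(_.length == n).toSeq   // drop short trailing windows
    }
  }
}
```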
Converts the n-grams of a sequence of terms to a sparse vector representing their frequencies, using the hashing trick: https://en.wikipedia.org/wiki/Feature_hashing
It computes a rolling MurmurHash3 instead of fully constructing the n-grams, making it more efficient than using NGramsFeaturizer followed by HashingTF, although it should return the exact same feature vector. The MurmurHash3 methods are copied from scala.util.hashing.MurmurHash3
Individual terms are hashed using Scala's .## method. We may want to convert to MurmurHash3 for strings, as discussed for Spark's ML Pipelines in https://issues.apache.org/jira/browse/SPARK-10574.
valid ngram orders, must be consecutive positive integers
The desired feature space to convert to using the hashing trick.
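The rolling-hash idea can be sketched as follows (the seed value 42 and the names are assumptions of this sketch): extend a running MurmurHash3 state by one token at a time and finalize at each length, instead of materializing every ngram and hashing it from scratch. Each finalized value equals MurmurHash3.orderedHash of the corresponding token prefix.

```scala
import scala.util.hashing.MurmurHash3

// Sketch of a rolling MurmurHash3: mix one token at a time into a running
// state; finalizeHash does not disturb the state, so every prefix length
// yields the same value as hashing that prefix from scratch.
object RollingHashSketch {
  val seed = 42
  def prefixHashes(tokens: Seq[String]): Seq[Int] = {
    var h = seed
    tokens.zipWithIndex.map { case (token, i) =>
      h = MurmurHash3.mix(h, token.##)     // extend the running state
      MurmurHash3.finalizeHash(h, i + 1)   // hash of tokens(0..i), state preserved
    }
  }
}
```

Running this per start position gives every ngram's hash in one pass, which is the efficiency win over NGramsFeaturizer followed by HashingTF.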
Estimates a Stupid Backoff ngram language model, which was introduced in the following paper:
Brants, Thorsten, et al. "Large language models in machine translation." 2007.
The results are scores indicating likeliness of each ngram, but they are not normalized probabilities. The score for an n-gram is defined recursively:
S(w_i | w_{i-n+1}^{i-1}) :=
  if freq(w_{i-n+1}^{i}) > 0: freq(w_{i-n+1}^{i}) / freq(w_{i-n+1}^{i-1})
  otherwise: \alpha * S(w_i | w_{i-n+2}^{i-1})
S(w_i) := freq(w_i) / N, where N is the total number of tokens in the training corpus.
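The recursion above can be sketched directly over a precomputed count table. The Map-based counts and alpha = 0.4 (the value suggested by Brants et al.) are assumptions of this sketch, not the estimator's actual API.

```scala
// Sketch of the Stupid Backoff score: use the relative frequency when the
// full ngram was observed, otherwise back off to a shorter context with a
// fixed multiplier alpha. Assumes a consistent count table (every observed
// ngram's context is also counted).
object StupidBackoffSketch {
  def score(ngram: List[String], counts: Map[List[String], Long],
            totalTokens: Long, alpha: Double = 0.4): Double =
    if (ngram.length == 1)
      counts.getOrElse(ngram, 0L).toDouble / totalTokens          // S(w_i) = freq(w_i) / N
    else {
      val num = counts.getOrElse(ngram, 0L)
      if (num > 0) num.toDouble / counts(ngram.init)              // full context observed
      else alpha * score(ngram.tail, counts, totalTokens, alpha)  // back off
    }
}
```

Note the scores are not normalized probabilities, exactly as stated above.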
the pre-computed unigram counts of the training corpus
hyperparameter that gets multiplied once per backoff
Transformer that tokenizes a String into a Seq[String] by splitting on a regular expression.
the delimiting regular expression to split on. Defaults to matching all punctuation and whitespace
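A minimal sketch of the tokenizer; the default pattern matching runs of punctuation and whitespace is written out explicitly here as an assumption.

```scala
// Sketch of a regex tokenizer: split on the delimiting pattern and drop
// empty fragments produced by leading delimiters.
object TokenizerSketch {
  def tokenize(s: String, delim: String = "[\\p{Punct}\\s]+"): Seq[String] =
    s.split(delim).filter(_.nonEmpty).toSeq
}
```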
Encodes string tokens as non-negative integers, which are indices of the tokens' positions in the sorted-by-frequency order. Out-of-vocabulary words are mapped to the special index -1.
The parameters passed to this class are usually calculated by WordFrequencyEncoder.
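A hedged sketch of the encoding scheme (names, and breaking frequency ties alphabetically, are assumptions): rank words by descending frequency and map out-of-vocabulary tokens to -1.

```scala
// Sketch of word-frequency encoding: the most frequent word gets index 0,
// the next gets 1, and so on; unseen words encode to -1.
object WordFrequencyEncoderSketch {
  def fit(corpus: Seq[String]): Map[String, Int] =
    corpus.groupBy(identity).toSeq
      .map { case (word, occurrences) => (word, occurrences.size) }
      .sortBy { case (word, count) => (-count, word) }  // descending frequency
      .map(_._1)
      .zipWithIndex
      .toMap
  def encode(vocab: Map[String, Int], tokens: Seq[String]): Seq[Int] =
    tokens.map(vocab.getOrElse(_, -1))                  // OOV -> -1
}
```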
Control flags used for NGramsCounts. Use Default if counts are to be aggregated across partitions and sorted; use NoAdd to just count ngrams within each partition, with no cross-partition summing or sorting.
Packs up to 3 words (trigrams) into a single Long by bit packing.
Assumptions: (1) |Vocab| <= one million (20 bits per word). (2) Words get mapped into [0, |Vocab|). In particular, each word ID < 2**20.
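A sketch of the 20-bit packing layout. One assumption beyond the text: word IDs are kept >= 1 in this sketch so that the ngram's order can be read back from the packed value (an ID of 0 in the leading position would be ambiguous); the real indexer may encode order differently.

```scala
// Sketch of bit-packing up to a trigram into one Long: 20 bits per word,
// most recent word in the lowest bits.
object PackedNGramSketch {
  val Bits = 20
  val Mask = (1L << Bits) - 1
  def pack(words: Seq[Int]): Long = {
    require(words.nonEmpty && words.length <= 3, "packs up to trigrams")
    require(words.forall(w => w >= 1 && w < (1 << Bits)), "each word ID fits in 20 bits")
    words.foldLeft(0L)((acc, w) => (acc << Bits) | w)
  }
  def order(packed: Long): Int =           // 1, 2 or 3
    if ((packed >>> (2 * Bits)) != 0) 3
    else if ((packed >>> Bits) != 0) 2
    else 1
  def unpack(packed: Long): Seq[Int] =
    ((order(packed) - 1) to 0 by -1).map(i => ((packed >>> (i * Bits)) & Mask).toInt)
}
```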
Transformer that trims a String of leading and trailing whitespace
A family of NGramIndexer that can unpack or strip off specific words, query the order of a packed ngram, etc.
Such indexers are useful for LMs that require backoff contexts (e.g. Stupid Backoff, KN).