Transformer that uses CoreNLP to (in order):
- Tokenize document
- Lemmatize tokens
- Replace entities w/ their type (e.g. "Jon" => "NAME", "Paris" => "PLACE")
- Return n-grams for the above (respecting sentence boundaries)
Note: Much slower than just using Tokenizer followed by NGramsFeaturizer.
The size of the n-grams to output
Converts a sequence of terms to a sparse vector representing their frequencies, using the hashing trick: https://en.wikipedia.org/wiki/Feature_hashing
Terms are hashed using Scala's .## method. We may want to convert to MurmurHash3 for strings, as discussed for Spark's ML Pipelines in https://issues.apache.org/jira/browse/SPARK-10574.
The desired feature space to convert to using the hashing trick.
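As a minimal, Spark-free sketch of the hashing trick described above (names here are assumptions, not the library's API): each term is hashed with Scala's .## and bucketed into a fixed feature space, with collisions simply adding up.

```scala
// Hypothetical sketch of the hashing trick: hash each term with .## and
// bucket it into numFeatures slots; colliding terms add their counts.
object HashingTFSketch {
  def hashTerms(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
    terms.foldLeft(Map.empty[Int, Double]) { (vec, term) =>
      // force a non-negative bucket index even for negative hash codes
      val idx = ((term.## % numFeatures) + numFeatures) % numFeatures
      vec.updated(idx, vec.getOrElse(idx, 0.0) + 1.0)
    }
}
```

The returned map plays the role of a sparse frequency vector over the hashed feature space.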
Partitions each ngram by hashing on its first two words (first as in farthest away from the current word), then mod by numPartitions.
Useful for grouping ngrams that share the first two words in context. An example usage is the StupidBackoffEstimator.
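A hypothetical sketch of that partitioning rule (the object and method names are assumptions): hash only the two context words farthest from the current word, so every ngram sharing that two-word context lands in the same partition.

```scala
// Hypothetical sketch: partition an ngram by hashing its first two words
// (the farthest context words), then mod by numPartitions.
object NGramPartitionerSketch {
  def partition[W](ngram: Seq[W], numPartitions: Int): Int = {
    require(ngram.length >= 2, "need at least two context words")
    val h = (ngram(0), ngram(1)).##   // hash of the first two words only
    ((h % numPartitions) + numPartitions) % numPartitions
  }
}
```

Because only the leading two words enter the hash, "a b c" and "a b d" always co-locate, which is what a backoff estimator needs.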
type of each word (e.g. Int or String)
Transformer that converts a String to lower case
The locale to use. Defaults to Locale.getDefault
An NGram representation that is a thin wrapper over Array[String]. The underlying tokens can be accessed via words. Its hashCode and equals implementations are sane so that it can be used as a key in RDDs or hash tables.
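A minimal sketch of why such a wrapper is needed (the class name is an assumption; only the equals/hashCode idea mirrors the text): Array[String] uses reference equality, so a raw array cannot serve as a hash-table key.

```scala
// Sketch of a thin ngram wrapper with content-based equality, so it can be
// used as a key in hash tables (and, analogously, in RDDs).
class NGramSketch(val words: Array[String]) {
  private def refs = words.asInstanceOf[Array[AnyRef]]
  override def hashCode: Int = java.util.Arrays.hashCode(refs)
  override def equals(other: Any): Boolean = other match {
    case that: NGramSketch => java.util.Arrays.equals(refs, that.refs)
    case _                 => false
  }
  override def toString: String = words.mkString("[", ",", "]")
}
```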
A simple transformer that represents each ngram as an NGram and counts their occurrences. Returns an RDD[(NGram, Int)] that is sorted by frequency in descending order.
This implementation may not be space-efficient, but should handle commonly-sized workloads well.
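A hedged, Spark-free sketch of the counting step (the real transformer does this over an RDD; names here are assumptions): group identical ngrams, count occurrences, and sort by descending frequency.

```scala
// Sketch of ngram counting: group identical ngrams, count them, and sort
// by frequency in descending order.
object NGramsCountsSketch {
  def countNGrams(ngrams: Seq[Seq[String]]): Seq[(Seq[String], Int)] =
    ngrams.groupBy(identity)
      .map { case (ngram, occurrences) => (ngram, occurrences.size) }
      .toSeq
      .sortBy(-_._2)   // descending frequency
}
```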
a control flag defined in NGramsCountsMode
An ngram featurizer.
valid ngram orders, must be consecutive positive integers
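A hypothetical sketch of ngram extraction over a consecutive range of orders (names are assumptions): orders 1 to 3, for example, yield all unigrams, bigrams, and trigrams of the input.

```scala
// Sketch of an ngram featurizer for a consecutive range of positive orders.
object NGramsFeaturizerSketch {
  def ngrams(tokens: Seq[String], minOrder: Int, maxOrder: Int): Seq[Seq[String]] = {
    require(minOrder >= 1 && maxOrder >= minOrder, "orders must be consecutive positive integers")
    (minOrder to maxOrder).flatMap { n =>
      tokens.sliding(n).filter(_.length == n).toSeq   // drop short trailing windows
    }
  }
}
```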
Converts the n-grams of a sequence of terms to a sparse vector representing their frequencies, using the hashing trick: https://en.wikipedia.org/wiki/Feature_hashing
It computes a rolling MurmurHash3 instead of fully constructing the n-grams, making it more efficient than using NGramsFeaturizer followed by HashingTF, although it should return the exact same feature vector. The MurmurHash3 methods are copied from scala.util.hashing.MurmurHash3
Individual terms are hashed using Scala's .## method. We may want to convert to MurmurHash3 for strings, as discussed for Spark's ML Pipelines in https://issues.apache.org/jira/browse/SPARK-10574.
valid ngram orders, must be consecutive positive integers
The desired feature space to convert to using the hashing trick.
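The rolling-hash idea can be sketched as follows (the seed value 42 and the names are assumptions of this sketch): extend a running MurmurHash3 state by one token at a time and finalize at each length, instead of materializing every ngram and hashing it from scratch. Each finalized value equals MurmurHash3.orderedHash of the corresponding token prefix.

```scala
import scala.util.hashing.MurmurHash3

// Sketch of a rolling MurmurHash3: mix one token at a time into a running
// state; finalizeHash does not disturb the state, so every prefix length
// yields the same value as hashing that prefix from scratch.
object RollingHashSketch {
  val seed = 42
  def prefixHashes(tokens: Seq[String]): Seq[Int] = {
    var h = seed
    tokens.zipWithIndex.map { case (token, i) =>
      h = MurmurHash3.mix(h, token.##)     // extend the running state
      MurmurHash3.finalizeHash(h, i + 1)   // hash of tokens(0..i), state preserved
    }
  }
}
```

Running this per start position gives every ngram's hash in one pass, which is the efficiency win over NGramsFeaturizer followed by HashingTF.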
Estimates a Stupid Backoff ngram language model, which was introduced in the following paper:
Brants, Thorsten, et al. "Large language models in machine translation." 2007.
The results are scores indicating likeliness of each ngram, but they are not normalized probabilities. The score for an n-gram is defined recursively:
S(w_i | w_{i-n+1}^{i-1}) :=
  if freq(w_{i-n+1}^{i}) > 0: freq(w_{i-n+1}^{i}) / freq(w_{i-n+1}^{i-1})
  otherwise: \alpha * S(w_i | w_{i-n+2}^{i-1})
S(w_i) := freq(w_i) / N, where N is the total number of tokens in the training corpus.
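The recursion above can be sketched directly over a precomputed count table. The Map-based counts and alpha = 0.4 (the value suggested by Brants et al.) are assumptions of this sketch, not the estimator's actual API.

```scala
// Sketch of the Stupid Backoff score: use the relative frequency when the
// full ngram was observed, otherwise back off to a shorter context with a
// fixed multiplier alpha. Assumes a consistent count table (every observed
// ngram's context is also counted).
object StupidBackoffSketch {
  def score(ngram: List[String], counts: Map[List[String], Long],
            totalTokens: Long, alpha: Double = 0.4): Double =
    if (ngram.length == 1)
      counts.getOrElse(ngram, 0L).toDouble / totalTokens          // S(w_i) = freq(w_i) / N
    else {
      val num = counts.getOrElse(ngram, 0L)
      if (num > 0) num.toDouble / counts(ngram.init)              // full context observed
      else alpha * score(ngram.tail, counts, totalTokens, alpha)  // back off
    }
}
```

Note the scores are not normalized probabilities, exactly as stated above.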
the pre-computed unigram counts of the training corpus
hyperparameter that gets multiplied once per backoff
Transformer that tokenizes a String into a Seq[String] by splitting on a regular expression.
the delimiting regular expression to split on. Defaults to matching all punctuation and whitespace
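A minimal sketch of the tokenizer; the default pattern matching runs of punctuation and whitespace is written out explicitly here as an assumption.

```scala
// Sketch of a regex tokenizer: split on the delimiting pattern and drop
// empty fragments produced by leading delimiters.
object TokenizerSketch {
  def tokenize(s: String, delim: String = "[\\p{Punct}\\s]+"): Seq[String] =
    s.split(delim).filter(_.nonEmpty).toSeq
}
```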
Encodes string tokens as non-negative integers, which are indices of the tokens' positions in the sorted-by-frequency order. Out-of-vocabulary words are mapped to the special index -1.
The parameters passed to this class are usually calculated by WordFrequencyEncoder.
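A hedged sketch of the encoding scheme (names, and breaking frequency ties alphabetically, are assumptions): rank words by descending frequency and map out-of-vocabulary tokens to -1.

```scala
// Sketch of word-frequency encoding: the most frequent word gets index 0,
// the next gets 1, and so on; unseen words encode to -1.
object WordFrequencyEncoderSketch {
  def fit(corpus: Seq[String]): Map[String, Int] =
    corpus.groupBy(identity).toSeq
      .map { case (word, occurrences) => (word, occurrences.size) }
      .sortBy { case (word, count) => (-count, word) }  // descending frequency
      .map(_._1)
      .zipWithIndex
      .toMap
  def encode(vocab: Map[String, Int], tokens: Seq[String]): Seq[Int] =
    tokens.map(vocab.getOrElse(_, -1))                  // OOV -> -1
}
```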
Control flags used for NGramsCounts. Use Default if counts are to be aggregated across partitions and sorted; use NoAdd to just count ngrams within each partition, with no cross-partition summing or sorting.
Packs up to 3 words (trigrams) into a single Long by bit packing.
Assumptions: (1) |Vocab| <= one million (20 bits per word). (2) Words get mapped into [0, |Vocab|). In particular, each word ID < 2**20.
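A sketch of the 20-bit packing layout. One assumption beyond the text: word IDs are kept >= 1 in this sketch so that the ngram's order can be read back from the packed value (an ID of 0 in the leading position would be ambiguous); the real indexer may encode order differently.

```scala
// Sketch of bit-packing up to a trigram into one Long: 20 bits per word,
// most recent word in the lowest bits.
object PackedNGramSketch {
  val Bits = 20
  val Mask = (1L << Bits) - 1
  def pack(words: Seq[Int]): Long = {
    require(words.nonEmpty && words.length <= 3, "packs up to trigrams")
    require(words.forall(w => w >= 1 && w < (1 << Bits)), "each word ID fits in 20 bits")
    words.foldLeft(0L)((acc, w) => (acc << Bits) | w)
  }
  def order(packed: Long): Int =           // 1, 2 or 3
    if ((packed >>> (2 * Bits)) != 0) 3
    else if ((packed >>> Bits) != 0) 2
    else 1
  def unpack(packed: Long): Seq[Int] =
    ((order(packed) - 1) to 0 by -1).map(i => ((packed >>> (i * Bits)) & Mask).toInt)
}
```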
Transformer that trims a String of leading and trailing whitespace
A family of NGramIndexer that can unpack or strip off specific words, query the order of a packed ngram, etc.
Such indexers are useful for LMs that require backoff contexts (e.g. Stupid Backoff, KN).