Package keystoneml.nodes

package nlp


Type Members

  1. trait BackoffIndexer[WordType, NGramType] extends NGramIndexer[WordType, NGramType]

    A family of NGramIndexer that can unpack or strip off specific words, query the order of a packed ngram, etc.

    Such indexers are useful for language models that require backoff contexts (e.g. Stupid Backoff, Kneser-Ney).

  2. case class CoreNLPFeatureExtractor(orders: Seq[Int]) extends Transformer[String, Seq[String]] with Product with Serializable

    Transformer that uses CoreNLP to, in order:
      - Tokenize the document
      - Lemmatize tokens
      - Replace entities with their type (e.g. "Jon" => "NAME", "Paris" => "PLACE")
      - Return n-grams for the above (respecting sentence boundaries)

    Note: much slower than just using Tokenizer followed by NGramsFeaturizer.

    orders

    The size of the n-grams to output

  3. case class HashingTF[T <: Seq[Any]](numFeatures: Int) extends Transformer[T, SparseVector[Double]] with Product with Serializable

    Converts a sequence of terms to a sparse vector representing their frequencies, using the hashing trick: https://en.wikipedia.org/wiki/Feature_hashing

    Terms are hashed using Scala's .## method. We may want to convert to MurmurHash3 for strings, as discussed for Spark's ML Pipelines in https://issues.apache.org/jira/browse/SPARK-10574

    numFeatures

    The desired feature space to convert to using the hashing trick.
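    The hashing trick can be sketched in a few lines of standalone Scala. This is an illustrative version only (the object name `HashingTFSketch` is hypothetical, the real transformer returns a Breeze SparseVector[Double], and the non-negative-modulo handling here is an assumption of the sketch):

```scala
// Sketch of the hashing trick: hash each term with .## and fold it into a
// fixed-size feature space; a Map stands in for the sparse vector.
object HashingTFSketch {
  def transform(terms: Seq[Any], numFeatures: Int): Map[Int, Double] =
    terms.foldLeft(Map.empty[Int, Double]) { (acc, term) =>
      // Force a non-negative index, since .## may be negative.
      val idx = ((term.## % numFeatures) + numFeatures) % numFeatures
      acc.updated(idx, acc.getOrElse(idx, 0.0) + 1.0)
    }
}
```

    Distinct terms may collide in the same bucket; that is the accepted trade-off of fixing the feature dimension up front.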

  4. class InitialBigramPartitioner[WordType] extends Partitioner

    Partitions each ngram by hashing on its first two words (first as in farthest away from the current word), then taking the result mod numPartitions.

    Useful for grouping ngrams that share the first two words in context. An example usage is the StupidBackoffEstimator.

    WordType

    type of each word (e.g. Int or String)
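    The grouping property can be sketched as follows (an illustrative standalone version; the object name is hypothetical and the real partitioner implements Spark's Partitioner over NGram keys):

```scala
object BigramPartitionSketch {
  // Hash only the leading two words, so ngrams sharing that context
  // land in the same partition.
  def partition(ngram: Seq[String], numPartitions: Int): Int = {
    val h = ngram.take(2).##
    ((h % numPartitions) + numPartitions) % numPartitions // non-negative mod
  }
}
```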

  5. case class LowerCase(locale: Locale = Locale.getDefault) extends Transformer[String, String] with Product with Serializable

    Transformer that converts a String to lower case.

    locale

    The locale to use. Defaults to Locale.getDefault

  6. class NGram[T] extends Serializable

    An NGram representation that is a thin wrapper over Array[String]. The underlying tokens can be accessed via words.

    Its hashCode and equals implementations are sane so that it can be used as keys in RDDs or hash tables.
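    The motivation is that plain Arrays compare by reference in Scala, so they cannot serve directly as RDD or hash-table keys. A minimal sketch of such a wrapper (hypothetical; not the KeystoneML implementation):

```scala
// Minimal wrapper with structural equality over an Array of tokens.
class NGramSketch[T](val words: Array[T]) extends Serializable {
  override def hashCode: Int =
    java.util.Arrays.hashCode(words.map(_.asInstanceOf[AnyRef]))
  override def equals(other: Any): Boolean = other match {
    case that: NGramSketch[_] => words.sameElements(that.words)
    case _                    => false
  }
}
```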

  7. trait NGramIndexer[WordType, NGramType] extends Serializable

  8. class NGramIndexerImpl[T] extends BackoffIndexer[T, NGram[T]]

  9. case class NGramsCounts[T](mode: NGramsCountsMode.Value = NGramsCountsMode.Default)(implicit evidence$3: ClassTag[T]) extends FunctionNode[RDD[Seq[Seq[T]]], RDD[(NGram[T], Int)]] with Product with Serializable

    A simple transformer that represents each ngram as an NGram and counts their occurrences. Returns an RDD[(NGram, Int)] that is sorted by frequency in descending order.

    This implementation may not be space-efficient, but should handle commonly-sized workloads well.

    mode

    a control flag defined in NGramsCountsMode
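    The count-and-sort behavior can be sketched locally (an illustrative, non-distributed stand-in; the object name is hypothetical and the real node operates on RDDs):

```scala
object NGramsCountsSketch {
  // Count each distinct ngram, then sort by descending frequency.
  def count[T](ngrams: Seq[Seq[T]]): Seq[(Seq[T], Int)] =
    ngrams.groupBy(identity).map { case (ng, occ) => (ng, occ.size) }
      .toSeq.sortBy(-_._2)
}
```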

  10. case class NGramsFeaturizer[T](orders: Seq[Int])(implicit evidence$1: ClassTag[T]) extends Transformer[Seq[T], Seq[Seq[T]]] with Product with Serializable

    An ngram featurizer.

    orders

    valid ngram orders, must be consecutive positive integers
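    The extraction itself amounts to taking every contiguous window of each requested order, which can be sketched with the standard library's sliding (illustrative only; the name is hypothetical):

```scala
object NGramsFeaturizerSketch {
  // For each requested order n, emit every contiguous window of length n;
  // the length filter drops the short window a too-small input would yield.
  def nGrams[T](tokens: Seq[T], orders: Seq[Int]): Seq[Seq[T]] =
    orders.flatMap(n => tokens.sliding(n).filter(_.length == n).toSeq)
}
```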

  11. case class NGramsHashingTF(orders: Seq[Int], numFeatures: Int) extends Transformer[Seq[String], SparseVector[Double]] with Product with Serializable

    Converts the n-grams of a sequence of terms to a sparse vector representing their frequencies, using the hashing trick: https://en.wikipedia.org/wiki/Feature_hashing

    It computes a rolling MurmurHash3 instead of fully constructing the n-grams, making it more efficient than using NGramsFeaturizer followed by HashingTF, although it should return the exact same feature vector. The MurmurHash3 methods are copied from scala.util.hashing.MurmurHash3

    Individual terms are hashed using Scala's .## method. We may want to convert to MurmurHash3 for strings, as discussed for Spark's ML Pipelines in https://issues.apache.org/jira/browse/SPARK-10574

    orders

    valid ngram orders, must be consecutive positive integers

    numFeatures

    The desired feature space to convert to using the hashing trick.

  12. case class StupidBackoffEstimator[T](unigramCounts: Map[T, Int], alpha: Double = 0.4)(implicit evidence$3: ClassTag[T]) extends Estimator[(NGram[T], Int), (NGram[T], Double)] with Product with Serializable

    Estimates a Stupid Backoff ngram language model, which was introduced in the following paper:

    Brants, Thorsten, et al. "Large language models in machine translation." 2007.

    The results are scores indicating the likelihood of each ngram; they are not normalized probabilities. The score for an n-gram is defined recursively:

      S(w_i | w_{i-n+1}^{i-1}) :=
        freq(w_{i-n+1}^{i}) / freq(w_{i-n+1}^{i-1})   if freq(w_{i-n+1}^{i}) > 0
        alpha * S(w_i | w_{i-n+2}^{i-1})              otherwise

      S(w_i) := freq(w_i) / N, where N is the total number of tokens in the training corpus.

    unigramCounts

    the pre-computed unigram counts of the training corpus

    alpha

    hyperparameter that gets multiplied once per backoff
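    The recursion above can be sketched directly (an illustrative standalone version; the name is hypothetical, and unlike the real estimator, which works over RDDs of (NGram, count) pairs, this takes a plain count map covering all orders):

```scala
object StupidBackoffSketch {
  // counts maps an ngram of any order (including unigrams) to its frequency;
  // numTokens is N, the total token count of the training corpus.
  def score(ngram: Seq[String], counts: Map[Seq[String], Int],
            numTokens: Long, alpha: Double = 0.4): Double =
    if (ngram.length == 1) counts.getOrElse(ngram, 0).toDouble / numTokens
    else {
      val num   = counts.getOrElse(ngram, 0)
      val denom = counts.getOrElse(ngram.init, 0)
      if (num > 0 && denom > 0) num.toDouble / denom
      else alpha * score(ngram.tail, counts, numTokens, alpha) // back off
    }
}
```

    Note how alpha is multiplied in once per backoff step, matching the role of the alpha hyperparameter above.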

  13. class StupidBackoffModel[T] extends Transformer[(NGram[T], Int), (NGram[T], Double)]

  14. case class Tokenizer(sep: String = "[\\p{Punct}\\s]+") extends Transformer[String, Seq[String]] with Product with Serializable

    Transformer that tokenizes a String into a Seq[String] by splitting on a regular expression.

    sep

    the delimiting regular expression to split on. Defaults to matching all punctuation and whitespace
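    A standalone sketch of this splitting behavior (illustrative; the object name is hypothetical, and dropping empty leading fragments is an assumption of the sketch rather than documented behavior):

```scala
object TokenizerSketch {
  // Split on runs of punctuation and whitespace, the default pattern above.
  def tokenize(s: String, sep: String = "[\\p{Punct}\\s]+"): Seq[String] =
    s.split(sep).toSeq.filter(_.nonEmpty)
}
```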

  15. class WordFrequencyTransformer extends Transformer[Seq[String], Seq[Int]]

    Encodes string tokens as non-negative integers, which are indices of the tokens' positions in the sorted-by-frequency order. Out-of-vocabulary words are mapped to the special index -1.

    The parameters passed to this class are usually calculated by WordFrequencyEncoder.
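    The encoding can be sketched from precomputed counts (illustrative only; the object name is hypothetical and tie-breaking between equal counts is unspecified here):

```scala
object WordFreqSketch {
  // Build a word -> rank index from counts (most frequent = index 0);
  // out-of-vocabulary tokens map to -1.
  def encoder(counts: Map[String, Int]): Seq[String] => Seq[Int] = {
    val index = counts.toSeq.sortBy(-_._2).map(_._1).zipWithIndex.toMap
    tokens => tokens.map(index.getOrElse(_, -1))
  }
}
```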

Value Members

  1. object NGramsCountsMode extends Enumeration

    Control flags used for NGramsCounts. Use Default if counts are to be aggregated across partitions and sorted; use NoAdd to just count ngrams within each partition, with no cross-partition summing or sorting.

  2. object NaiveBitPackIndexer extends BackoffIndexer[Int, Long]

    Packs up to 3 words (trigrams) into a single Long by bit packing.

    Assumptions: (1) |Vocab| <= one million (20 bits per word). (2) Words get mapped into [0, |Vocab|). In particular, each word ID < 2**20.
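    With 20 bits per word, three word IDs fit in a 64-bit Long. A sketch of the idea (the bit layout and order encoding chosen here are assumptions for illustration; the real NaiveBitPackIndexer's layout may differ):

```scala
object BitPackSketch {
  // Pack three 20-bit word IDs into one Long, 20 bits apart.
  def pack3(w1: Int, w2: Int, w3: Int): Long = {
    require(Seq(w1, w2, w3).forall(w => w >= 0 && w < (1 << 20)))
    (w1.toLong << 40) | (w2.toLong << 20) | w3.toLong
  }
  // Recover the three IDs by shifting and masking the low 20 bits.
  def unpack(packed: Long): (Int, Int, Int) =
    (((packed >> 40) & 0xFFFFF).toInt,
     ((packed >> 20) & 0xFFFFF).toInt,
     (packed & 0xFFFFF).toInt)
}
```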

  3. object Trim extends Transformer[String, String]


    Transformer that trims a String of leading and trailing whitespace

  4. object WordFrequencyEncoder extends Estimator[Seq[String], Seq[Int]]

