Class

com.getjenny.manaus

KeywordsExtraction

Related Doc: package manaus

class KeywordsExtraction extends LazyLogging

Created by Mario Alemi on 07/04/2017 in El Estrecho, Putumayo, Peru

conversations: a List of Strings, where each element is a conversation.
tokenizer
priorOccurrences: a Map with the occurrences of words as given by external corpora (wiki etc.)

Example of usage:

    import scala.io.Source

    // Load the prior occurrences
    val wordColumn = 1
    val occurrenceColumn = 2
    val filePath = "/Users/mal/pCloud/Data/word_frequency.tsv"
    val priorOccurrences: Map[String, Int] = (for (line <- Source.fromFile(filePath).getLines)
      yield (line.split("\t")(wordColumn).toLowerCase -> line.split("\t")(occurrenceColumn).toInt))
      .toMap.withDefaultValue(0)

    // instantiate the Conversations
    val rawConversations = Source.fromFile("/Users/mal/pCloud/Scala/manaus/convs.head.csv").getLines.toList
    val conversations = new Conversations(rawConversations=rawConversations, tokenizer=tokenizer,
      priorOccurrences=priorOccurrences)
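
The loading step in the usage example can be tried without the author's local files. The following is a minimal sketch that parses tab-separated lines held in memory. The object name `PriorOccurrencesSketch`, the helper `parse`, and the sample rows are hypothetical and are not part of manaus. It assumes, as in the usage example, that column 1 holds the word and column 2 holds the occurrence count.

```scala
object PriorOccurrencesSketch {
  // Column indices as in the usage example (0-based).
  val wordColumn = 1
  val occurrenceColumn = 2

  // Hypothetical helper: builds a word -> occurrence map from TSV lines.
  // Lines with too few columns or a non-numeric count are skipped.
  def parse(lines: Iterator[String]): Map[String, Int] =
    lines
      .map(_.split("\t"))
      .filter(cols => cols.length > occurrenceColumn)
      .flatMap { cols =>
        scala.util.Try(cols(occurrenceColumn).trim.toInt).toOption
          .map(n => cols(wordColumn).toLowerCase -> n)
      }
      .toMap
      .withDefaultValue(0) // unseen words count as 0 occurrences

  // Made-up sample rows: rank, word, occurrences
  val sample: List[String] = List("1\tThe\t1000", "2\tof\t500", "malformed line")
  val priorOccurrences: Map[String, Int] = parse(sample.iterator)
}
```

Skipping malformed rows is a design choice of this sketch; the original one-liner would instead throw on a short or non-numeric row.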

Linear Supertypes
LazyLogging, AnyRef, Any

Instance Constructors

  1. new KeywordsExtraction(priorOccurrences: TokenOccurrence, observedOccurrences: TokenOccurrence)

    priorOccurrences

    Map with the occurrences of each word in a corpus other than the conversation log.

    observedOccurrences

    occurrences of terms in the observed vocabulary

Type Members

  1. class Sentence extends AnyRef

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  6. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  7. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  8. def extractBagsActive(activePotentialKeywordsMap: Map[String, Double], informativeKeywords: Stream[(List[String], List[(String, Double)])], misspellMaxOccurrence: Int = 5): Stream[(List[String], Map[String, Double])]

    Extracts the final keywords with active potential weighting.

    activePotentialKeywordsMap

    map of keywords weighted by active potential (see getWordsActivePotentialMap)

    informativeKeywords

    the list of informative keywords for each sentence

    misspellMaxOccurrence

    given a big enough sample, the occurrence threshold used to decide whether a token is considered a misspelling

    returns

    the final list of keywords for each sentence

  9. def extractBagsActiveForSentence(activePotentialKeywordsMap: Map[String, Double], informativeKeywords: (List[String], List[(String, Double)]), misspellMaxOccurrence: Int = 5): (List[String], Map[String, Double])

    Extracts the final keywords with active potential weighting for a single sentence.

    activePotentialKeywordsMap

    map of keywords weighted by active potential (see getWordsActivePotentialMap)

    informativeKeywords

    the list of informative keywords for a sentence

    misspellMaxOccurrence

    given a big enough sample, the occurrence threshold used to decide whether a token is considered a misspelling

    returns

    the final list of keywords for a sentence

  10. def extractBagsNoActive(informativeKeywords: Stream[(List[String], List[(String, Double)])], misspellMaxOccurrence: Int = 5): Stream[(List[String], Map[String, Double])]

    Extracts the final keywords without active potential weighting.

    informativeKeywords

    the list of informative keywords for each sentence

    misspellMaxOccurrence

    given a big enough sample, the occurrence threshold used to decide whether a token is considered a misspelling

    returns

    the final list of keywords for each sentence

  11. def extractBagsNoActiveForSentence(informativeKeywords: (List[String], List[(String, Double)]), misspellMaxOccurrence: Int = 5): (List[String], Map[String, Double])

    Extracts the final keywords without active potential weighting for a single sentence.

    informativeKeywords

    the list of informative keywords for a sentence

    misspellMaxOccurrence

    given a big enough sample, the occurrence threshold used to decide whether a token is considered a misspelling

    returns

    the final list of keywords for a sentence

  12. def extractInformativeWords(sentence: List[String], pruneSentence: Int = 100000, minWordsPerSentence: Int = 10, totalInformationNorm: Boolean): List[(String, Double)]

    Informative words. Because we want to check that keywords are correctly extracted, we will have tuples like (original words, keywords, bigrams...).

    sentence

    a sentence as a list of words

    pruneSentence

    a threshold on the number of terms that triggers pruning

    minWordsPerSentence

    the minimum number of words in each sentence

    returns

    the list of the most informative words for the sentence
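
To make the idea of an "informative word" concrete, here is a minimal sketch that scores each word by its self-information, -log2(p), under a prior occurrence map. This is a common textbook measure and an assumption made for illustration; this page does not say which scoring `extractInformativeWords` uses. The names `InformativeWordsSketch`, `selfInformation`, and `score`, as well as the add-one smoothing, are hypothetical.

```scala
object InformativeWordsSketch {
  // Self-information in bits of a word with the given count in a corpus
  // of `total` tokens over `vocabularySize` distinct words.
  // Add-one smoothing (an assumption of this sketch) keeps unseen words finite.
  def selfInformation(count: Int, total: Long, vocabularySize: Int): Double = {
    val p = (count + 1).toDouble / (total + vocabularySize + 1).toDouble
    -math.log(p) / math.log(2.0)
  }

  // Scores each distinct word of a sentence, rarest (most informative) first.
  def score(sentence: List[String], prior: Map[String, Int]): List[(String, Double)] = {
    val total = prior.values.map(_.toLong).sum
    sentence.distinct
      .map(w => w -> selfInformation(prior.getOrElse(w, 0), total, prior.size))
      .sortBy { case (_, info) => -info }
  }
}
```

The result has the same shape as the method's return type, `List[(String, Double)]`, so a rare word such as a product name ranks above a frequent word such as "the".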

  13. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  14. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  15. def getWordsActivePotentialMap(informativeKeywords: Stream[List[(String, Double)]], decay: Int = 10): Map[String, Double]

    Refined keywords list for a stream of sentences. Now we want to filter the important keywords: these are the ones that appear often enough not to surprise us anymore.

    informativeKeywords

    the list of informative words for each sentence

    returns

    the map of keywords weighted with active potential

  16. def getWordsActivePotentialMapForSentence(informativeKeywords: List[(String, Double)], decay: Int = 10): Map[String, Double]

    Refined keywords list for a single sentence. Now we want to filter the important keywords: these are the ones that appear often enough not to surprise us anymore.

    informativeKeywords

    the list of informative words for the sentence

    returns

    the map of keywords weighted with active potential

  17. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  18. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  19. lazy val logger: Logger

    Attributes
    protected
    Definition Classes
    LazyLogging
  20. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  21. final def notify(): Unit

    Definition Classes
    AnyRef
  22. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  23. def pruneSentence(sentence: List[String], minObservedNForPruning: Int = 100000, min_chars: Int = 2): List[String]

    Cleans a list of tokens, e.g. removes words with two letters and words that appear only once in the corpus (if this is big enough).

    sentence

    the list of tokens of the sentence

    minObservedNForPruning

    the minimum number of occurrences of the word in the corpus vocabulary

    min_chars

    the minimum number of characters for a token

    returns

    a cleaned list of tokens
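
The cleaning rule described for `pruneSentence` can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: it assumes that tokens of `minChars` characters or fewer are dropped, and that tokens seen only once are dropped only when the corpus holds at least `minCorpusSize` tokens in total. The object `PruneSketch`, the `observed` map, and both parameter names are hypothetical; they are only loosely modelled on `min_chars` and `minObservedNForPruning`, whose exact roles are not spelled out on this page.

```scala
object PruneSketch {
  // observed: hypothetical token -> occurrence map for the corpus.
  def prune(sentence: List[String],
            observed: Map[String, Int],
            minCorpusSize: Int = 100000,
            minChars: Int = 2): List[String] = {
    val corpusSize = observed.values.map(_.toLong).sum
    // Singletons are pruned only if the corpus is big enough (assumption).
    val corpusIsBigEnough = corpusSize >= minCorpusSize
    sentence.filter { token =>
      val longEnough = token.length > minChars
      val notSingleton = !corpusIsBigEnough || observed.getOrElse(token, 0) > 1
      longEnough && notSingleton
    }
  }
}
```

With a small corpus the singleton rule is switched off, so a rare but valid word is not discarded just because little data has been seen.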

  24. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  25. def toString(): String

    Definition Classes
    AnyRef → Any
  26. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  27. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  28. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
