KeywordsExtraction

Created by Mario Alemi on 07/04/2017 in El Estrecho, Putumayo, Peru

conversations: a List of Strings, where each element is a conversation.
tokenizer
priorOccurrences: a Map with the occurrences of words as given by external corpora (wiki, etc.).

Example of usage:

import scala.io.Source

// Load the prior occurrences
val wordColumn = 1
val occurrenceColumn = 2
val filePath = "/Users/mal/pCloud/Data/word_frequency.tsv"
val priorOccurrences: Map[String, Int] =
  (for (line <- Source.fromFile(filePath).getLines)
    yield (line.split("\t")(wordColumn).toLowerCase -> line.split("\t")(occurrenceColumn).toInt))
    .toMap.withDefaultValue(0)

// Instantiate the Conversations
val rawConversations = Source.fromFile("/Users/mal/pCloud/Scala/manaus/convs.head.csv").getLines.toList
val conversations = new Conversations(rawConversations=rawConversations, tokenizer=tokenizer, priorOccurrences=priorOccurrences)
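The loading step in the usage example splits every line twice and throws on a row with too few columns or a non-numeric count. A more defensive variant could look like the sketch below. The names PriorOccurrencesSketch and parse are hypothetical and not part of manaus, and skipping malformed rows is an assumption of this sketch rather than documented behaviour.

```scala
import scala.util.Try

// Hypothetical helper, not part of the manaus API.
object PriorOccurrencesSketch {
  // 0-based column indices, as in the usage example.
  val wordColumn = 1
  val occurrenceColumn = 2

  /** Parses tab-separated lines into a lowercase word -> occurrences map.
    * Rows with too few columns or a non-numeric count are skipped
    * (an assumption of this sketch). Unknown words default to 0. */
  def parse(lines: Iterator[String]): Map[String, Int] =
    lines
      .map(_.split("\t"))                                        // split each line once
      .filter(_.length > math.max(wordColumn, occurrenceColumn)) // drop short rows
      .flatMap { cols =>
        Try(cols(occurrenceColumn).trim.toInt).toOption          // drop non-numeric counts
          .map(count => cols(wordColumn).toLowerCase -> count)
      }
      .toMap
      .withDefaultValue(0)
}
```

With a file on disk, the call would be PriorOccurrencesSketch.parse(Source.fromFile(filePath).getLines()).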

Linear Supertypes

LazyLogging, AnyRef, Any

Instance Constructors

new KeywordsExtraction(priorOccurrences: TokenOccurrence, observedOccurrences: TokenOccurrence)

priorOccurrences
Map with the occurrences of each word in a corpus different from the conversation log.
observedOccurrences
occurrences of terms in the observed vocabulary

Type Members

class Sentence extends AnyRef

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0

Definition Classes
Any
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def extractBagsActive(activePotentialKeywordsMap: Map[String, Double], informativeKeywords: Stream[(List[String], List[(String, Double)])], misspellMaxOccurrence: Int = 5): Stream[(List[String], Map[String, Double])]

extract the final keywords with active potential weighting
activePotentialKeywordsMap
map of keywords weighted by active potential (see getWordsActivePotentialMap)
informativeKeywords
the list of informative keywords for each sentence
misspellMaxOccurrence
given a big enough sample, the maximum number of occurrences at which a token is still considered a misspelling
returns
the final list of keywords for each sentence
def extractBagsActiveForSentence(activePotentialKeywordsMap: Map[String, Double], informativeKeywords: (List[String], List[(String, Double)]), misspellMaxOccurrence: Int = 5): (List[String], Map[String, Double])

extract the final keywords with active potential weighting
activePotentialKeywordsMap
map of keywords weighted by active potential (see getWordsActivePotentialMap)
informativeKeywords
the list of informative keywords for a sentence
misspellMaxOccurrence
given a big enough sample, the maximum number of occurrences at which a token is still considered a misspelling
returns
the final list of keywords for a sentence
def extractBagsNoActive(informativeKeywords: Stream[(List[String], List[(String, Double)])], misspellMaxOccurrence: Int = 5): Stream[(List[String], Map[String, Double])]

extract the final keywords without active potential weighting
informativeKeywords
the list of informative keywords for each sentence
misspellMaxOccurrence
given a big enough sample, the maximum number of occurrences at which a token is still considered a misspelling
returns
the final list of keywords for each sentence
def extractBagsNoActiveForSentence(informativeKeywords: (List[String], List[(String, Double)]), misspellMaxOccurrence: Int = 5): (List[String], Map[String, Double])

extract the final keywords without active potential weighting for a single sentence
informativeKeywords
the list of informative keywords for a sentence
misspellMaxOccurrence
given a big enough sample, the maximum number of occurrences at which a token is still considered a misspelling
returns
the final list of keywords for a sentence
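To make the input and output shapes of the extractBags methods concrete, here is a minimal stand-in for the per-sentence case. It is not the manaus implementation: the object BagsSketch, the function bagForSentence and the explicit observed map are hypothetical, and the rule "drop a keyword when its observed count is at most misspellMaxOccurrence" is only one reading of the parameter description.

```scala
// Hypothetical stand-in, not the manaus implementation.
object BagsSketch {
  // (original tokens, informative keywords with their scores)
  type Informative = (List[String], List[(String, Double)])
  // (original tokens, final keywords as a word -> score map)
  type Bag = (List[String], Map[String, Double])

  /** Keeps only keywords observed more than misspellMaxOccurrence times
    * (assumed rule) and turns the surviving list into a map. */
  def bagForSentence(informative: Informative,
                     observed: Map[String, Int],
                     misspellMaxOccurrence: Int = 5): Bag = {
    val (tokens, keywords) = informative
    val kept = keywords.filter { case (word, _) =>
      observed.getOrElse(word, 0) > misspellMaxOccurrence
    }
    (tokens, kept.toMap)
  }
}
```

Mapping bagForSentence over a collection of sentences gives the shape returned by the stream variants, one (tokens, keywords) pair per sentence.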
def extractInformativeWords(sentence: List[String], pruneSentence: Int = 100000, minWordsPerSentence: Int = 10, totalInformationNorm: Boolean): List[(String, Double)]

Extracts the informative words. Because we want to check that keywords are correctly extracted, we will have tuples like (original words, keywords, bigrams...).
sentence
a sentence as a list of words
pruneSentence
a threshold on the number of terms that triggers pruning
minWordsPerSentence
the minimum number of words in each sentence
returns
the list of the most informative words for the sentence
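The page does not say how informativeness is scored. As an illustration of the List[(String, Double)] result only, the sketch below swaps in a standard technique, self-information in bits computed from an occurrence map with add-one smoothing. The object InformationSketch and the function rank are hypothetical names, and this scoring is an assumption, not the method used by extractInformativeWords.

```scala
// Hypothetical illustration; the scoring used by manaus is not documented here.
object InformationSketch {
  /** Ranks the distinct words of a sentence by self-information,
    * -log2(p), where p is the add-one-smoothed relative frequency
    * of the word in the given occurrence map. Rarer words score higher. */
  def rank(sentence: List[String],
           occurrences: Map[String, Int]): List[(String, Double)] = {
    val total = occurrences.values.map(_.toLong).sum
    // One extra slot in the denominator stands for all unseen words.
    val denominator = total + occurrences.size + 1.0

    def bits(word: String): Double = {
      val p = (occurrences.getOrElse(word, 0) + 1.0) / denominator
      -math.log(p) / math.log(2.0)
    }

    sentence.distinct
      .map(word => word -> bits(word))
      .sortBy { case (_, score) => -score } // most informative first
  }
}
```

In this sketch a frequent word such as "the" ends up below a rarer word such as "password", which is the ordering one would expect from any informativeness score.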
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def getWordsActivePotentialMap(informativeKeywords: Stream[List[(String, Double)]], decay: Int = 10): Map[String, Double]

Refined keywords list for a stream of sentences. Now we want to filter the important keywords: these are the ones that appear often enough not to surprise us anymore.
informativeKeywords
the list of informative words for each sentence
returns
the map of keywords weighted with active potential
def getWordsActivePotentialMapForSentence(informativeKeywords: List[(String, Double)], decay: Int = 10): Map[String, Double]

Refined keywords list for a single sentence. Now we want to filter the important keywords: these are the ones that appear often enough not to surprise us anymore.
informativeKeywords
the list of informative words for the sentence
returns
the map of keywords weighted with active potential
def hashCode(): Int

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
lazy val logger: Logger

Attributes
protected
Definition Classes
LazyLogging
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def pruneSentence(sentence: List[String], minObservedNForPruning: Int = 100000, min_chars: Int = 2): List[String]

Cleans a list of tokens, e.g. no words with two letters, and no words which appear only once in the corpus (if this is big enough).
sentence
the list of tokens of the sentence
minObservedNForPruning
the minimum number of occurrences of the word in the corpus vocabulary
min_chars
the minimum number of characters for a token
returns
a cleaned list of tokens
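The cleaning rules described for pruneSentence can be sketched as a plain filter. This is one possible reading, not the manaus code: PruneSketch, prune and the explicit observed map are hypothetical. The sketch assumes that tokens must be strictly longer than minChars, and that words seen only once are dropped only when the observed corpus holds at least minObservedNForPruning tokens.

```scala
// Hypothetical reading of the pruning rules, not the manaus implementation.
object PruneSketch {
  def prune(sentence: List[String],
            observed: Map[String, Int],
            minObservedNForPruning: Int = 100000,
            minChars: Int = 2): List[String] = {
    // Total number of observed tokens; Long avoids overflow on large corpora.
    val corpusSize = observed.values.map(_.toLong).sum
    val corpusIsBigEnough = corpusSize >= minObservedNForPruning

    sentence.filter { token =>
      // Assumed rule: with minChars = 2, two-letter words are dropped.
      val longEnough = token.length > minChars
      // Assumed rule: single-occurrence words are dropped only in a big corpus.
      val seenMoreThanOnce = !corpusIsBigEnough || observed.getOrElse(token, 0) > 1
      longEnough && seenMoreThanOnce
    }
  }
}
```

With the default threshold, a small corpus never triggers the single-occurrence rule, so only the length rule applies.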
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
AnyRef → Any
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )

Related Doc: package manaus

class KeywordsExtraction extends LazyLogging
