Map of occurrences for each word, taken from a corpus other than the conversation log.
occurrences of terms in the observed vocabulary
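As a rough illustration of what "occurrences of terms in the observed vocabulary" could look like, the sketch below counts tokens over already-tokenized sentences. The names VocabularySketch and observedOccurrences are hypothetical and not part of the documented API; the actual class may compute this differently.

```scala
object VocabularySketch {
  // Hypothetical helper (not part of the documented API):
  // count how many times each token appears across all tokenized sentences.
  def observedOccurrences(sentences: List[List[String]]): Map[String, Int] =
    sentences.flatten
      .groupBy(identity)
      .map { case (word, hits) => word -> hits.size }
}
```

A map built this way can be compared against the prior occurrences from the external corpus, where unseen words default to 0.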
extract the final keywords with active potential weighting
map of keywords weighted by active potential (see getWordsActivePotentialMap)
the list of informative keywords for each sentence
given a large enough sample, the minimum frequency below which a token is considered a misspelling
the final list of keywords for each sentence
extract the final keywords with active potential weighting
map of keywords weighted by active potential (see getWordsActivePotentialMap)
the list of informative keywords for a sentence
given a large enough sample, the minimum frequency below which a token is considered a misspelling
the final list of keywords for a sentence
extract the final keywords without active potential weighting
the list of informative keywords for each sentence
given a large enough sample, the minimum frequency below which a token is considered a misspelling
the final list of keywords for each sentence
extract the final keywords without active potential weighting for a single sentence
the list of informative keywords for a sentence
given a large enough sample, the minimum frequency below which a token is considered a misspelling
the final list of keywords for a sentence
Informative words. Because we want to check that keywords are correctly extracted, the result contains tuples like (original words, keywords, bigrams...).
a sentence as a list of words
a threshold on the number of terms that triggers pruning
the minimum number of words in each sentence
the list of most informative words for each sentence
Refined keywords list for a stream of sentences. Now we want to filter the important keywords. These are the ones that appear often enough not to surprise us anymore.
the list of informative words for each sentence
the map of keywords weighted with active potential
Refined keywords list for a single sentence. Now we want to filter the important keywords. These are the ones that appear often enough not to surprise us anymore.
the list of informative words for the sentence
the map of keywords weighted with active potential
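The refinement step takes the informative words of a sentence and a map of keywords weighted with active potential. The sketch below shows one plausible reading, assuming that refinement means keeping only the words that have an entry in the weight map and pairing each with its weight. The names KeywordRefinementSketch and refine are hypothetical, and the actual method may apply further criteria.

```scala
object KeywordRefinementSketch {
  // Hypothetical helper (not part of the documented API):
  // keep only the words that have an entry in the weight map,
  // preserving sentence order and pairing each word with its weight.
  def refine(informativeWords: List[String],
             weights: Map[String, Double]): List[(String, Double)] =
    informativeWords.flatMap(word => weights.get(word).map(weight => word -> weight))
}
```

For a stream of sentences, the same function could be mapped over each sentence's list of informative words.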
Clean a list of tokens, e.g. remove words with two letters and words which appear only once in the corpus (if this is big enough).
the list of tokens of the sentence
the minimum number of occurrences of the word in the corpus vocabulary
the minimum number of characters for a token
a cleaned list of tokens
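A minimal sketch of such a cleaning step, assuming it is a plain filter on token length and on corpus occurrences. The names TokenCleaningSketch, cleanTokens, minOccurrences and minChars are hypothetical, the default values are illustrative only, and the sketch does not check whether the corpus is "big enough".

```scala
object TokenCleaningSketch {
  // Hypothetical helper (not part of the documented API):
  // drop tokens that are too short or too rare in the corpus vocabulary.
  def cleanTokens(tokens: List[String],
                  occurrences: Map[String, Int],
                  minOccurrences: Int = 2, // assumed default: drops words seen only once
                  minChars: Int = 3        // assumed default: drops two-letter words
                 ): List[String] =
    tokens.filter { token =>
      token.length >= minChars &&
        occurrences.getOrElse(token, 0) >= minOccurrences
    }
}
```

Tokens missing from the occurrence map are treated as having zero occurrences, so they are removed whenever minOccurrences is positive.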
Created by Mario Alemi on 07/04/2017 in El Estrecho, Putumayo, Peru
conversations: A List of Strings, where each element is a conversation.
tokenizer
priorOccurrences: A Map with occurrences of words as given by external corpora (wiki etc.)
Example of usage:
import scala.io.Source

// Load the prior occurrences
val wordColumn = 1
val occurrenceColumn = 2
val filePath = "/Users/mal/pCloud/Data/word_frequency.tsv"
val priorOccurrences: Map[String, Int] = (for (line <- Source.fromFile(filePath).getLines)
  yield (line.split("\t")(wordColumn).toLowerCase -> line.split("\t")(occurrenceColumn).toInt))
  .toMap.withDefaultValue(0)

// instantiate the Conversations
val rawConversations = Source.fromFile("/Users/mal/pCloud/Scala/manaus/convs.head.csv").getLines.toList
val conversations = new Conversations(rawConversations=rawConversations, tokenizer=tokenizer, priorOccurrences=priorOccurrences)