Given bags of (key)words, computes most important 2-tuples and 3-tuples
Makes 4 co-location matrices for bigrams:
* m11(a, b): how often both terms "a" and "b" appear
* m10(a, b): "a" appears, "b" doesn't
* m01(a, b): "b" appears, "a" doesn't, so m10(a, b) = m01(b, a)
* m00(a, b): neither term appears
NB: the keys are Sets for the symmetric matrices (m11 and m00) and ordered tuples for the asymmetric one (m10).
Similarly for tri-grams there are m111, m110, m100, m000.
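A minimal sketch of how such bigram counts could be built, assuming each bag is a Set[String] (duplicates already removed). The object and method names here (CoOccurrenceSketch, bigramCounts, m00) are hypothetical and are not the library's API; the sketch only illustrates the Set-versus-tuple key convention described above.

```scala
object CoOccurrenceSketch {
  // Hypothetical helper: returns (m11, m10).
  // m11 is symmetric, so it is keyed by an unordered Set;
  // m10 is asymmetric, so it is keyed by an ordered tuple.
  def bigramCounts(
      bags: List[Set[String]]
  ): (Map[Set[String], Int], Map[(String, String), Int]) = {
    // m11: number of bags in which both terms appear
    val m11 = bags
      .flatMap(bag => bag.subsets(2))
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }

    // m10: number of bags in which the first term appears and the second does not
    val vocabulary = bags.flatten.toSet
    val m10 = (for {
      bag <- bags
      a <- bag.toList
      b <- (vocabulary -- bag).toList
    } yield (a, b))
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }

    (m11, m10)
  }

  // m00: number of bags in which neither term appears
  def m00(bags: List[Set[String]], a: String, b: String): Int =
    bags.count(bag => !bag.contains(a) && !bag.contains(b))
}
```

m01 does not need its own map in this sketch, since m01(a, b) can be read as m10(b, a).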
The matrices above are used to compute:
* llrSignificativeBigrams: significant bigrams according to the log-likelihood score
* binomialSignificativeBigrams: significant bigrams, measured by a higher-than-expected frequency of one word compared to the other
* trinomialSignificativeTrigrams:
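For reference, a log-likelihood score for a bigram can be computed from the four counts m11, m10, m01 and m00 with the standard G^2 statistic (Dunning's log-likelihood ratio). The sketch below is that textbook formula, not necessarily what llrSignificativeBigrams does internally; LlrSketch and its method names are hypothetical.

```scala
object LlrSketch {
  // x * ln(x), with the usual convention 0 * ln(0) = 0
  private def xLogX(x: Double): Double =
    if (x <= 0.0) 0.0 else x * math.log(x)

  // Unnormalised entropy term: N ln N - sum(k ln k)
  private def entropy(counts: Double*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // G^2 statistic for the 2x2 contingency table (m11, m10, m01, m00)
  def llr(m11: Double, m10: Double, m01: Double, m00: Double): Double = {
    val rowEntropy = entropy(m11 + m10, m01 + m00)
    val colEntropy = entropy(m11 + m01, m10 + m00)
    val matEntropy = entropy(m11, m10, m01, m00)
    // Clamp at zero to absorb small negative rounding errors
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
  }
}
```

A score near zero means the two terms co-occur about as often as independence would predict; larger values indicate a stronger association.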
Created by Mario Alemi on 12/04/2017 in Jutai, Amazonas, Brazil
Builds a Binomial prior. The number of successes should be an Int, but is typed as Double for more flexibility.
Created by Mario Alemi on 06/04/2017.
Trinomial distribution
Created by Mario Alemi on 16/04/2017 in Manaus, Amazonas, Brazil.
Given the occurrences of two words k1 and k2 in a sample of n bigrams, makes the expected relative frequencies:
* None/n = (1-P(1))*(1-P(2))
* JustOne/n = P(1)+P(2)-2*P(1)*P(2)
* Bigram/n = P(1)*P(2)
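A minimal sketch of these expected frequencies, assuming the two words occur independently and that P(1) = k1/n and P(2) = k2/n (this reading of P is an assumption). The "none" case is taken as (1-P(1))*(1-P(2)), so that the three frequencies add up to 1. ExpectedFrequenciesSketch and Expected are hypothetical names.

```scala
object ExpectedFrequenciesSketch {
  // Hypothetical container for the three expected relative frequencies
  final case class Expected(none: Double, justOne: Double, bigram: Double)

  def expected(k1: Double, k2: Double, n: Double): Expected = {
    require(n > 0, "n must be positive")
    val p1 = k1 / n // assumed: P(1) = k1 / n
    val p2 = k2 / n // assumed: P(2) = k2 / n
    Expected(
      none = (1 - p1) * (1 - p2),      // neither word appears
      justOne = p1 + p2 - 2 * p1 * p2, // exactly one of the two appears
      bigram = p1 * p2                 // both appear
    )
  }
}
```

For example, with k1 = 2, k2 = 4 and n = 10 this gives 0.48, 0.44 and 0.08.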
Ad-hoc tokenizer for our (private) test data.
Input: a string with the conversation in this format: """ "CLIENT: I want to renew a subscription...";"AGENT: Sure, tell me your name..."\n """
Output: List(List("CLIENT", "I want to renew a subscription..."), List("AGENT", "Sure, tell me your name..."))
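A minimal sketch of a tokenizer for this format, assuming turns are double-quoted, separated by a semicolon, and that the speaker label ends at the first colon. ConversationTokenizerSketch and tokenize are hypothetical names, and the sketch does not handle a message that itself contains the sequence ";" between quotes.

```scala
object ConversationTokenizerSketch {
  def tokenize(conversation: String): List[List[String]] =
    conversation.trim
      .split("\";\"") // turns are separated by ";" between the quotes
      .toList
      .map(_.stripPrefix("\"").stripSuffix("\"")) // drop the outer quotes
      .filter(_.nonEmpty)
      .map { turn =>
        // Split on the first colon only, so colons inside the message survive
        turn.split(":", 2).toList.map(_.trim)
      }
}
```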
Created by Mario Alemi on 06/04/2017.