Here we consider "A" more frequent than "B". We then treat occurrence(A) as the samples and the co-occurrence A && B as the successes.
This gives an approximate result for binomialSignificativeBigrams.
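A minimal sketch of this approximation, with assumed names and signatures (not the repository's actual API): the bags containing A are the trials, the bags containing both A and B the successes, and P(B) the per-trial success probability; the surprise is the upper-tail binomial probability, computed in log space for stability.

```scala
object BinomialSurprise {
  // log C(n, k), computed incrementally to avoid overflow
  private def logChoose(n: Int, k: Int): Double =
    (1 to k).map(i => math.log(n - k + i) - math.log(i)).sum

  /** Log of P(X >= successes) for X ~ Binomial(trials, p); assumes 0 < p < 1. */
  def logSurprise(trials: Int, successes: Int, p: Double): Double = {
    val logTerms = (successes to trials).map { k =>
      logChoose(trials, k) + k * math.log(p) + (trials - k) * math.log(1 - p)
    }
    val m = logTerms.max // log-sum-exp over the upper tail, for stability
    m + math.log(logTerms.map(t => math.exp(t - m)).sum)
  }
}
```

For a bigram (A, B) this would be called as logSurprise(occurrence(A), m11(A, B), P(B)), where all three argument names are the hypothetical ones above.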
Here we compute the surprise in a more rigorous way. Consider the bigram (A, B). We construct three events:
None = (1 - P(A)) * (1 - P(B)), JustOne = P(A) + P(B) - 2 * P(A) * P(B), Bigram = P(A) * P(B).
The three probabilities sum to 1, so they define a trinomial distribution.
Given the total number of bigrams (nBigrams), we can then match "None" with m00, "JustOne" with m10 + m01, and "Bigram" with m11.
We then compute the surprise for that number of bigrams, as sketched below.
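As a sketch (the object and method names here are assumptions), the surprise of observing the split (m00, m10 + m01, m11) over nBigrams bags can be scored as the log of the trinomial, i.e. three-cell multinomial, pmf:

```scala
object TrinomialSurprise {
  private def logFactorial(n: Int): Double = (2 to n).map(i => math.log(i)).sum

  /** Log of the multinomial pmf at `counts`; assumes strictly positive probs. */
  def logPmf(counts: Seq[Int], probs: Seq[Double]): Double = {
    require(counts.size == probs.size, "one probability per event")
    logFactorial(counts.sum) - counts.map(logFactorial).sum +
      counts.zip(probs).map { case (k, p) => k * math.log(p) }.sum
  }
}
```

Here the call would be logPmf(Seq(m00, m10 + m01, m11), Seq(pNone, pJustOne, pBigram)), with the three event probabilities defined as above.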
The LLR (log-likelihood ratio) score for bigrams
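Assuming the score follows Dunning's standard log-likelihood ratio (G²) over the 2×2 contingency table formed by the four matrices described below, a sketch (the repository's exact implementation may differ):

```scala
object LlrScore {
  private def xLogX(x: Long): Double =
    if (x == 0) 0.0 else x * math.log(x.toDouble)

  // "entropy" in the unnormalized form used by Dunning's G^2
  private def entropy(ks: Long*): Double = xLogX(ks.sum) - ks.map(xLogX).sum

  /** G^2 for the table (m11, m10 / m01, m00); larger = more surprising. */
  def llr(m11: Long, m10: Long, m01: Long, m00: Long): Double = {
    val rowEntropy    = entropy(m11 + m10, m01 + m00)
    val columnEntropy = entropy(m11 + m01, m10 + m00)
    val matrixEntropy = entropy(m11, m10, m01, m00)
    math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
  }
}
```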
How often does the first word appear while the second doesn't, across all bags? NB: the keys of the map here are not Set2 but Tuple2. Of course, m10(a, b) = m01(b, a).
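Since a bag contains "a" without "b" exactly when it contains "a" but not both, m10 can be derived from the unigram counts and m11. A sketch with assumed types (unigram counts keyed by word, m11 keyed by two-element Sets):

```scala
def m10(counts: Map[String, Int],
        m11: Map[Set[String], Int]): Map[(String, String), Int] =
  m11.iterator.flatMap { case (pair, both) =>
    val List(a, b) = pair.toList
    // m10(a, b) = bags containing "a" minus bags containing both; symmetrically for m01
    List((a, b) -> (counts(a) - both), (b, a) -> (counts(b) - both))
  }.toMap
```

This enumerates only pairs that co-occur at least once; for a pair that never co-occurs, m10(a, b) is simply the unigram count of "a".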
Given bags of (key)words, computes the most important 2-tuples and 3-tuples.
Makes four collocation matrices for bigrams:
* m11(a, b): how often both terms "a" and "b" appear
* m10(a, b): "a" appears, "b" doesn't
* m01(a, b) = m10(b, a): "b" appears, "a" doesn't
* m00(a, b): neither appears
NB: the keys are sometimes Sets (m11 and m00), sometimes Tuple2s (m10 and m01).
Similarly, for trigrams there are m111, m110, m100, and m000.
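A sketch of how m11 and m00 could be built (types and signatures are assumptions): m11 by enumerating the two-element subsets of each bag, m00 by inclusion-exclusion from the number of bags and the unigram counts.

```scala
def m11(bags: Seq[Set[String]]): Map[Set[String], Int] =
  bags.flatMap(_.subsets(2)) // every two-element subset of every bag
    .groupBy(identity).view.mapValues(_.size).toMap

def m00(nBags: Int, counts: Map[String, Int],
        m11: Map[Set[String], Int]): Map[Set[String], Int] =
  m11.map { case (pair, both) =>
    val List(a, b) = pair.toList
    // inclusion-exclusion: neither = total - with(a) - with(b) + with(both)
    pair -> (nBags - counts(a) - counts(b) + both)
  }
```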
The matrices above are used to compute:
* llrSignificativeBigrams: significant bigrams according to the log-likelihood ratio (LLR) score
* binomialSignificativeBigrams: significant bigrams, measuring a higher-than-expected frequency of one word compared to the other
* trinomialSignificativeTrigrams: significant trigrams, via the three-event (trinomial) test described above
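A hypothetical usage sketch; only the three method names come from this documentation, while the class name and constructor are assumptions:

```scala
val bags: Seq[Set[String]] = Seq(
  Set("deep", "learning", "network"),
  Set("deep", "learning", "gradient"),
  Set("river", "boat", "forest")
)
val col = new Collocations(bags) // assumed class name and constructor
col.llrSignificativeBigrams.take(10).foreach(println)
col.binomialSignificativeBigrams.take(10).foreach(println)
col.trinomialSignificativeTrigrams.take(10).foreach(println)
```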
Created by Mario Alemi on 12/04/2017 in Jutai, Amazonas, Brazil