Created by eugeny.malyutin on 06.05.16.
Transformer that computes term frequencies over a text corpus, counting each term at most once per document. Example: ([T1 T1 T2], [T2]) -> {T1 -> 1/3, T2 -> 2/3}
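The example above can be reproduced with a short plain-Python sketch. It is independent of Spark and is not the transformer's actual code. The function name `distinct_term_freq` is hypothetical, and the normalization (dividing by the total number of distinct per-document term occurrences) is an assumption inferred from the example, since the two documents yield 1/3 and 2/3 rather than 1/2 and 2/2.

```python
from collections import Counter
from fractions import Fraction


def distinct_term_freq(docs):
    """Frequency of each term, counting a term at most once per document.

    Assumption: frequencies are normalized by the total number of distinct
    (document, term) pairs, which matches the example in the description.
    """
    counts = Counter()
    for doc in docs:
        # set() drops repeats inside one document
        counts.update(set(doc))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: Fraction(count, total) for term, count in counts.items()}


freqs = distinct_term_freq([["T1", "T1", "T2"], ["T2"]])
print(freqs)  # T1 -> 1/3, T2 -> 2/3
```

Fractions are used here only to keep the result exact; a floating-point division would serve equally well.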
Created by eugeny.malyutin on 05.05.16.
Deduplicator based on RandomProjectionsHasher and cosine similarity.
Created by eugeny.malyutin on 06.05.16.
DataFrame join as a Transformer, with the right DataFrame and a column expression as parameters. Used to join two DataFrames within a pipeline. Do not save such pipelines.
Created by eugeny.malyutin on 05.05.16.
LanguageDetector transformer: from a String column with text, creates a String column with the language code or (Unknown). Built around optimaize langdetect.
Created by eugeny.malyutin on 17.05.16.
Created by eugeny.malyutin on 17.05.16. Simple NGramExtractor Transformer with an option to extract n-grams of every size from the lower bound (lowerNGrams) to the upper bound together.
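A minimal plain-Python sketch of extracting n-grams of several sizes in one pass is shown below. It does not reproduce the transformer itself. The function name `extract_ngrams`, the parameter names `lower` and `upper`, the space separator, and the output order (by size, then by position) are all assumptions made for illustration.

```python
def extract_ngrams(tokens, lower, upper):
    """Return all n-grams with lower <= n <= upper, joined with a space.

    Assumption: results are ordered by n-gram size, then by start position.
    """
    if lower < 1 or upper < lower:
        raise ValueError("expected 1 <= lower <= upper")
    result = []
    for n in range(lower, upper + 1):
        # When n exceeds len(tokens) the range is empty and nothing is added
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


print(extract_ngrams(["a", "b", "c"], 1, 2))
# ['a', 'b', 'c', 'a b', 'b c']
```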
Extension of ml.feature.CountVectorizer and CountVectorizerModel that saves the vocabulary, or the vocabulary size, in the outputColumn metadata as an AttributeGroup.
Created by eugeny.malyutin on 06.05.16.
The original CountVectorizer is badly implemented: it does not conform to the ML pipeline interface, uses caching in a leaking manner, and so on. Here we fix some of the problems, but more remains to be done.
TODO: Get rid of Vocabulary in the constructor, adapt to the ModelWithSummary interface.
Created by eugeny.malyutin on 05.05.16.
Implementation of locality-sensitive hashing (similar vectors get similar hashes) via Random Binary Projection as an ml.Transformer. Requires a DataFrame whose inputCol holds linalg.Vector values and outputs a Long column with the hash value.
If dimensions is not set, it is looked up in the AttributeGroup stored in the column metadata.
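The idea behind random binary projection can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the transformer's code: the names `make_projections`, `projection_hash` and `hamming`, the Gaussian hyperplanes, the fixed seed, and the bit-packing order are all hypothetical. Each bit records on which side of a random hyperplane the vector falls, so vectors with a small angle between them tend to agree on most bits.

```python
import random


def make_projections(dim, bits, seed=42):
    """Draw `bits` random hyperplanes (Gaussian normals) in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def projection_hash(vector, projections):
    """Pack the sign of each projection into one integer, one bit per plane."""
    h = 0
    for i, plane in enumerate(projections):
        dot = sum(p * x for p, x in zip(plane, vector))
        if dot >= 0.0:
            h |= 1 << i
    return h


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


planes = make_projections(dim=3, bits=16)
v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]
opposite = [-1.0, -2.0, -3.0]
print(hamming(projection_hash(v, planes), projection_hash(same_direction, planes)))
print(hamming(projection_hash(v, planes), projection_hash(opposite, planes)))
```

Because only the sign of each dot product is kept, the hash depends on the direction of the vector and not on its length, which is why Hamming distance between hashes can serve as a rough proxy for cosine similarity. Using at most 64 planes keeps the value within the width of a Long.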
Created by eugeny.malyutin on 12.05.16.
regexp_replace wrapped as a Transformer.
Created by eugeny.malyutin on 05.05.16.
Transformer that removes URLs from text, based on Lucene's UAX29URLEmailTokenizer. Given an inputColumn of StringType, returns an outputColumn of StringType containing the text with URLs filtered out.
Created by eugeny.malyutin on 28.07.16.
Created by eugeny.malyutin on 28.07.16. This object wraps the language detector functionality so that it can be reused without depending on spark.ml.* (Transformer, etc.).
Created by eugeny.malyutin on 06.05.16. Implementation of a continuous updater for EWMA / EWMVar (exponentially weighted mean and variance) and Sig (approximately a z-score) as an mllib Transformer: Term, newFreq, oldEWMA, oldEWMVar -> EWStruct(newEWMA, newEWMVar, Sig)
The formulas come from: T. Finch. Incremental calculation of weighted mean and variance. Technical report, University of Cambridge, 2009.
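One step of such an update can be sketched in plain Python. The mean and variance recurrences follow the exponentially weighted form given in Finch's report. Everything else is an assumption for illustration: the function name `ew_update`, the smoothing factor `alpha`, the Python shape of `EWStruct`, and in particular the definition of Sig as the deviation of the new value from the old mean divided by the old standard deviation, which the description only characterizes as roughly a z-score.

```python
import math
from collections import namedtuple

# Hypothetical Python stand-in for the EWStruct named in the description
EWStruct = namedtuple("EWStruct", ["ewma", "ewmvar", "sig"])


def ew_update(new_freq, old_ewma, old_ewmvar, alpha):
    """One incremental exponentially weighted mean/variance update."""
    diff = new_freq - old_ewma
    incr = alpha * diff
    new_ewma = old_ewma + incr
    new_ewmvar = (1.0 - alpha) * (old_ewmvar + diff * incr)
    # Assumption: Sig compares the new value with the previous statistics;
    # a zero previous variance yields 0.0 to avoid division by zero.
    sig = diff / math.sqrt(old_ewmvar) if old_ewmvar > 0.0 else 0.0
    return EWStruct(new_ewma, new_ewmvar, sig)


print(ew_update(new_freq=3.0, old_ewma=1.0, old_ewmvar=4.0, alpha=0.5))
# EWStruct(ewma=2.0, ewmvar=3.0, sig=1.0)
```

The update needs only the previous mean and variance, so per-term state stays constant in size regardless of how many observations have been processed.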