texts

Type Members

class EWStatsTransformer extends Transformer with Params with DefaultParamsWritable

Created by eugeny.malyutin on 06.05.16.
Implementation of a continuous EWMA / EWMVar (exponentially weighted moving average and variance) / Sig (approximately a z-score) updater as a Spark ML Transformer. Term, newFreq, oldEWMA, oldEWMVar -> EWStruct(newEWMA, newEWMVar, Sig).
The formulas come from: T. Finch. Incremental calculation of weighted mean and variance. Technical report, University of Cambridge, 2009.
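The incremental update described above can be sketched in plain Scala, without Spark. This is a minimal sketch, not the transformer's actual code: the names EWState, EWUpdate and EWSketch.update are hypothetical, alpha is an assumed smoothing parameter, and the definition of Sig as the deviation from the old mean divided by the old standard deviation is an assumption based only on the "~= z-score" remark. The mean and variance recurrences follow the exponentially weighted form given in Finch's report.

```scala
// Hypothetical names for illustration only; not the library's API.
final case class EWState(mean: Double, variance: Double)
final case class EWUpdate(state: EWState, sig: Double)

object EWSketch {
  /** One exponentially weighted update step for a new observation x. */
  def update(old: EWState, x: Double, alpha: Double): EWUpdate = {
    require(alpha > 0.0 && alpha < 1.0, "alpha must lie in (0, 1)")
    val diff = x - old.mean          // deviation from the previous mean
    val incr = alpha * diff          // weighted increment
    val newMean = old.mean + incr
    val newVar = (1.0 - alpha) * (old.variance + diff * incr)
    // Assumption: Sig is a z-score-like value against the OLD statistics;
    // it is set to 0.0 when the old variance is zero.
    val sig = if (old.variance > 0.0) diff / math.sqrt(old.variance) else 0.0
    EWUpdate(EWState(newMean, newVar), sig)
  }
}
```

Chaining calls, feeding each returned state into the next update, gives the continuous behaviour: only the previous mean and variance need to be kept per term.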
class FreqStatsTransformer extends Transformer with Params with HasInputCol

Created by eugeny.malyutin on 06.05.16.
Transformer that counts term frequencies over a text corpus, counting each term at most once per document ([T1 T1 T2], [T2] -> {T1 -> 1/3, T2 -> 2/3}).
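The worked example in the description can be reproduced with a few lines of plain Scala. This is a sketch under an assumption inferred from that example: each frequency is the number of documents containing the term divided by the total number of distinct (document, term) occurrences. FreqSketch and termFreq are hypothetical names, not part of the package.

```scala
// Hypothetical helper for illustration; the real transformer works on DataFrames.
object FreqSketch {
  /** Term frequencies where each term counts at most once per document. */
  def termFreq(docs: Seq[Seq[String]]): Map[String, Double] = {
    // Deduplicate terms inside each document, then pool all occurrences.
    val occurrences = docs.flatMap(_.distinct)
    val total = occurrences.size.toDouble
    occurrences.groupBy(identity).map { case (term, xs) => term -> xs.size / total }
  }
}
```

For the corpus [T1 T1 T2], [T2] there are three distinct occurrences (T1, T2, T2), which yields T1 -> 1/3 and T2 -> 2/3 as in the description.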
class HashBasedDeduplicator extends Transformer with Params

Created by eugeny.malyutin on 05.05.16.
Deduplicator based on RandomProjectionsHasher and cosine similarity.
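One way to combine hashing with cosine similarity is to compare only items that share a hash value and drop those too similar to an item already kept. The sketch below illustrates that idea in plain Scala; it is an assumption about the general approach, not the class's actual logic. DedupSketch, cosine, deduplicate and the threshold parameter are hypothetical names.

```scala
// Hypothetical sketch; the real class operates on DataFrames and its own Params.
object DedupSketch {
  /** Cosine similarity of two equally sized vectors; 0.0 if either is all zeros. */
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  /** Within each hash bucket, keep an item only if no kept item is too similar. */
  def deduplicate(items: Seq[(Long, Array[Double])],
                  threshold: Double): Seq[(Long, Array[Double])] =
    items.groupBy(_._1).values.flatMap { bucket =>
      bucket.foldLeft(Vector.empty[(Long, Array[Double])]) { (kept, cand) =>
        if (kept.exists(k => cosine(k._2, cand._2) >= threshold)) kept
        else kept :+ cand
      }
    }.toSeq
}
```

Restricting comparisons to a bucket keeps the pairwise cosine checks from growing quadratically over the whole data set, at the cost of missing near-duplicates that land in different buckets.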
class Joiner extends Transformer with Params

Created by eugeny.malyutin on 06.05.16.
A DataFrame join as a Transformer, with the right DataFrame and the column expression as parameters. Used to join two DataFrames inside a pipeline. Do not save such pipelines.
class LanguageAwareAnalyzer extends Transformer with HasOutputCol with Params with DefaultParamsWritable

Created by eugeny.malyutin on 05.05.16.
class LanguageDetectorTransformer extends Transformer with HasInputCol with HasOutputCol

Created by eugeny.malyutin on 05.05.16.
LanguageDetector transformer: from a String column with text, creates a String column with the language code or (Unknown). Built around optimaize langdetect.
class NGramExtractor extends Transformer with DefaultParamsWritable with Params with HasInputCol with HasOutputCol

Created by eugeny.malyutin on 17.05.16.
Simple NGramExtractor Transformer with an option to extract all n-grams from the lower n-gram size up to the upper one together.
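Extracting every n-gram size in a range at once can be sketched with a sliding window. This is a plain-Scala illustration, not the transformer's code: NGramSketch, nGrams, lower and upper are hypothetical names, and joining tokens with a single space is an assumption.

```scala
// Hypothetical helper for illustration only.
object NGramSketch {
  /** All n-grams of sizes lower..upper, smallest size first, joined by spaces. */
  def nGrams(tokens: Seq[String], lower: Int, upper: Int): Seq[String] = {
    require(lower >= 1 && lower <= upper, "need 1 <= lower <= upper")
    (lower to upper).flatMap { n =>
      // sliding(n) yields one short window when tokens.size < n; drop it.
      tokens.sliding(n).filter(_.size == n).map(_.mkString(" "))
    }
  }
}
```

With tokens a, b, c and the range 1 to 2, the result is a, b, c, "a b", "b c".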
class OdklCountVectorizer extends CountVectorizer with OdklCountVectorizerParams
class OdklCountVectorizerModel extends CountVectorizerModel with OdklCountVectorizerParams

Extension of ml.feature.CountVectorizer and CountVectorizerModel that saves the vocabulary or the vocabulary size in the output column metadata as an AttributeGroup.
trait OdklCountVectorizerParams extends Params

Created by eugeny.malyutin on 06.05.16.
The original CountVectorizer is badly implemented: it does not conform to the ML pipeline interface, uses caching in a leaking manner, etc. Here we fix part of the problems, but more remains to be done.
TODO: Get rid of Vocabulary in the constructor, adapt to the ModelWithSummary interface.
class RandomProjectionsHasher extends Transformer with HasInputCol with HasOutputCol with HasSeed

Created by eugeny.malyutin on 05.05.16.
Implementation of locality-sensitive hashing (similar vectors get similar hashes) via random binary projection as an ml.Transformer. Requires a DataFrame whose inputCol holds linalg.Vector data, and outputs a Long column with the hash value.
If dimensions is not set, the transformer searches for an AttributeGroup in the column metadata.
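Random binary projection can be illustrated without Spark: draw a fixed set of random hyperplanes and record, one bit per hyperplane, on which side the vector falls. The sketch below is a generic version of that technique, not this class's implementation. RandomProjectionSketch and its dim, bits and seed parameters are hypothetical, and drawing hyperplane components from a Gaussian is an assumption.

```scala
import scala.util.Random

// Hypothetical sketch of sign-based random projection hashing into a Long.
final class RandomProjectionSketch(dim: Int, bits: Int, seed: Long) {
  require(dim >= 1, "dim must be positive")
  require(bits >= 1 && bits <= 64, "a Long holds at most 64 bits")

  private val rng = new Random(seed)
  // One random hyperplane (normal vector) per output bit.
  private val planes: Array[Array[Double]] = Array.fill(bits, dim)(rng.nextGaussian())

  /** Sets bit i when the vector lies on the non-negative side of hyperplane i. */
  def hash(v: Array[Double]): Long = {
    require(v.length == dim, "vector dimension mismatch")
    planes.zipWithIndex.foldLeft(0L) { case (acc, (plane, i)) =>
      val dot = plane.zip(v).map { case (p, x) => p * x }.sum
      if (dot >= 0.0) acc | (1L << i) else acc
    }
  }
}
```

Because only the sign of each projection matters, vectors pointing in nearly the same direction tend to agree on most bits, which is what makes the hash useful for grouping candidates before a cosine similarity check. A fixed seed makes the hyperplanes, and therefore the hashes, reproducible.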
class RegexpReplaceTransformer extends Transformer with HasInputCol with HasOutputCol with Params with DefaultParamsWritable

Created by eugeny.malyutin on 12.05.16.
regexp_replace as a Transformer.
class URLElimminator extends Transformer with HasInputCol with HasOutputCol with Params with DefaultParamsWritable

Created by eugeny.malyutin on 05.05.16.
Transformer that removes URLs from text, based on Lucene's UAX29URLEmailTokenizer. Given an input column of StringType, it returns an output column of StringType containing the text with URLs filtered out.

Value Members

object EWStatsTransformer extends DefaultParamsReadable[EWStatsTransformer] with Serializable
object LanguageAwareAnalyzer extends DefaultParamsReadable[LanguageAwareAnalyzer] with Serializable
object LanguageAwareStemmerUtil

Created by eugeny.malyutin on 28.07.16.
This object wraps the language detector functionality so that it can be reused without depending on spark.ml.* (Transformer, etc.).
object LanguageDetectorUtils

Created by eugeny.malyutin on 28.07.16.
object NGramExtractor extends DefaultParamsReadable[NGramExtractor] with Serializable
object NGramUtils

Created by eugeny.malyutin on 28.07.16.
object RegexpReplaceTransformer extends DefaultParamsReadable[RegexpReplaceTransformer] with Serializable
object URLElimminator extends DefaultParamsReadable[URLElimminator] with Serializable
object URLElimminatorUtil

Created by eugeny.malyutin on 28.07.16.
