Package

org.apache.spark.ml.odkl

texts

package texts

Type Members

  1. class EWStatsTransformer extends Transformer with Params with DefaultParamsWritable

    Created by eugeny.malyutin on 06.05.16.

    Implementation of a continuous updater for EWMA and EWMVar (exponentially weighted mean and variance) and Sig (approximately a z-score) as an mllib Transformer. Term, newFreq, oldEWMA, oldEWMVar -> EWStruct(newEWMA, newEWMVar, Sig)

    The formulas come from: T. Finch. Incremental calculation of weighted mean and variance. Technical report, University of Cambridge, 2009.
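
    The update can be sketched in plain Scala, without Spark. This is a minimal illustration of the incremental exponentially weighted mean and variance update from Finch's report, not the transformer's actual code. The object name EWStatsSketch, the alpha parameter, and the exact definition of Sig (deviation from the previous mean in units of the previous standard deviation) are assumptions made for the example.

    ```scala
    // Hypothetical names; this is not the EWStatsTransformer API.
    final case class EWStruct(ewma: Double, ewmVar: Double, sig: Double)

    object EWStatsSketch {
      // alpha in (0, 1] is the weight given to the newest observation.
      def update(newFreq: Double, oldEwma: Double, oldEwmVar: Double, alpha: Double): EWStruct = {
        require(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]")
        val diff = newFreq - oldEwma      // deviation from the previous mean
        val incr = alpha * diff           // weighted step towards the new value
        val newEwma = oldEwma + incr
        val newEwmVar = (1.0 - alpha) * (oldEwmVar + diff * incr)
        // Assumed definition of Sig: a z-score against the previous statistics.
        val sig = if (oldEwmVar > 0.0) diff / math.sqrt(oldEwmVar) else 0.0
        EWStruct(newEwma, newEwmVar, sig)
      }
    }
    ```

    With newFreq = 10, oldEWMA = 4, oldEWMVar = 9 and alpha = 0.5, the sketch yields EWStruct(7.0, 13.5, 2.0).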

  2. class FreqStatsTransformer extends Transformer with Params with HasInputCol

    Created by eugeny.malyutin on 06.05.16.

    Transformer that computes term frequencies over a text corpus, counting each term at most once per document: ([T1 T1 T2], [T2]) -> {T1 -> 1/3, T2 -> 2/3}
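
    The example above can be reproduced with a small plain-Scala sketch. It assumes that the denominator is the total number of distinct (term, document) occurrences, which is what the 1/3 and 2/3 values suggest; the object and method names are hypothetical and not part of FreqStatsTransformer.

    ```scala
    object FreqStatsSketch {
      // Each document is a sequence of terms; the result maps a term to its
      // share of all distinct-per-document occurrences in the corpus.
      def termFrequencies(docs: Seq[Seq[String]]): Map[String, Double] = {
        // A term counts at most once per document.
        val occurrences = docs.flatMap(_.distinct)
        val total = occurrences.size.toDouble
        occurrences
          .groupBy(identity)
          .map { case (term, hits) => term -> hits.size / total }
      }
    }
    ```

    Calling FreqStatsSketch.termFrequencies(Seq(Seq("T1", "T1", "T2"), Seq("T2"))) gives T1 a share of 1/3 and T2 a share of 2/3, because the repeated T1 in the first document is counted once.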

  3. class HashBasedDeduplicator extends Transformer with Params

    Created by eugeny.malyutin on 05.05.16.

    Deduplicator based on RandomProjectionsHasher and cosine similarity.

  4. class Joiner extends Transformer with Params

    Created by eugeny.malyutin on 06.05.16.

    A DataFrame join wrapped as a Transformer, with the right DataFrame and a column expression as parameters. Used to join two DataFrames inside a pipeline. Do not save such pipelines.

  5. class LanguageAwareAnalyzer extends Transformer with HasOutputCol with Params with DefaultParamsWritable

    Created by eugeny.malyutin on 05.05.16.

  6. class LanguageDetectorTransformer extends Transformer with HasInputCol with HasOutputCol

    Created by eugeny.malyutin on 05.05.16.

    Language detector transformer: from a String column with text, creates a String column with the language code or (Unknown). Built around optimaize langdetect.

  7. class NGramExtractor extends Transformer with DefaultParamsWritable with Params with HasInputCol with HasOutputCol

    Created by eugeny.malyutin on 17.05.16.

    Simple n-gram extractor Transformer with an option to extract n-grams of every order from a lower bound to an upper bound together.
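
    A plain-Scala sketch of extracting several n-gram orders in one pass is shown here for illustration. The names NGramSketch, extract, lower and upper are hypothetical, and joining tokens with a single space is an assumption, not necessarily what NGramExtractor does.

    ```scala
    object NGramSketch {
      // Returns all n-grams with lower <= n <= upper, shortest orders first.
      def extract(tokens: Seq[String], lower: Int, upper: Int): Seq[String] = {
        require(lower >= 1 && lower <= upper, "need 1 <= lower <= upper")
        (lower to upper).flatMap { n =>
          tokens
            .sliding(n)
            .filter(_.size == n) // drop the partial window of short inputs
            .map(_.mkString(" "))
        }
      }
    }
    ```

    For the tokens a, b, c with lower = 1 and upper = 2, this returns the unigrams a, b, c followed by the bigrams "a b" and "b c".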

  8. class OdklCountVectorizer extends CountVectorizer with OdklCountVectorizerParams

  9. class OdklCountVectorizerModel extends CountVectorizerModel with OdklCountVectorizerParams

    Extension of ml.feature.CountVectorizer and CountVectorizerModel that saves the vocabulary, or the vocabulary size, in the output column metadata as an AttributeGroup.

  10. trait OdklCountVectorizerParams extends Params

    Created by eugeny.malyutin on 06.05.16.

    The original CountVectorizer is badly implemented: it does not conform to the ML pipeline interface, uses caching in a leaking manner, and so on. Here we fix part of the problems, but more remains to be done.

    TODO: Get rid of Vocabulary in the constructor, adapt to the ModelWithSummary interface.

  11. class RandomProjectionsHasher extends Transformer with HasInputCol with HasOutputCol with HasSeed

    Created by eugeny.malyutin on 05.05.16.

    Implementation of locality-sensitive hashing (similar vectors get similar hashes) via random binary projection as an ml.Transformer. Requires a DataFrame whose inputCol holds a linalg.Vector as the data representation, and outputs a Long column with the hash value.

    If dimensions is not set, it is looked up in the AttributeGroup in the column metadata.
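
    The idea of random binary projection can be sketched without Spark. The code here is a generic illustration under stated assumptions: Gaussian hyperplanes, one hash bit per hyperplane, and plain arrays instead of linalg.Vector. The names RandomProjectionSketch, hyperplanes and hash are hypothetical and do not describe the internals of RandomProjectionsHasher.

    ```scala
    import scala.util.Random

    object RandomProjectionSketch {
      // Draws numBits random hyperplanes of the given dimension.
      def hyperplanes(numBits: Int, dim: Int, seed: Long): Array[Array[Double]] = {
        require(numBits >= 1 && numBits <= 64, "a Long holds at most 64 bits")
        val rnd = new Random(seed)
        Array.fill(numBits, dim)(rnd.nextGaussian())
      }

      // Bit i is set when the vector lies on the positive side of plane i,
      // so vectors separated by a small angle tend to share most bits.
      def hash(vector: Array[Double], planes: Array[Array[Double]]): Long =
        planes.zipWithIndex.foldLeft(0L) { case (acc, (plane, i)) =>
          require(plane.length == vector.length, "dimension mismatch")
          val dot = plane.zip(vector).map { case (p, v) => p * v }.sum
          if (dot > 0.0) acc | (1L << i) else acc
        }
    }
    ```

    Because only the sign of each projection is kept, scaling a vector by a positive constant does not change its hash, which matches the use of cosine similarity in HashBasedDeduplicator.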

  12. class RegexpReplaceTransformer extends Transformer with HasInputCol with HasOutputCol with Params with DefaultParamsWritable

    Created by eugeny.malyutin on 12.05.16.

    regexp_replace wrapped as a Transformer.

  13. class URLElimminator extends Transformer with HasInputCol with HasOutputCol with Params with DefaultParamsWritable

    Created by eugeny.malyutin on 05.05.16.

    Transformer that removes URLs from text, based on the Lucene UAX29URLEmailTokenizer. Given an inputColumn of StringType, returns an outputColumn of StringType containing the text with URLs filtered out.

Value Members

  1. object EWStatsTransformer extends DefaultParamsReadable[EWStatsTransformer] with Serializable

  2. object LanguageAwareAnalyzer extends DefaultParamsReadable[LanguageAwareAnalyzer] with Serializable

  3. object LanguageAwareStemmerUtil

    Created by eugeny.malyutin on 28.07.16.

    This object wraps the language detector functionality so that it can be reused without depending on spark.ml.* (Transformer, etc.).

  4. object LanguageDetectorUtils

    Created by eugeny.malyutin on 28.07.16.

  5. object NGramExtractor extends DefaultParamsReadable[NGramExtractor] with Serializable

  6. object NGramUtils

    Created by eugeny.malyutin on 28.07.16.

  7. object RegexpReplaceTransformer extends DefaultParamsReadable[RegexpReplaceTransformer] with Serializable

  8. object URLElimminator extends DefaultParamsReadable[URLElimminator] with Serializable

  9. object URLElimminatorUtil

    Created by eugeny.malyutin on 28.07.16.
