public class TerminoExtractor
extends java.lang.Object
See Also:
TermSuitePreprocessor
Modifier and Type | Class and Description |
---|---|
static class | TerminoExtractor.ContextualizerMode |
Constructor and Description |
---|
TerminoExtractor() |
Modifier and Type | Method and Description |
---|---|
TerminoExtractor | dynamicMaxSizeFilter(int maxTermIndexSize) Filters the TermIndex dynamically during the term spotting phase (RegexSpotter) of terminology extraction by cleaning by frequency whenever the number of terms in-memory exceeds a max number of terms allowed. |
TermIndex | execute() |
static TerminoExtractor | fromDocumentCollection(Lang lang, java.util.Collection<Document> documents) |
static TerminoExtractor | fromDocumentStream(Lang lang, java.util.stream.Stream<Document> documentStream, long streamSize) |
static TerminoExtractor | fromPreprocessedDocumentStream(Lang lang, java.util.stream.Stream<org.apache.uima.jcas.JCas> casStream, long streamSize) |
static TerminoExtractor | fromPreprocessedJsonFiles(Lang lang, java.lang.String directory) |
static TerminoExtractor | fromPreprocessedJsonFiles(Lang lang, java.lang.String directory, java.lang.String encoding) |
static TerminoExtractor | fromPreprocessedXmiFiles(Lang lang, java.lang.String directory) WARNING: encoding of XMI file must be UTF-8. |
static TerminoExtractor | fromSingleDocument(Lang lang, Document document) |
static TerminoExtractor | fromSinglePreprocessedDocument(Lang lang, org.apache.uima.jcas.JCas cas) |
static TerminoExtractor | fromTextString(Lang lang, java.lang.String text) |
static TerminoExtractor | fromTxtCorpus(Lang lang, java.lang.String directory, java.lang.String pattern) |
static TerminoExtractor | fromTxtCorpus(Lang lang, java.lang.String directory, java.lang.String pattern, java.lang.String encoding) |
TerminoExtractor | postFilter(TerminoFilterConfig filterConfig) Filters the TermIndex at the end of the pipeline, i.e. after the term variant detection phase. |
TerminoExtractor | preFilter(TerminoFilterConfig filterConfig) Filters the TermIndex before the term variant detection phase. |
TerminoExtractor | setTreeTaggerHome(java.lang.String treeTaggerHome) |
TerminoExtractor | setWatcher(TermHistory history) |
TerminoExtractor | useContextualizer(int scope, TerminoExtractor.ContextualizerMode contextualizerMode) |
TerminoExtractor | usingCustomResources(java.lang.String resourceDir) |
public static TerminoExtractor fromTextString(Lang lang, java.lang.String text)
public static TerminoExtractor fromSingleDocument(Lang lang, Document document)
public static TerminoExtractor fromPreprocessedJsonFiles(Lang lang, java.lang.String directory)
public static TerminoExtractor fromPreprocessedJsonFiles(Lang lang, java.lang.String directory, java.lang.String encoding)
public static TerminoExtractor fromPreprocessedXmiFiles(Lang lang, java.lang.String directory)
public static TerminoExtractor fromPreprocessedDocumentStream(Lang lang, java.util.stream.Stream<org.apache.uima.jcas.JCas> casStream, long streamSize)
public static TerminoExtractor fromDocumentStream(Lang lang, java.util.stream.Stream<Document> documentStream, long streamSize)
public static TerminoExtractor fromDocumentCollection(Lang lang, java.util.Collection<Document> documents)
public static TerminoExtractor fromTxtCorpus(Lang lang, java.lang.String directory, java.lang.String pattern)
public static TerminoExtractor fromTxtCorpus(Lang lang, java.lang.String directory, java.lang.String pattern, java.lang.String encoding)
public static TerminoExtractor fromSinglePreprocessedDocument(Lang lang, org.apache.uima.jcas.JCas cas)
public TerminoExtractor setTreeTaggerHome(java.lang.String treeTaggerHome)
public TerminoExtractor usingCustomResources(java.lang.String resourceDir)
public TerminoExtractor useContextualizer(int scope, TerminoExtractor.ContextualizerMode contextualizerMode)
public TerminoExtractor preFilter(TerminoFilterConfig filterConfig)
Filters the TermIndex before the term variant detection phase.
This early-stage filtering causes some low-frequency variations to be missed
during term variation detection, but it is often necessary
when detecting variants takes too long.
Parameters:
filterConfig - The filtering configuration
Returns:
This TerminoExtractor launcher class
See Also:
postFilter(TerminoFilterConfig)
public TerminoExtractor dynamicMaxSizeFilter(int maxTermIndexSize)
Filters the TermIndex dynamically during the term spotting phase (RegexSpotter)
of terminology extraction by cleaning by frequency whenever the number of terms in memory
exceeds the maximum number of terms allowed.
Parameters:
maxTermIndexSize - the maximum number of Term instances allowed to be kept in memory
during the terminology extraction process
Returns:
This TerminoExtractor launcher class
See Also:
TermSuitePipeline.aeMaxSizeThresholdCleaner(TermProperty, int)
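The idea of cleaning by frequency once an in-memory cap is exceeded can be illustrated with a small, self-contained sketch. It is not TermSuite code: the class `CappedTermCounter` and its methods are hypothetical stand-ins for a term index, and the eviction rule (drop the least frequent terms, ties broken alphabetically) is an assumption chosen for simplicity rather than the library's actual cleaning strategy.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy index; it is not a TermSuite class. */
class CappedTermCounter {
    private final int maxSize;
    private final Map<String, Integer> frequencies = new HashMap<>();

    CappedTermCounter(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be at least 1");
        }
        this.maxSize = maxSize;
    }

    /** Counts one occurrence of a term, then prunes if the cap is exceeded. */
    void addOccurrence(String term) {
        frequencies.merge(term, 1, Integer::sum);
        if (frequencies.size() > maxSize) {
            cleanByFrequency();
        }
    }

    /** Drops the least frequent terms (ties broken alphabetically) until the cap holds. */
    private void cleanByFrequency() {
        List<String> terms = new ArrayList<>(frequencies.keySet());
        terms.sort(Comparator
                .comparingInt((String t) -> frequencies.get(t))
                .thenComparing(Comparator.naturalOrder()));
        int excess = frequencies.size() - maxSize;
        // Copy the sublist so removals never touch the list being iterated.
        for (String term : new ArrayList<>(terms.subList(0, excess))) {
            frequencies.remove(term);
        }
    }

    int size() {
        return frequencies.size();
    }

    /** Returns 0 for terms that were never seen or have been pruned. */
    int frequencyOf(String term) {
        return frequencies.getOrDefault(term, 0);
    }
}
```

With a cap of 2, adding two occurrences each of "wind turbine" and "rotor blade" followed by a single "of the" leaves only the first two terms in the counter. In this naive sketch a newly seen term with frequency 1 tends to be evicted immediately once the cap is reached, which shows why such a filter trades recall on rare terms for bounded memory.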
public TerminoExtractor postFilter(TerminoFilterConfig filterConfig)
Filters the TermIndex at the end of the pipeline,
i.e. after the term variant detection phase.
This filtering is lossless when configured with TerminoFilterConfig#keepVariants(true).
Parameters:
filterConfig - The filtering configuration
Returns:
This TerminoExtractor launcher class
See Also:
preFilter(TerminoFilterConfig)
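A toy model may help show why filtering after variant detection can preserve variants. The sketch below is not TermSuite code: `VariantAwareFilter`, its parameters, and the frequency threshold are hypothetical, and it assumes that keeping variants means retaining every detected variant of a term that passes the filter, even when the variant itself falls below the threshold.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy filter; it is not a TermSuite class. */
class VariantAwareFilter {

    /**
     * Keeps terms whose frequency reaches minFrequency. When keepVariants is
     * true, the already-detected variants of the kept terms are retained too.
     */
    static Set<String> filter(Map<String, Integer> frequencies,
                              Map<String, List<String>> variantsByBaseTerm,
                              int minFrequency,
                              boolean keepVariants) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : frequencies.entrySet()) {
            if (entry.getValue() >= minFrequency) {
                kept.add(entry.getKey());
            }
        }
        if (keepVariants) {
            // Collect variants separately so the set is not modified while iterating.
            Set<String> rescued = new TreeSet<>();
            for (String base : kept) {
                rescued.addAll(variantsByBaseTerm.getOrDefault(
                        base, Collections.<String>emptyList()));
            }
            kept.addAll(rescued);
        }
        return kept;
    }
}
```

Because the variant links already exist when this filter runs, a rare variant such as "offshore wind turbine" can survive through its frequent base term "wind turbine". A filter applied before variant detection has no such links to consult, so the rare form would be dropped before it could ever be linked.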
public TermIndex execute()
public TerminoExtractor setWatcher(TermHistory history)