BaseTextVectorizer (deeplearning4j-nlp 0.4-rc0 API)

java.lang.Object
- org.deeplearning4j.bagofwords.vectorizer.BaseTextVectorizer

All Implemented Interfaces:

Serializable, TextVectorizer, Vectorizer

Direct Known Subclasses:

BagOfWordsVectorizer, TfidfVectorizer
```
public abstract class BaseTextVectorizer
extends Object
implements TextVectorizer
```
Base text vectorizer that handles vocabulary creation.
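To make the idea of vocabulary creation concrete, here is a minimal, self-contained sketch of how stop words and a minimum word frequency might shape a vocabulary. It is not the deeplearning4j implementation: `VocabSketch` and `buildVocab` are hypothetical names, whitespace tokenization stands in for a `TokenizerFactory`, and ordering by descending frequency is an assumption, since this page does not say what the vocab is sorted by.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; none of these names belong to deeplearning4j.
final class VocabSketch {

    /**
     * Counts whitespace-separated tokens, skips stop words, and keeps only
     * words seen at least minWordFrequency times. The result iterates in
     * descending order of count (assumed ordering), ties broken alphabetically.
     */
    static Map<String, Integer> buildVocab(List<String> sentences,
                                           Set<String> stopWords,
                                           int minWordFrequency) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            for (String token : sentence.toLowerCase().split("\\s+")) {
                if (token.isEmpty() || stopWords.contains(token)) {
                    continue; // ignore blanks and stop words
                }
                counts.merge(token, 1, Integer::sum);
            }
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (e.getValue() >= minWordFrequency) {
                vocab.put(e.getKey(), e.getValue());
            }
        }
        return vocab;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("the cat sat", "the cat ran", "a dog ran");
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a"));
        // prints {cat=2, ran=2}
        System.out.println(buildVocab(sentences, stopWords, 2));
    }
}
```

In this sketch, raising `minWordFrequency` shrinks the vocabulary, while the stop word list removes words regardless of how often they occur.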

Author:

Adam Gibson

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`protected int`	`batchSize`
`protected VocabCache`	`cache`
`protected DocumentIterator`	`docIter`
`protected InvertedIndex`	`index`
`protected List<String>`	`labels`
`protected LabelAwareSentenceIterator`	`labelSentenceIter`
`protected int`	`minWordFrequency`
`protected AtomicLong`	`numWordsEncountered`
`protected double`	`sample`
`protected SentenceIterator`	`sentenceIterator`
`protected boolean`	`stem`
`protected List<String>`	`stopWords`
`protected TokenizerFactory`	`tokenizerFactory`
`protected akka.actor.ActorSystem`	`trainingSystem`

Constructor Summary

Constructors
Modifier	Constructor and Description
	`BaseTextVectorizer()`
`protected`	`BaseTextVectorizer(VocabCache cache, TokenizerFactory tokenizerFactory, List<String> stopWords, int minWordFrequency, DocumentIterator docIter, SentenceIterator sentenceIterator, List<String> labels, InvertedIndex index, int batchSize, double sample, boolean stem, boolean cleanup)`

Method Summary

Modifier and Type	Method and Description
`int`	`batchSize()` For word vectors, this is the batch size used to partition documents into workloads
`void`	`fit()` Train the model
`VocabCache`	`getCache()`
`DocumentIterator`	`getDocIter()`
`int`	`getMinWordFrequency()`
`SentenceIterator`	`getSentenceIterator()`
`List<String>`	`getStopWords()`
`TokenizerFactory`	`getTokenizerFactory()`
`InvertedIndex`	`index()` Inverted index
`long`	`numWordsEncountered()` Returns the number of words encountered so far
`double`	`sample()` Sampling for building mini batches
`void`	`setCache(VocabCache cache)`
`void`	`setDocIter(DocumentIterator docIter)`
`void`	`setMinWordFrequency(int minWordFrequency)`
`void`	`setSentenceIterator(SentenceIterator sentenceIterator)`
`void`	`setStopWords(List<String> stopWords)`
`void`	`setTokenizerFactory(TokenizerFactory tokenizerFactory)`
`VocabCache`	`vocab()` The vocab sorted in descending order

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.deeplearning4j.bagofwords.vectorizer.TextVectorizer
transform, vectorize, vectorize, vectorize

Methods inherited from interface org.deeplearning4j.datasets.vectorizer.Vectorizer
vectorize

Field Detail

cache
```
protected transient VocabCache cache
```

trainingSystem

protected transient akka.actor.ActorSystem trainingSystem

tokenizerFactory

protected transient TokenizerFactory tokenizerFactory

stopWords
```
protected List<String> stopWords
```

minWordFrequency
```
protected int minWordFrequency
```

docIter

protected transient DocumentIterator docIter

labels
```
protected List<String> labels
```

sentenceIterator

protected transient SentenceIterator sentenceIterator

labelSentenceIter

protected transient LabelAwareSentenceIterator labelSentenceIter

numWordsEncountered

protected AtomicLong numWordsEncountered

index
```
protected InvertedIndex index
```

batchSize
```
protected int batchSize
```

sample
```
protected double sample
```

stem
```
protected boolean stem
```

Constructor Detail

BaseTextVectorizer
```
public BaseTextVectorizer()
```

BaseTextVectorizer

protected BaseTextVectorizer(VocabCache cache,
                             TokenizerFactory tokenizerFactory,
                             List<String> stopWords,
                             int minWordFrequency,
                             DocumentIterator docIter,
                             SentenceIterator sentenceIterator,
                             List<String> labels,
                             InvertedIndex index,
                             int batchSize,
                             double sample,
                             boolean stem,
                             boolean cleanup)

Method Detail
- batchSize
```
public int batchSize()
```
  Description copied from interface: TextVectorizer
  
  For word vectors, this is the batch size used to partition documents into workloads
  
  Specified by:
  
  batchSize in interface TextVectorizer
  
  Returns:
  
  the batch size for partitioning documents into workloads
- sample
```
public double sample()
```
  Description copied from interface: TextVectorizer
  
  Sampling for building mini batches
  
  Specified by:
  
  sample in interface TextVectorizer
  
  Returns:
  
  the sampling
- fit
```
public void fit()
```
  Description copied from interface: TextVectorizer
  
  Train the model
  
  Specified by:
  
  fit in interface TextVectorizer
- vocab
```
public VocabCache vocab()
```
  Description copied from interface: TextVectorizer
  
  The vocab sorted in descending order
  
  Specified by:
  
  vocab in interface TextVectorizer
  
  Returns:
  
  the vocab sorted in descending order
- getSentenceIterator
```
public SentenceIterator getSentenceIterator()
```
- setSentenceIterator
```
public void setSentenceIterator(SentenceIterator sentenceIterator)
```
- getDocIter
```
public DocumentIterator getDocIter()
```
- setDocIter
```
public void setDocIter(DocumentIterator docIter)
```
- getMinWordFrequency
```
public int getMinWordFrequency()
```
- setMinWordFrequency
```
public void setMinWordFrequency(int minWordFrequency)
```
- getStopWords
```
public List<String> getStopWords()
```
- setStopWords
```
public void setStopWords(List<String> stopWords)
```
- getTokenizerFactory
```
public TokenizerFactory getTokenizerFactory()
```
- setTokenizerFactory
```
public void setTokenizerFactory(TokenizerFactory tokenizerFactory)
```
- getCache
```
public VocabCache getCache()
```
- setCache
```
public void setCache(VocabCache cache)
```
- numWordsEncountered
```
public long numWordsEncountered()
```
  Description copied from interface: TextVectorizer
  
  Returns the number of words encountered so far
  
  Specified by:
  
  numWordsEncountered in interface TextVectorizer
  
  Returns:
  
  the number of words encountered so far
- index
```
public InvertedIndex index()
```
  Description copied from interface: TextVectorizer
  
  Inverted index
  
  Specified by:
  
  index in interface TextVectorizer
  
  Returns:
  
  the inverted index for this vectorizer
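
The descriptions of `batchSize()` and `index()` above are brief, so the following sketch shows the two underlying ideas in plain Java: splitting documents into workloads of a fixed batch size, and building an inverted index that maps each word to the documents containing it. This is an illustration under stated assumptions, not the library's code: `WorkloadSketch`, `partition`, and `invertedIndex` are hypothetical names, documents are plain strings tokenized on whitespace, and document ids are simply list positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; none of these names belong to deeplearning4j.
final class WorkloadSketch {

    /** Splits documents into consecutive batches of at most batchSize items. */
    static <T> List<List<T>> partition(List<T> documents, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < documents.size(); start += batchSize) {
            int end = Math.min(start + batchSize, documents.size());
            batches.add(new ArrayList<>(documents.subList(start, end)));
        }
        return batches;
    }

    /** Maps each word to the sorted list of document positions that contain it. */
    static Map<String, List<Integer>> invertedIndex(List<String> documents) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            for (String token : documents.get(docId).toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(token, k -> new ArrayList<>());
                // Record each document at most once per word.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a b", "b c", "c c a");
        // prints [[a b, b c], [c c a]]
        System.out.println(partition(docs, 2));
        // prints {a=[0, 2], b=[0, 1], c=[1, 2]}
        System.out.println(invertedIndex(docs));
    }
}
```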
