public class EnglishTextTokenizer extends java.lang.Object implements TextTokenizer
A TextTokenizer implementation for the English language.
Constructor and Description |
---|
EnglishTextTokenizer() |
Modifier and Type | Method and Description |
---|---|
protected java.lang.String | convertWord(java.lang.String word) Converts a word into all upper case and checks if it is a known stop word in the English language. |
java.util.Set<java.lang.String> | stopWords() Gets all stop-words for a language. |
java.util.Set<java.lang.String> | tokenize(java.lang.String text) Tokenizes a text and discards all stop-words from it. |
public java.util.Set<java.lang.String> tokenize(java.lang.String text) throws java.io.IOException
Description copied from interface: TextTokenizer
Tokenizes a text and discards all stop-words from it.
Specified by: tokenize in interface TextTokenizer
Parameters: text - the text to tokenize
Throws: java.io.IOException - if a low-level I/O error occurs.
public java.util.Set<java.lang.String> stopWords()
Description copied from interface: TextTokenizer
Gets all stop-words for a language.
Specified by: stopWords in interface TextTokenizer
protected java.lang.String convertWord(java.lang.String word)
Converts a word into all upper case and checks if it is a known stop word in the English language. If it is, the word is discarded and is not considered a valid token.
Parameters: word - the word
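The behaviour described for convertWord and tokenize can be illustrated with a minimal, self-contained sketch. It is not the source of EnglishTextTokenizer: the class name StopWordTokenizerSketch, the tiny stop-word list, the splitting rule, and the use of null to signal a discarded word are all assumptions made for illustration, since this page does not show how the real class splits text or reports a discarded word.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for illustration only; not the library's implementation.
public class StopWordTokenizerSketch {

    // Assumed, deliberately tiny stop-word list. The real list is not shown on this page.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("A", "AN", "AND", "IS", "OF", "THE", "TO"));

    // Upper-cases the word and returns null if it is empty or a stop word.
    // Returning null to mean "discarded" is an assumption of this sketch.
    public static String convertWord(String word) {
        String upper = word.toUpperCase(Locale.ENGLISH);
        if (upper.isEmpty() || STOP_WORDS.contains(upper)) {
            return null;
        }
        return upper;
    }

    // Splits the text on runs of characters that are neither letters nor digits
    // (an assumed rule) and keeps every word that survives convertWord.
    public static Set<String> tokenize(String text) {
        Set<String> tokens = new LinkedHashSet<>(); // keeps first-seen order
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            String token = convertWord(raw);
            if (token != null) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [QUICK, BROWN, FOX, LAZY, DOG]
        System.out.println(tokenize("The quick brown fox and the lazy dog"));
    }
}
```

Because the result is a Set, repeated words appear only once. The sketch uses a LinkedHashSet so the printed order is predictable; the documented return type does not promise any particular iteration order.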