Remove extra whitespace and other extraneous characters so that only words remain.
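As a rough illustration of this kind of cleanup, here is a minimal Python sketch. It is not this class's actual implementation: the function name `to_plain_words` and the rule for what counts as a "word" (a run of letters) are assumptions made for the example.

```python
import re

# Assumption for this sketch: a "word" is a run of word characters that
# excludes decimal digits and the underscore (roughly, a run of letters).
_WORD_RE = re.compile(r"[^\W\d_]+")


def to_plain_words(text: str) -> str:
    """Return the words found in `text`, separated by single spaces.

    Hypothetical helper: punctuation, digits, underscores, and repeated
    whitespace are all treated as separators and dropped.
    """
    return " ".join(_WORD_RE.findall(text))
```

Note that under this rule an apostrophe also acts as a separator, so a contraction such as "it's" would come out as two tokens. Whether that is acceptable depends on the downstream analysis.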
For now, we simply strip diacritics from the text. More work may be needed here; see https://stackoverflow.com/a/5697575. It may also be possible to compare strings "accent-insensitively" using ICU4J: https://github.com/unicode-org/icu/tree/master/icu4j
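One common way to strip diacritics is to decompose the text with Unicode normalization and then drop the combining marks. The Python sketch below shows that idea using only the standard library; the function name `strip_diacritics` is hypothetical, and this is an assumption about the general technique rather than the code this class uses.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (accents) from `text`.

    Hypothetical helper for illustration only.
    """
    # NFD splits a precomposed character such as "é" into a base letter
    # ("e") followed by a combining mark (U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks,
    # so keeping only the zero-class characters drops the accents.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose anything that is still composable.
    return unicodedata.normalize("NFC", stripped)
```

This approach only handles characters that have a canonical decomposition. Letters such as "ø" or "ł" are single code points with no decomposition, so they pass through unchanged, which is one reason a more complete solution may need extra mapping or a collation-based comparison.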
This class is primarily used in corpus-level analytics to remove noise from the text so that tf-idf scoring is more accurate.