Inspired by Kevin Dias's Ruby implementation: https://github.com/diasks2/pragmatic_segmenter. This approach extracts sentence bounds by first marking up the data with RuleSymbols and then extracting the bounds with a strong regex-based rule application.
Rule-based formatter that applies regex rules in different marking steps. Symbols protect ambiguous bounds from being considered splitters.
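The marking idea can be illustrated with a minimal sketch. It assumes a single hypothetical placeholder, `<DOT>`, and a single rule for decimal numbers; the names `ProtectSketch`, `ProtectedDot` and `protect` are invented for this example and are not part of the library.

```scala
import scala.util.matching.Regex

object ProtectSketch {
  // Hypothetical placeholder symbol, chosen for this sketch only.
  val ProtectedDot: String = "<DOT>"

  // A period between two digits (e.g. "3.14") is ambiguous: it is not a sentence end.
  private val decimalDot: Regex = """(?<=\d)\.(?=\d)""".r

  // Replace every such period with the placeholder so a later split step ignores it.
  def protect(text: String): String =
    decimalDot.replaceAllIn(text, Regex.quoteReplacement(ProtectedDot))
}
```

For example, `ProtectSketch.protect("Pi is 3.14. Done.")` leaves the two sentence-final periods untouched and rewrites only the decimal point.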
Created by Saif Addin on 5/5/2017.
Reads through the symbolized data and computes the bounds from regex rules, according to the meaning of each symbol.
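A rough sketch of this extraction step follows. It assumes two hypothetical symbols, `<BREAK>` for a confirmed bound and `<DOT>` for a protected period, and it returns sentence texts rather than offsets for brevity; `BoundsSketch` and `extract` are illustrative names, not the library's API.

```scala
import java.util.regex.Pattern

object BoundsSketch {
  // Hypothetical symbols for this sketch; the real library defines its own.
  val ProtectedDot: String = "<DOT>"
  val Break: String = "<BREAK>"

  // Split on the break symbol, then map the protection symbol back to a literal period.
  def extract(symbolized: String): Seq[String] =
    symbolized
      .split(Pattern.quote(Break))
      .map(_.replace(ProtectedDot, ".").trim)
      .filter(_.nonEmpty)
      .toSeq
}
```

Each symbol carries one meaning: the break symbol is the only place where the text is cut, and the protection symbol is only ever restored, never split on.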
Base symbols that may be extended later on. For now, they are kept in the pragmatic scope.
Annotator that detects sentence boundaries using any provided approach
See https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/test/scala/com/johnsnowlabs/nlp/annotators/sbd/pragmatic for further reference on how to use this API
Dictionary of common English abbreviations that must be accounted for when detecting sentence bounds, since their trailing periods are ambiguous.
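In pragmatic segmentation, a dictionary like this is typically used to keep the period after an abbreviation from splitting a sentence. The sketch below assumes a tiny illustrative word list and a hypothetical `<DOT>` placeholder; `AbbreviationSketch` and its members are invented names, and the real dictionary is far larger.

```scala
import scala.util.matching.Regex

object AbbreviationSketch {
  // Tiny illustrative list of plain-letter abbreviations (assumption for this sketch).
  val Abbreviations: Seq[String] = Seq("dr", "mr", "mrs", "etc")

  // Hypothetical placeholder symbol.
  val ProtectedDot: String = "<DOT>"

  // Case-insensitive match of a listed abbreviation, as a whole word, right before a period.
  private val abbrDot: Regex =
    ("(?i)\\b(" + Abbreviations.mkString("|") + ")\\.").r

  // Keep the abbreviation as written and swap its period for the placeholder.
  def protect(text: String): String =
    abbrDot.replaceAllIn(text, m => Regex.quoteReplacement(m.group(1) + ProtectedDot))
}
```

Note that this simple rule also protects an abbreviation that genuinely ends a sentence (for example a trailing "etc."), so a fuller implementation would need additional rules for that case.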
Extends RuleSymbols with specific symbols used for the pragmatic approach. Right now, this is the only extension.
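The relationship between a base set of symbols and an approach-specific extension might look like the sketch below. All names here (`SketchSymbols`, `PragmaticSketchSymbols`, `recovery`, `recover`) and both placeholder strings are hypothetical and only mirror the idea; they are not the library's actual definitions.

```scala
// Base contract: every symbol set knows how to map its placeholders back to literal text.
trait SketchSymbols {
  // Maps each placeholder symbol to the literal text it stands for.
  def recovery: Map[String, String]

  // Undo all placeholders in a string.
  def recover(text: String): String =
    recovery.foldLeft(text) { case (acc, (symbol, literal)) => acc.replace(symbol, literal) }
}

// Approach-specific extension that supplies concrete symbols.
object PragmaticSketchSymbols extends SketchSymbols {
  val ProtectedDot: String = "<DOT>"
  val ProtectedQuestion: String = "<QUESTION>"

  val recovery: Map[String, String] =
    Map(ProtectedDot -> ".", ProtectedQuestion -> "?")
}
```

Keeping the recovery logic in the base trait means a new approach only has to declare its own symbols and their literal meanings.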