Inspired by Kevin Dias's Ruby implementation: https://github.com/diasks2/pragmatic_segmenter
This approach extracts sentence bounds by first formatting the data with RuleSymbols and then extracting the bounds through a strong regex-based rule application.
Rule-based formatter that adds regex rules to the different marking steps. Symbols protect ambiguous bounds from being considered splitters.
Reads through the symbolized data and computes the bounds by applying regex rules that follow the symbols' meaning.
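The two-step flow (symbolize, then split) can be sketched in plain Scala. This is a minimal illustration only: the `ProtectedDot` symbol, the abbreviation pattern, and the splitting regex below are hypothetical stand-ins, not the library's actual RuleSymbols or rules.

```scala
object PragmaticSketch {
  // Hypothetical protection symbol; the real RuleSymbols values differ
  val ProtectedDot = "\u0002"

  // Step 1 (format): replace dots that should NOT split,
  // e.g. the period after a title abbreviation
  def symbolize(text: String): String =
    text.replaceAll("""\b(Dr|Mr|Mrs|Ms)\.""", "$1" + ProtectedDot)

  // Step 2 (extract): split on remaining sentence-ending punctuation,
  // then restore the protected dots
  def split(symbolized: String): Seq[String] =
    symbolized
      .split("""(?<=[.!?])\s+""")
      .map(_.replace(ProtectedDot, "."))
      .toSeq

  def sentences(text: String): Seq[String] = split(symbolize(text))
}
```

With this sketch, `PragmaticSketch.sentences("Dr. Smith arrived. He was late.")` yields two sentences, because the period after "Dr" was symbolized before splitting.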
Base Symbols that may be extended later on. For now kept in the pragmatic scope.
Annotator that detects sentence boundaries using any provided approach.
Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to true.
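The difference between the two output shapes can be sketched without Spark: with explodeSentences disabled, each document keeps its sentences in one array (one row per document); with it enabled, each sentence becomes its own row, like Spark's explode(). The data below is illustrative.

```scala
object ExplodeSketch {
  // Illustrative detection output: (document id, detected sentences)
  val detected: Seq[(Int, Seq[String])] = Seq(
    (0, Seq("First sentence.", "Second one.")),
    (1, Seq("Another doc."))
  )

  // explodeSentences = false: one row per document, sentences kept as an array
  val asArrays: Seq[(Int, Seq[String])] = detected

  // explodeSentences = true: one row per sentence
  val asRows: Seq[(Int, String)] =
    detected.flatMap { case (id, sents) => sents.map(id -> _) }
}
```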
For extended examples of usage, see the Spark NLP Workshop.
{{{
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence
))

val data = Seq("This is my first sentence. This my second. How about a third?").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(sentence) as sentences").show(false)
+------------------------------------------------------------------+
|sentences                                                         |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []]          |
|[document, 43, 60, How about a third?, [sentence -> 2], []]       |
+------------------------------------------------------------------+
}}}
See SentenceDetectorDLModel for pretrained models.
This is a dictionary that contains common English abbreviations whose trailing periods should not be considered sentence bounds.
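A minimal sketch of how such a dictionary can keep abbreviation periods from being treated as boundaries. The set below is a tiny illustrative subset and the `isSentenceBound` helper is hypothetical; the library's actual dictionary and lookup logic differ.

```scala
object AbbreviationGuard {
  // Illustrative subset; the real dictionary is far larger
  val abbreviations: Set[String] = Set("etc", "e.g", "i.e", "vs", "approx", "dept")

  // A period ends a sentence only if the token preceding it
  // is not a known abbreviation
  def isSentenceBound(text: String, dotIndex: Int): Boolean = {
    val before = text.substring(0, dotIndex)
    val lastToken = before.split("""\s+""").lastOption.getOrElse("")
    !abbreviations.contains(lastToken.stripPrefix("(").toLowerCase)
  }
}
```

For example, the period in "etc." is not reported as a bound, while an ordinary sentence-final period is.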
Extends RuleSymbols with specific symbols used for the pragmatic approach. Right now, the only one.
This is the companion object of SentenceDetector. Please refer to that class for the documentation.