Package com.johnsnowlabs.nlp.annotators.spell.context

Type Members

  1. class ContextSpellCheckerApproach extends AnnotatorApproach[ContextSpellCheckerModel] with HasFeatures with WeightedLevenshtein

    Trains a deep-learning based Noisy Channel Model Spell Algorithm. Correction candidates are extracted by combining context information and word information.

    For instantiated/pretrained models, see ContextSpellCheckerModel.

    Spell checking is a sequence-to-sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things (a parameter sketch follows the list):

    1. Different correction candidates for each word — word level.
    2. The surrounding text of each word, i.e. its context — sentence level.
    3. The relative cost of different correction candidates, according to the edit operations they require at the character level — subword level.
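
    Each of these levels has corresponding parameters on the annotator. A minimal sketch for orientation (values are illustrative, and the weights file given to setWeightedDistPath is an assumed example, not a shipped resource):

    new ContextSpellCheckerApproach()
      .setWordMaxDistance(3)              // word level: maximum edit distance for correction candidates
      .setLanguageModelClasses(1650)      // sentence level: class count for the neural language model
      .setWeightedDistPath("weights.psv") // subword level: custom weights for character-level edit operations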

    For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.

    For extended examples of usage, see the article Training a Contextual Spell Checker for Italian Language, the Spark NLP Workshop and the ContextSpellCheckerTestSpec.

    Example

    For this example, we use the first Sherlock Holmes book as the training dataset.

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach
    
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val spellChecker = new ContextSpellCheckerApproach()
      .setInputCols("token")
      .setOutputCol("corrected")
      .setWordMaxDistance(3)
      .setBatchSize(24)
      .setEpochs(8)
      .setLanguageModelClasses(1650)  // dependent on vocabulary size
      // .addVocabClass("_NAME_", names) // Extra classes for correction can be added like this
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      spellChecker
    ))
    
    val path = "src/test/resources/spell/sherlockholmes.txt"
    val dataset = spark.sparkContext.textFile(path)
      .toDF("text")
    val pipelineModel = pipeline.fit(dataset)
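
    After fitting, the trained spell checker stage can be extracted from the pipeline and saved for later reuse. A minimal sketch (the stage position and target path are illustrative assumptions):

    val spellModel = pipelineModel.stages.last.asInstanceOf[ContextSpellCheckerModel]
    spellModel.write.overwrite().save("/tmp/contextSpellCheckerModel")
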
    See also

    NorvigSweetingApproach and SymmetricDeleteApproach for alternative approaches to spell checking

  2. class ContextSpellCheckerModel extends AnnotatorModel[ContextSpellCheckerModel] with HasSimpleAnnotate[ContextSpellCheckerModel] with WeightedLevenshtein with WriteTensorflowModel with ParamsAndFeaturesWritable with HasTransducerFeatures

    Implements a deep-learning based Noisy Channel Model Spell Algorithm. Correction candidates are extracted by combining context information and word information.

    Spell checking is a sequence-to-sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:

    1. Different correction candidates for each word — word level.
    2. The surrounding text of each word, i.e. its context — sentence level.
    3. The relative cost of different correction candidates, according to the edit operations they require at the character level — subword level.

    For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.

    This is the instantiated model of the ContextSpellCheckerApproach. For training your own model, please see the documentation of that class.

    Pretrained models can be loaded with pretrained of the companion object:

    val spellChecker = ContextSpellCheckerModel.pretrained()
      .setInputCols("token")
      .setOutputCol("checked")

    The default model is "spellcheck_dl", if no name is provided. For available pretrained models please see the Models Hub.
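
    The model name and language can also be passed explicitly; a brief sketch using the default English model:

    val spellChecker = ContextSpellCheckerModel.pretrained("spellcheck_dl", "en")
      .setInputCols("token")
      .setOutputCol("checked")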

    For extended examples of usage, see the Spark NLP Workshop and the ContextSpellCheckerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerModel
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("doc")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("doc"))
      .setOutputCol("token")
    
    val spellChecker = ContextSpellCheckerModel
      .pretrained()
      .setTradeOff(12.0f)
      .setInputCols("token")
      .setOutputCol("checked")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      spellChecker
    ))
    
    val data = Seq("It was a cold , dreary day and the country was white with smow .").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("checked.result").show(false)
    +--------------------------------------------------------------------------------+
    |result                                                                          |
    +--------------------------------------------------------------------------------+
    |[It, was, a, cold, ,, dreary, day, and, the, country, was, white, with, snow, .]|
    +--------------------------------------------------------------------------------+
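
    To inspect the corrections token by token, the output array can be exploded with standard Spark SQL (a brief sketch reusing result from above):

    result.selectExpr("explode(checked.result) as corrected").show(5)
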
    See also

    NorvigSweetingModel and SymmetricDeleteModel for alternative approaches to spell checking

  3. trait HasTransducerFeatures extends HasFeatures

  4. case class LangModelSentence(ids: Array[Int], cids: Array[Int], cwids: Array[Int], len: Int) extends Product with Serializable

  5. trait ReadablePretrainedContextSpell extends ReadsLanguageModelGraph with HasPretrained[ContextSpellCheckerModel]

  6. trait ReadsLanguageModelGraph extends ParamsAndFeaturesReadable[ContextSpellCheckerModel] with ReadTensorflowModel

  7. trait WeightedLevenshtein extends AnyRef

Value Members

  1. object CandidateStrategy

  2. object ContextSpellCheckerModel extends ReadablePretrainedContextSpell with Serializable

    This is the companion object of ContextSpellCheckerModel. Please refer to that class for the documentation.
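
    Since the companion object also implements the standard MLReadable interface, a model saved locally (for example with write.save) can be restored; the path below is an illustrative assumption:

    val spellModel = ContextSpellCheckerModel.load("/tmp/contextSpellCheckerModel")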

  3. package parser
