package com.johnsnowlabs.nlp.annotators.classifier.dl

Type Members

  1. class AlbertForTokenClassification extends AnnotatorModel[AlbertForTokenClassification] with HasBatchedAnnotate[AlbertForTokenClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties

    AlbertForTokenClassification can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val tokenClassifier = AlbertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "albert_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example (https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) shows how to import them. For extended examples of usage, see the AlbertForTokenClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = AlbertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
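The IOB labels in the result above can be grouped back into entity spans. The following is a minimal plain-Scala sketch of that post-processing step, not part of Spark NLP itself: `IobChunks` and `groupEntities` are hypothetical names, and the token list assumes the Tokenizer splits "Paris." into "Paris" and ".", which is consistent with the 20 labels shown.

```scala
object IobChunks {
  // Groups IOB-tagged tokens into (entity text, entity type) pairs.
  // A "B-X" tag opens a new entity, an "I-X" tag extends an open entity of the same type,
  // and anything else (including a stray "I-" tag) closes the open entity.
  def groupEntities(tokens: Seq[String], tags: Seq[String]): Seq[(String, String)] = {
    require(tokens.length == tags.length, "tokens and tags must have the same length")
    val start = (Vector.empty[(String, String)], Option.empty[(String, String)])
    val (done, open) = tokens.zip(tags).foldLeft(start) {
      case ((acc, current), (token, tag)) =>
        (tag.split("-", 2), current) match {
          case (Array("B", kind), _) =>
            (acc ++ current, Some((token, kind)))
          case (Array("I", kind), Some((text, openKind))) if openKind == kind =>
            (acc, Some((s"$text $token", kind)))
          case _ =>
            (acc ++ current, None)
        }
    }
    done ++ open
  }

  def main(args: Array[String]): Unit = {
    // Tokens are written out by hand here (assumed tokenization of the example sentence).
    val tokens = Seq("John", "Lenon", "was", "born", "in", "London", "and", "lived", "in",
      "Paris", ".", "My", "name", "is", "Sarah", "and", "I", "live", "in", "London")
    val tags = Seq("B-PER", "I-PER", "O", "O", "O", "B-LOC", "O", "O", "O", "B-LOC",
      "O", "O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC")
    groupEntities(tokens, tags).foreach(println)
    // (John Lenon,PER)
    // (London,LOC)
    // (Paris,LOC)
    // (Sarah,PER)
    // (London,LOC)
  }
}
```

Joining tokens with a single space is a simplification; a pipeline that needs exact character offsets would use the begin and end positions of the token annotations instead.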
    See also

    Annotators Main Page for a list of transformer based classifiers

    AlbertEmbeddings for token-level embeddings

  2. class BertForSequenceClassification extends AnnotatorModel[BertForSequenceClassification] with HasBatchedAnnotate[BertForSequenceClassification] with WriteTensorflowModel with HasCaseSensitiveProperties

    BertForSequenceClassification can load Bert Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val sequenceClassifier = BertForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "bert_base_sequence_classifier_imdb", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example (https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) shows how to import them. For extended examples of usage, see the BertForSequenceClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val sequenceClassifier = BertForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      sequenceClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +--------------------+
    |result              |
    +--------------------+
    |[neg, neg]          |
    |[pos, pos, pos, pos]|
    +--------------------+
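To make the idea of a classification head more concrete, the sketch below shows how raw class scores (logits) from a linear layer are commonly turned into a label: a softmax followed by an arg-max. This is an illustration in plain Scala under assumed inputs, not Spark NLP's internal code; `SequenceHead`, `softmax`, and `predict` are hypothetical names, and the logit values are made up.

```scala
object SequenceHead {
  // Numerically stable softmax: subtracting the maximum avoids overflow in exp.
  def softmax(logits: Seq[Double]): Seq[Double] = {
    val max = logits.max
    val exps = logits.map(x => math.exp(x - max))
    val sum = exps.sum
    exps.map(_ / sum)
  }

  // Returns the most probable label together with its probability.
  def predict(labels: Seq[String], logits: Seq[Double]): (String, Double) = {
    require(labels.nonEmpty && labels.length == logits.length, "one logit per label is required")
    labels.zip(softmax(logits)).maxBy(_._2)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical logits for a two-class sentiment model.
    val (label, probability) = predict(Seq("neg", "pos"), Seq(-1.2, 2.3))
    println(f"$label ($probability%.3f)")
  }
}
```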
    See also

    Annotators Main Page for a list of transformer based classifiers

    BertEmbeddings for token-level embeddings

    BertSentenceEmbeddings for sentence-level embeddings

    BertForTokenClassification for token-level classification

  3. class BertForTokenClassification extends AnnotatorModel[BertForTokenClassification] with HasBatchedAnnotate[BertForTokenClassification] with WriteTensorflowModel with HasCaseSensitiveProperties

    BertForTokenClassification can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val tokenClassifier = BertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "bert_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example (https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) shows how to import them. For extended examples of usage, see the BertForTokenClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = BertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based classifiers

    BertSentenceEmbeddings for sentence-level embeddings

    BertEmbeddings for token-level embeddings

    BertForSequenceClassification for sentence-level classification

  4. class ClassifierDLApproach extends AnnotatorApproach[ClassifierDLModel] with ParamsAndFeaturesWritable

    Trains a ClassifierDL for generic Multi-class Text Classification.

    ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) we have built inside TensorFlow and supports up to 100 classes.

    For instantiated/pretrained models, see ClassifierDLModel.

    Notes:

    For extended examples of usage, see the Spark NLP Workshop [1] [2] and the ClassifierDLTestSpec.

    Example

    In this example, the training data "sentiment.csv" has the form of

    text,label
    This movie is the best movie I have wached ever! In my opinion this movie can win an award.,0
    This was a terrible movie! The acting was bad really bad!,1
    ...

    Then training can be done like so:

    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
    import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach
    import org.apache.spark.ml.Pipeline
    
    val smallCorpus = spark.read.option("header","true").csv("src/test/resources/classifier/sentiment.csv")
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val docClassifier = new ClassifierDLApproach()
      .setInputCols("sentence_embeddings")
      .setOutputCol("category")
      .setLabelColumn("label")
      .setBatchSize(64)
      .setMaxEpochs(20)
      .setLr(5e-3f)
      .setDropout(0.5f)
    
    val pipeline = new Pipeline()
      .setStages(
        Array(
          documentAssembler,
          useEmbeddings,
          docClassifier
        )
      )
    
    val pipelineModel = pipeline.fit(smallCorpus)
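Since ClassifierDL is described as supporting up to 100 classes, it can help to count the distinct labels in the training file before fitting. The sketch below does this in plain Scala, independent of Spark; `TrainingDataCheck`, `parseLine`, `distinctLabels`, and `withinClassLimit` are hypothetical helper names, and the parser simply splits each line on its last comma rather than handling quoted CSV fields.

```scala
object TrainingDataCheck {
  // Upper bound on the number of classes, taken from the ClassifierDL description.
  val MaxClasses = 100

  // Splits a "text,label" line on its last comma, so commas inside the text survive.
  // Returns None when the line has no comma, no text, or no label.
  def parseLine(line: String): Option[(String, String)] = {
    val idx = line.lastIndexOf(',')
    if (idx <= 0 || idx == line.length - 1) None
    else Some((line.substring(0, idx).trim, line.substring(idx + 1).trim))
  }

  // Distinct labels over all parseable lines. The caller is expected to drop the header line.
  def distinctLabels(lines: Seq[String]): Set[String] =
    lines.flatMap(line => parseLine(line)).map(_._2).toSet

  def withinClassLimit(lines: Seq[String]): Boolean =
    distinctLabels(lines).size <= MaxClasses

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "This movie is the best movie I have wached ever! In my opinion this movie can win an award.,0",
      "This was a terrible movie! The acting was bad really bad!,1"
    )
    println(s"labels: ${distinctLabels(lines).toSeq.sorted.mkString(", ")}")
    println(s"within limit: ${withinClassLimit(lines)}")
  }
}
```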
    See also

    SentimentDLApproach for sentiment analysis

    MultiClassifierDLApproach for multi-label classification

  5. class ClassifierDLModel extends AnnotatorModel[ClassifierDLModel] with HasSimpleAnnotate[ClassifierDLModel] with WriteTensorflowModel with HasStorageRef with ParamsAndFeaturesWritable

    ClassifierDL for generic Multi-class Text Classification.

    ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) we have built inside TensorFlow and supports up to 100 classes.

    This is the instantiated model of the ClassifierDLApproach. For training your own model, please see the documentation of that class.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val classifierDL = ClassifierDLModel.pretrained()
      .setInputCols("sentence_embeddings")
      .setOutputCol("classification")

    The default model is "classifierdl_use_trec6", if no name is provided. It uses embeddings from the UniversalSentenceEncoder and is trained on the TREC-6 dataset. For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Spark NLP Workshop and the ClassifierDLTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
    import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val sarcasmDL = ClassifierDLModel.pretrained("classifierdl_use_sarcasm")
      .setInputCols("sentence_embeddings")
      .setOutputCol("sarcasm")
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentence,
        useEmbeddings,
        sarcasmDL
      ))
    
    val data = Seq(
      "I'm ready!",
      "If I could put into words how much I love waking up at 6 am on Mondays I would."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(arrays_zip(sentence, sarcasm)) as out")
      .selectExpr("out.sentence.result as sentence", "out.sarcasm.result as sarcasm")
      .show(false)
    +-------------------------------------------------------------------------------+-------+
    |sentence                                                                       |sarcasm|
    +-------------------------------------------------------------------------------+-------+
    |I'm ready!                                                                     |normal |
    |If I could put into words how much I love waking up at 6 am on Mondays I would.|sarcasm|
    +-------------------------------------------------------------------------------+-------+
    See also

    SentimentDLModel for sentiment analysis

    MultiClassifierDLModel for multi-label classification

  6. class DistilBertForSequenceClassification extends AnnotatorModel[DistilBertForSequenceClassification] with HasBatchedAnnotate[DistilBertForSequenceClassification] with WriteTensorflowModel with HasCaseSensitiveProperties

    DistilBertForSequenceClassification can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val sequenceClassifier = DistilBertForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "distilbert_base_sequence_classifier_imdb", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example (https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) shows how to import them. For extended examples of usage, see the DistilBertForSequenceClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val sequenceClassifier = DistilBertForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      sequenceClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +--------------------+
    |result              |
    +--------------------+
    |[neg, neg]          |
    |[pos, pos, pos, pos]|
    +--------------------+
    See also

    Annotators Main Page for a list of transformer based classifiers

    DistilBertEmbeddings for token embeddings

    DistilBertForTokenClassification for token-level classification

  7. class DistilBertForTokenClassification extends AnnotatorModel[DistilBertForTokenClassification] with HasBatchedAnnotate[DistilBertForTokenClassification] with WriteTensorflowModel with HasCaseSensitiveProperties

    DistilBertForTokenClassification can load DistilBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val tokenClassifier = DistilBertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "distilbert_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example (https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) shows how to import them. For extended examples of usage, see the DistilBertForTokenClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = DistilBertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based classifiers

    DistilBertEmbeddings for token level embeddings

    DistilBertForSequenceClassification for sentence-level classification

  8. class LongformerForTokenClassification extends AnnotatorModel[LongformerForTokenClassification] with HasBatchedAnnotate[LongformerForTokenClassification] with WriteTensorflowModel with HasCaseSensitiveProperties

    LongformerForTokenClassification can load Longformer Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val tokenClassifier = LongformerForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "longformer_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example (https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) shows how to import them. For extended examples of usage, see the LongformerForTokenClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = LongformerForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based classifiers

    LongformerEmbeddings for token-level embeddings

  9. class MultiClassifierDLApproach extends AnnotatorApproach[MultiClassifierDLModel] with ParamsAndFeaturesWritable

    Trains a MultiClassifierDL for Multi-label Text Classification.

    MultiClassifierDL uses a Bidirectional GRU with a convolutional model that we have built inside TensorFlow and supports up to 100 classes.

    For instantiated/pretrained models, see MultiClassifierDLModel.

    The inputs to MultiClassifierDL are sentence embeddings such as the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings.

    In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).
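The mapping from inputs to a binary vector y can be sketched as a per-label threshold on the model's scores. The example below is plain Scala under stated assumptions: `MultiLabelThreshold`, `toBinaryVector`, and `selectLabels` are hypothetical names, the scores are made up, and treating a score equal to the threshold as a positive is an assumption rather than documented behaviour.

```scala
object MultiLabelThreshold {
  // Maps one score per label to a 0/1 vector: 1 when the score reaches the threshold.
  def toBinaryVector(scores: Seq[Float], threshold: Float = 0.5f): Seq[Int] =
    scores.map(score => if (score >= threshold) 1 else 0)

  // Returns the names of all labels whose score reaches the threshold.
  // Any number of labels, including none, may be selected.
  def selectLabels(labels: Seq[String], scores: Seq[Float], threshold: Float = 0.5f): Seq[String] = {
    require(labels.length == scores.length, "one score per label is required")
    labels.zip(scores).collect { case (label, score) if score >= threshold => label }
  }

  def main(args: Array[String]): Unit = {
    val labels = Seq("toxic", "obscene", "threat")
    val scores = Seq(0.9f, 0.7f, 0.1f) // hypothetical per-label scores
    println(toBinaryVector(scores))       // List(1, 1, 0)
    println(selectLabels(labels, scores)) // List(toxic, obscene)
  }
}
```

Unlike the single-label case, no arg-max is taken, so an instance can end up with an empty label set when every score falls below the threshold.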

    Notes:

    For extended examples of usage, see the Spark NLP Workshop and the MultiClassifierDLTestSpec.

    Example

    In this example, the training data has the form (Note: labels can be arbitrary)

    mr,ref
    "name[Alimentum], area[city centre], familyFriendly[no], near[Burger King]",Alimentum is an adult establish found in the city centre area near Burger King.
    "name[Alimentum], area[city centre], familyFriendly[yes]",Alimentum is a family-friendly place in the city centre.
    ...

    It needs some pre-processing first, so the labels are of type Array[String]. This can be done like so:

    import spark.implicits._
    import com.johnsnowlabs.nlp.annotators.classifier.dl.MultiClassifierDLApproach
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.sql.functions.{col, udf}
    
    // Process training data to create text with associated array of labels
    def splitAndTrim = udf { labels: String =>
      labels.split(", ").map(x=>x.trim)
    }
    
    val smallCorpus = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .option("mode", "DROPMALFORMED")
      .csv("src/test/resources/classifier/e2e.csv")
      .withColumn("labels", splitAndTrim(col("mr")))
      .withColumn("text", col("ref"))
      .drop("mr")
    
    smallCorpus.printSchema()
    // root
    // |-- ref: string (nullable = true)
    // |-- labels: array (nullable = true)
    // |    |-- element: string (containsNull = true)
    // |-- text: string (nullable = true)
    
    // Then create pipeline for training
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
      .setCleanupMode("shrink")
    
    val embeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("embeddings")
    
    val docClassifier = new MultiClassifierDLApproach()
      .setInputCols("embeddings")
      .setOutputCol("category")
      .setLabelColumn("labels")
      .setBatchSize(128)
      .setMaxEpochs(10)
      .setLr(1e-3f)
      .setThreshold(0.5f)
      .setValidationSplit(0.1f)
    
    val pipeline = new Pipeline()
      .setStages(
        Array(
          documentAssembler,
          embeddings,
          docClassifier
        )
      )
    
    val pipelineModel = pipeline.fit(smallCorpus)
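The label pre-processing step can be checked in isolation, without a Spark session. The sketch below applies the same split-and-trim logic as the UDF above to the first "mr" value from the sample data; `LabelSplit` is a hypothetical wrapper object used only for this illustration.

```scala
object LabelSplit {
  // Same logic as the splitAndTrim UDF: split on ", " and trim each label.
  def splitAndTrim(labels: String): Array[String] =
    labels.split(", ").map(_.trim)

  def main(args: Array[String]): Unit = {
    val mr = "name[Alimentum], area[city centre], familyFriendly[no], near[Burger King]"
    splitAndTrim(mr).foreach(println)
    // name[Alimentum]
    // area[city centre]
    // familyFriendly[no]
    // near[Burger King]
  }
}
```

Each bracketed attribute becomes one label, so a row carries as many labels as it has attributes.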
    See also

    SentimentDLApproach for sentiment analysis

    ClassifierDLApproach for multi-class (single-label) classification

    Multi-label classification on Wikipedia

  10. class MultiClassifierDLModel extends AnnotatorModel[MultiClassifierDLModel] with HasSimpleAnnotate[MultiClassifierDLModel] with WriteTensorflowModel with HasStorageRef with ParamsAndFeaturesWritable

    MultiClassifierDL for Multi-label Text Classification.

    MultiClassifierDL uses a Bidirectional GRU with a convolutional model that we have built inside TensorFlow and supports up to 100 classes. The inputs to MultiClassifierDL are sentence embeddings such as the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings.

    This is the instantiated model of the MultiClassifierDLApproach. For training your own model, please see the documentation of that class.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val multiClassifier = MultiClassifierDLModel.pretrained()
      .setInputCols("sentence_embeddings")
      .setOutputCol("categories")

    The default model is "multiclassifierdl_use_toxic", if no name is provided. It uses embeddings from the UniversalSentenceEncoder and classifies toxic comments. The data is based on the Jigsaw Toxic Comment Classification Challenge. For available pretrained models please see the Models Hub.

    In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).

    For extended examples of usage, see the Spark NLP Workshop and the MultiClassifierDLTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.classifier.dl.MultiClassifierDLModel
    import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val multiClassifierDl = MultiClassifierDLModel.pretrained()
      .setInputCols("sentence_embeddings")
      .setOutputCol("classifications")
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        useEmbeddings,
        multiClassifierDl
      ))
    
    val data = Seq(
      "This is pretty good stuff!",
      "Wtf kind of crap is this"
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("text", "classifications.result").show(false)
    +--------------------------+----------------+
    |text                      |result          |
    +--------------------------+----------------+
    |This is pretty good stuff!|[]              |
    |Wtf kind of crap is this  |[toxic, obscene]|
    +--------------------------+----------------+
    See also

    SentimentDLModel for sentiment analysis

    ClassifierDLModel for multi-class (single-label) classification

    Multi-label classification on Wikipedia

  11. trait ReadAlbertForTokenTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel

    Permalink
  12. trait ReadBertForSequenceTensorflowModel extends ReadTensorflowModel

    Permalink
  13. trait ReadBertForTokenTensorflowModel extends ReadTensorflowModel

    Permalink
  14. trait ReadClassifierDLTensorflowModel extends ReadTensorflowModel

    Permalink
  15. trait ReadDistilBertForSequenceTensorflowModel extends ReadTensorflowModel

    Permalink
  16. trait ReadDistilBertForTokenTensorflowModel extends ReadTensorflowModel

    Permalink
  17. trait ReadLongformerForTokenTensorflowModel extends ReadTensorflowModel

    Permalink
  18. trait ReadMultiClassifierDLTensorflowModel extends ReadTensorflowModel

    Permalink
  19. trait ReadRoBertaForTokenTensorflowModel extends ReadTensorflowModel

    Permalink
  20. trait ReadSentimentDLTensorflowModel extends ReadTensorflowModel

    Permalink
  21. trait ReadXlmRoBertaForTokenTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel

    Permalink
  22. trait ReadXlnetForTokenTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel

    Permalink
  23. trait ReadablePretrainedAlbertForTokenModel extends ParamsAndFeaturesReadable[AlbertForTokenClassification] with HasPretrained[AlbertForTokenClassification]

    Permalink
  24. trait ReadablePretrainedBertForSequenceModel extends ParamsAndFeaturesReadable[BertForSequenceClassification] with HasPretrained[BertForSequenceClassification]

    Permalink
  25. trait ReadablePretrainedBertForTokenModel extends ParamsAndFeaturesReadable[BertForTokenClassification] with HasPretrained[BertForTokenClassification]

    Permalink
  26. trait ReadablePretrainedClassifierDL extends ParamsAndFeaturesReadable[ClassifierDLModel] with HasPretrained[ClassifierDLModel]

    Permalink
  27. trait ReadablePretrainedDistilBertForSequenceModel extends ParamsAndFeaturesReadable[DistilBertForSequenceClassification] with HasPretrained[DistilBertForSequenceClassification]

    Permalink
  28. trait ReadablePretrainedDistilBertForTokenModel extends ParamsAndFeaturesReadable[DistilBertForTokenClassification] with HasPretrained[DistilBertForTokenClassification]

    Permalink
  29. trait ReadablePretrainedLongformerForTokenModel extends ParamsAndFeaturesReadable[LongformerForTokenClassification] with HasPretrained[LongformerForTokenClassification]

    Permalink
  30. trait ReadablePretrainedMultiClassifierDL extends ParamsAndFeaturesReadable[MultiClassifierDLModel] with HasPretrained[MultiClassifierDLModel]

    Permalink
  31. trait ReadablePretrainedRoBertaForTokenModel extends ParamsAndFeaturesReadable[RoBertaForTokenClassification] with HasPretrained[RoBertaForTokenClassification]

  32. trait ReadablePretrainedSentimentDL extends ParamsAndFeaturesReadable[SentimentDLModel] with HasPretrained[SentimentDLModel]

  33. trait ReadablePretrainedXlmRoBertaForTokenModel extends ParamsAndFeaturesReadable[XlmRoBertaForTokenClassification] with HasPretrained[XlmRoBertaForTokenClassification]

  34. trait ReadablePretrainedXlnetForTokenModel extends ParamsAndFeaturesReadable[XlnetForTokenClassification] with HasPretrained[XlnetForTokenClassification]

  35. class RoBertaForTokenClassification extends AnnotatorModel[RoBertaForTokenClassification] with HasBatchedAnnotate[RoBertaForTokenClassification] with WriteTensorflowModel with HasCaseSensitiveProperties

    RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val tokenClassifier = RoBertaForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "roberta_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    For extended examples of usage, see the RoBertaForTokenClassificationTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example shows how to import them: https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = RoBertaForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
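
    The result column holds one BIO tag per token: B- opens an entity, I- continues it, and O marks tokens outside any entity. The following is a minimal sketch of how these tags could be grouped into entity spans in plain Scala. BioDecoder, Entity and decode are hypothetical names introduced for illustration and are not part of Spark NLP; the sketch also assumes the tokens and tags have already been collected from the DataFrame into aligned sequences.

```scala
object BioDecoder {
  // Hypothetical helper for illustration; it is not part of Spark NLP.
  final case class Entity(text: String, label: String)

  /** Groups aligned tokens and BIO tags into entity spans. */
  def decode(tokens: Seq[String], tags: Seq[String]): Seq[Entity] = {
    require(tokens.length == tags.length, "tokens and tags must align one-to-one")
    val entities = Vector.newBuilder[Entity]
    // The entity currently being built: its words so far and its label.
    var open: Option[(Vector[String], String)] = None

    def close(): Unit = {
      open.foreach { case (words, label) => entities += Entity(words.mkString(" "), label) }
      open = None
    }

    tokens.zip(tags).foreach { case (token, tag) =>
      open match {
        // An I- tag with the same label extends the open entity.
        case Some((words, label)) if tag == s"I-$label" =>
          open = Some((words :+ token, label))
        // Anything else ends the open entity; B- (or a stray I-) starts a new one.
        case _ =>
          close()
          if (tag.startsWith("B-") || tag.startsWith("I-"))
            open = Some((Vector(token), tag.drop(2)))
      }
    }
    close()
    entities.result()
  }
}
```

    Applied to the 20 tokens and tags of the example above, BioDecoder.decode would yield five entities: "John Lenon" (PER), "London" (LOC), "Paris" (LOC), "Sarah" (PER) and "London" (LOC). Treating a stray I- tag as the start of a new entity is a design choice of this sketch, made so that no tagged token is silently dropped.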

    See also

    Annotators Main Page for a list of transformer based classifiers

    RoBertaSentenceEmbeddings for sentence-level embeddings

    RoBertaEmbeddings for token-level embeddings

  36. class SentimentDLApproach extends AnnotatorApproach[SentimentDLModel] with ParamsAndFeaturesWritable

    Trains a SentimentDL, an annotator for multi-class sentiment analysis.

    In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is deciding whether a product review or tweet should be interpreted as positive or negative.

    For the instantiated/pretrained models, see SentimentDLModel.

    Notes:

    For extended examples of usage, see the Spark NLP Workshop and the SentimentDLTestSpec.

    Example

    In this example, sentiment.csv is in the form

    text,label
    This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
    This was a terrible movie! The acting was bad really bad!,1

    The model can then be trained with

    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.UniversalSentenceEncoder
    import com.johnsnowlabs.nlp.annotators.classifier.dl.{SentimentDLApproach, SentimentDLModel}
    import org.apache.spark.ml.Pipeline
    
    val smallCorpus = spark.read.option("header", "true").csv("src/test/resources/classifier/sentiment.csv")
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val docClassifier = new SentimentDLApproach()
      .setInputCols("sentence_embeddings")
      .setOutputCol("sentiment")
      .setLabelColumn("label")
      .setBatchSize(32)
      .setMaxEpochs(1)
      .setLr(5e-3f)
      .setDropout(0.5f)
    
    val pipeline = new Pipeline()
      .setStages(
        Array(
          documentAssembler,
          useEmbeddings,
          docClassifier
        )
      )
    
    val pipelineModel = pipeline.fit(smallCorpus)

    See also

    MultiClassifierDLApproach for general multi-label classification

    ClassifierDLApproach for general multi-class (single-label) classification

  37. class SentimentDLModel extends AnnotatorModel[SentimentDLModel] with HasSimpleAnnotate[SentimentDLModel] with WriteTensorflowModel with HasStorageRef with ParamsAndFeaturesWritable

    SentimentDL, an annotator for multi-class sentiment analysis.

    In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is deciding whether a product review or tweet should be interpreted as positive or negative.

    This is the instantiated model of the SentimentDLApproach. For training your own model, please see the documentation of that class.

    Pretrained models can be loaded with pretrained of the companion object:

    val sentiment = SentimentDLModel.pretrained()
      .setInputCols("sentence_embeddings")
      .setOutputCol("sentiment")

    The default model is "sentimentdl_use_imdb", if no name is provided. It is an English sentiment analysis model trained on the IMDB dataset. For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Spark NLP Workshop and the SentimentDLTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.UniversalSentenceEncoder
    import com.johnsnowlabs.nlp.annotators.classifier.dl.SentimentDLModel
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val sentiment = SentimentDLModel.pretrained("sentimentdl_use_twitter")
      .setInputCols("sentence_embeddings")
      .setThreshold(0.7F)
      .setOutputCol("sentiment")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      useEmbeddings,
      sentiment
    ))
    
    val data = Seq(
      "Wow, the new video is awesome!",
      "bruh what a damn waste of time"
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("text", "sentiment.result").show(false)
    +------------------------------+----------+
    |text                          |result    |
    +------------------------------+----------+
    |Wow, the new video is awesome!|[positive]|
    |bruh what a damn waste of time|[negative]|
    +------------------------------+----------+
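
    The example above calls setThreshold(0.7F), which sets a minimum confidence for the final result. The following plain-Scala sketch illustrates the idea under stated assumptions: the highest-scoring class is reported only if its score reaches the threshold, and otherwise a fallback label is returned. ThresholdSketch and decide are hypothetical names, and both the "neutral" fallback and the inclusive comparison are assumptions of this sketch rather than a description of the model's internals.

```scala
object ThresholdSketch {
  // Hypothetical stand-in for the decision step; this is not Spark NLP code.
  // `fallback` is assumed to be "neutral" for illustration.
  def decide(scores: Map[String, Float], threshold: Float, fallback: String = "neutral"): String = {
    require(scores.nonEmpty, "at least one class score is required")
    // Pick the class with the highest score.
    val (bestLabel, bestScore) = scores.maxBy(_._2)
    // Report it only when the score reaches the threshold (inclusive by assumption).
    if (bestScore >= threshold) bestLabel else fallback
  }
}
```

    For instance, ThresholdSketch.decide(Map("positive" -> 0.93f, "negative" -> 0.07f), 0.7f) returns "positive", while scores of 0.55f and 0.45f with the same threshold fall back to "neutral". A higher threshold trades coverage for precision: fewer texts receive a positive or negative label, but those that do are scored more confidently.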

    See also

    MultiClassifierDLModel for general multi-label classification

    ClassifierDLModel for general multi-class (single-label) classification

  38. class XlmRoBertaForTokenClassification extends AnnotatorModel[XlmRoBertaForTokenClassification] with HasBatchedAnnotate[XlmRoBertaForTokenClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties

    XlmRoBertaForTokenClassification can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val tokenClassifier = XlmRoBertaForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "xlm_roberta_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    For extended examples of usage, see the XlmRoBertaForTokenClassificationTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example shows how to import them: https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = XlmRoBertaForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+

    See also

    Annotators Main Page for a list of transformer based classifiers

    XlmRoBertaSentenceEmbeddings for sentence-level embeddings

    XlmRoBertaEmbeddings for token-level embeddings

  39. class XlnetForTokenClassification extends AnnotatorModel[XlnetForTokenClassification] with HasBatchedAnnotate[XlnetForTokenClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties

    XlnetForTokenClassification can load XLNet Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val tokenClassifier = XlnetForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "xlnet_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    For extended examples of usage, see the XlnetForTokenClassificationTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example shows how to import them: https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = XlnetForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+

    See also

    Annotators Main Page for a list of transformer based classifiers

    XlnetEmbeddings for token-level embeddings

Value Members

  1. object AlbertForTokenClassification extends ReadablePretrainedAlbertForTokenModel with ReadAlbertForTokenTensorflowModel with Serializable

    This is the companion object of AlbertForTokenClassification. Please refer to that class for the documentation.

  2. object BertForSequenceClassification extends ReadablePretrainedBertForSequenceModel with ReadBertForSequenceTensorflowModel with Serializable

    This is the companion object of BertForSequenceClassification. Please refer to that class for the documentation.

  3. object BertForTokenClassification extends ReadablePretrainedBertForTokenModel with ReadBertForTokenTensorflowModel with Serializable

    This is the companion object of BertForTokenClassification. Please refer to that class for the documentation.

  4. object ClassifierDLApproach extends DefaultParamsReadable[ClassifierDLApproach] with Serializable

    This is the companion object of ClassifierDLApproach. Please refer to that class for the documentation.

  5. object ClassifierDLModel extends ReadablePretrainedClassifierDL with ReadClassifierDLTensorflowModel with Serializable

    This is the companion object of ClassifierDLModel. Please refer to that class for the documentation.

  6. object DistilBertForSequenceClassification extends ReadablePretrainedDistilBertForSequenceModel with ReadDistilBertForSequenceTensorflowModel with Serializable

    This is the companion object of DistilBertForSequenceClassification. Please refer to that class for the documentation.

  7. object DistilBertForTokenClassification extends ReadablePretrainedDistilBertForTokenModel with ReadDistilBertForTokenTensorflowModel with Serializable

    This is the companion object of DistilBertForTokenClassification. Please refer to that class for the documentation.

  8. object LongformerForTokenClassification extends ReadablePretrainedLongformerForTokenModel with ReadLongformerForTokenTensorflowModel with Serializable

    This is the companion object of LongformerForTokenClassification. Please refer to that class for the documentation.

  9. object MultiClassifierDLModel extends ReadablePretrainedMultiClassifierDL with ReadMultiClassifierDLTensorflowModel with Serializable

    This is the companion object of MultiClassifierDLModel. Please refer to that class for the documentation.

  10. object RoBertaForTokenClassification extends ReadablePretrainedRoBertaForTokenModel with ReadRoBertaForTokenTensorflowModel with Serializable

    This is the companion object of RoBertaForTokenClassification. Please refer to that class for the documentation.

  11. object SentimentApproach extends DefaultParamsReadable[SentimentDLApproach]

    This is the companion object of SentimentApproach. Please refer to that class for the documentation.

  12. object SentimentDLModel extends ReadablePretrainedSentimentDL with ReadSentimentDLTensorflowModel with Serializable

    This is the companion object of SentimentDLModel. Please refer to that class for the documentation.

  13. object XlmRoBertaForTokenClassification extends ReadablePretrainedXlmRoBertaForTokenModel with ReadXlmRoBertaForTokenTensorflowModel with Serializable

    This is the companion object of XlmRoBertaForTokenClassification. Please refer to that class for the documentation.

  14. object XlnetForTokenClassification extends ReadablePretrainedXlnetForTokenModel with ReadXlnetForTokenTensorflowModel with Serializable

    This is the companion object of XlnetForTokenClassification. Please refer to that class for the documentation.

Ungrouped