Language Identification and Detection by using CNN and RNN architectures in TensorFlow.
LanguageDetectorDL is an annotator that detects the language of documents or sentences depending on the inputCols.
The models are trained on large datasets such as Wikipedia and Tatoeba.
LanguageDetectorDL generally works best with text longer than 140 characters, although the
length needed depends on the language (how similar the characters are).
The output is a language code in Wiki Code style.
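As a rough illustration of the length guideline above, the following sketch flags inputs that are at or below 140 characters before they are sent to the detector. `LengthCheck`, `MinRecommendedChars`, and `isLongEnough` are hypothetical names introduced here; they are not part of Spark NLP, and the check is only a heuristic derived from the guideline, not a rule enforced by the annotator.

```scala
// Hypothetical helper, not part of Spark NLP: applies the 140-character
// guideline as a simple pre-check on raw input strings.
object LengthCheck {
  // Guideline taken from the description above; treated here as a soft limit.
  val MinRecommendedChars: Int = 140

  // Counts Unicode code points so that characters outside the BMP count once.
  def isLongEnough(text: String): Boolean =
    text.codePointCount(0, text.length) > MinRecommendedChars

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "Hello world.",
      "Spark NLP is an open-source text processing library for advanced natural " +
        "language processing for the Python, Java and Scala programming languages."
    )
    samples.foreach { s =>
      println(s"long enough: ${isLongEnough(s)} | ${s.take(40)}")
    }
  }
}
```

Shorter inputs can still be passed to the annotator; a check like this only helps decide how much weight to give the result.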
Pretrained models can be loaded with pretrained of the companion object:
val languageDetector = LanguageDetectorDL.pretrained()
  .setInputCols("sentence")
  .setOutputCol("language")
If no values are provided, the default model is "ld_wiki_tatoeba_cnn_21" and the default
language is "xx" (meaning multi-lingual).
For available pretrained models please see the Models Hub.
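The snippet above reads from a "sentence" column, which has to be produced by an earlier stage. A minimal sketch of such a pipeline follows, assuming Spark NLP is on the classpath and the model can be downloaded. The model name and language are passed explicitly and simply repeat the documented defaults. The name `sentencePipeline` is chosen here for illustration, and all other annotator settings are left at their defaults, so the shape of the output for multi-sentence documents depends on those defaults.

```scala
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Produces the "sentence" column that the detector reads from.
val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

// Explicit name and language; these repeat the defaults stated above.
val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_21", "xx")
  .setInputCols("sentence")
  .setOutputCol("language")

// Illustrative name; the stages run in the order listed.
val sentencePipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    languageDetector
  ))
```

Passing a different model name from the Models Hub to `pretrained` should work the same way, provided the name and language match an available model.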
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val languageDetector = LanguageDetectorDL.pretrained()
  .setInputCols("document")
  .setOutputCol("language")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    languageDetector
  ))

val data = Seq(
  "Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages.",
  "Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.",
  "Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."
).toDF("text")

val result = pipeline.fit(data).transform(data)

result.select("language.result").show(false)
+------+
|result|
+------+
|[en]  |
|[fr]  |
|[de]  |
+------+
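Because "language.result" is an array column, downstream code often wants a plain string per row. The sketch below is one possible way to do that with standard Spark SQL functions, assuming the `result` DataFrame from the example above is in scope. The names `flat` and `lang` are chosen here for illustration.

```scala
import org.apache.spark.sql.functions.col

// Take the first element of the result array as the language code
// and keep the original text next to it.
val flat = result.select(
  col("text"),
  col("language.result").getItem(0).as("lang")
)

// Show up to 20 rows, truncating long text cells to 40 characters.
flat.show(20, 40)
```

Taking only the first element assumes one detection per row, which matches the output shown above; rows with several annotations would need a different treatment.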
For extended examples of usage, see the Spark NLP Workshop and the LanguageDetectorDLTestSpec.