DocumentAssembler

Companion object DocumentAssembler

class DocumentAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType with HasOutputAnnotationCol

Prepares data into a format that Spark NLP can process. This is the entry point for every Spark NLP pipeline. The DocumentAssembler reads String columns. Additionally, setCleanupMode can be used to pre-process the text (default: disabled). For the possible options, refer to the parameters section.

For more extended examples on document pre-processing see the Examples.
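
As a rough illustration of the entry-point role described above, the following sketch chains a DocumentAssembler with one downstream annotator in a Spark ML Pipeline. The choice of Tokenizer as the second stage, the "token" column name, and the local SparkSession settings are assumptions made for this example and are not taken from this page.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer

// Assumption: a local session is good enough for a demo; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().appName("document-assembler-pipeline").master("local[*]").getOrCreate()
import spark.implicits._

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")

// First stage: turn the raw String column into DOCUMENT annotations.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Second stage (illustrative): an annotator that consumes the "document" column.
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))
val tokens = pipeline.fit(data).transform(data)

tokens.select("token.result").show(false)
```

Any annotator that expects DOCUMENT input could stand in for the Tokenizer here; the point is only that the assembler comes first.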

Example

import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val result = documentAssembler.transform(data)

result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
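
Given the schema printed above, individual annotation fields can be read with ordinary Spark SQL column functions. The sketch below is one possible way to do that, not the only one; the aliases "doc", "begin" and "end" and the local SparkSession settings are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}
import com.johnsnowlabs.nlp.DocumentAssembler

// Assumption: a local session; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().appName("document-fields").master("local[*]").getOrCreate()
import spark.implicits._

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val result = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .transform(data)

// "document" is an array of structs: explode it to one row per annotation,
// then select the struct fields of interest by name.
val fields = result
  .select(explode(col("document")).as("doc"))
  .select(col("doc.result").as("text"), col("doc.begin").as("begin"), col("doc.end").as("end"))

fields.show(false)
```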

Linear Supertypes

HasOutputAnnotationCol, HasOutputAnnotatorType, DefaultParamsWritable, MLWritable, Transformer, PipelineStage, Logging, Params, Serializable, Identifiable, AnyRef, Any

Instance Constructors

new DocumentAssembler()
new DocumentAssembler(uid: String)
uid: required uid for storing the annotator to disk

Type Members

type AnnotatorType = String
Definition Classes
HasOutputAnnotatorType

Value Members

final def !=(arg0: Any): Boolean
Definition Classes
AnyRef → Any
final def ##: Int
Definition Classes
AnyRef → Any
final def $[T](param: Param[T]): T
Attributes
protected
Definition Classes
Params
final def ==(arg0: Any): Boolean
Definition Classes
AnyRef → Any
val EMPTY_STR: String
final def asInstanceOf[T0]: T0
Definition Classes
Any
val cleanupMode: Param[String]
cleanupMode can take the following values:
- disabled: keep the original text. Useful if you need to refer back to the source later
- inplace: replace newlines and tabs with whitespaces, but not stringified ones; do not trim
- inplace_full: replace newlines and tabs with whitespaces, including stringified ones; do not trim
- shrink: collapse all whitespaces, newlines and tabs into a single whitespace, but not stringified ones; trim
- shrink_full: collapse all whitespaces, newlines and tabs into a single whitespace, including stringified ones; trim
- each: replace newlines and tabs with one whitespace each
- each_full: replace newlines and tabs, including stringified ones, with one whitespace each
- delete_full: remove stringified newlines and tabs (replace with nothing)
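
To make the modes listed above more concrete, here is a minimal sketch that applies the shrink mode to a string containing a tab, a newline, repeated spaces and trailing whitespace. The sample text and the local SparkSession settings are made up for illustration; only setCleanupMode and the mode name come from this page.

```scala
import org.apache.spark.sql.SparkSession
import com.johnsnowlabs.nlp.DocumentAssembler

// Assumption: a local session; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().appName("cleanup-mode").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical messy input: a real tab, a real newline, repeated spaces, trailing spaces.
val data = Seq("Spark NLP\tis an   open-source\ntext processing library.  ").toDF("text")

val assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

// Pull the cleaned text out of the first (and only) document annotation.
val cleaned = assembler.transform(data).selectExpr("document[0].result as text")

cleaned.show(false)
```

Swapping "shrink" for another mode name from the list is a quick way to compare their effect on the same input.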
final def clear(param: Param[_]): DocumentAssembler.this.type
Definition Classes
Params
def clone(): AnyRef
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.CloneNotSupportedException]) @HotSpotIntrinsicCandidate() @native()
def copy(extra: ParamMap): Transformer
Definition Classes
DocumentAssembler → Transformer → PipelineStage → Params
def copyValues[T <: Params](to: T, extra: ParamMap): T
Attributes
protected
Definition Classes
Params
final def defaultCopy[T <: Params](extra: ParamMap): T
Attributes
protected
Definition Classes
Params
final def eq(arg0: AnyRef): Boolean
Definition Classes
AnyRef
def equals(arg0: AnyRef): Boolean
Definition Classes
AnyRef → Any
def explainParam(param: Param[_]): String
Definition Classes
Params
def explainParams(): String
Definition Classes
Params
final def extractParamMap(): ParamMap
Definition Classes
Params
final def extractParamMap(extra: ParamMap): ParamMap
Definition Classes
Params
final def get[T](param: Param[T]): Option[T]
Definition Classes
Params
final def getClass(): Class[_ <: AnyRef]
Definition Classes
AnyRef → Any
Annotations
@HotSpotIntrinsicCandidate() @native()
def getCleanupMode: String
cleanupMode to pre-process text
final def getDefault[T](param: Param[T]): Option[T]
Definition Classes
Params
def getIdCol: String
Id column for row reference
def getInputCol: String
Input text column for processing
def getMetadataCol: String
Metadata for document column
final def getOrDefault[T](param: Param[T]): T
Definition Classes
Params
final def getOutputCol: String
Gets the name of the annotation column that will be generated
Definition Classes
HasOutputAnnotationCol
def getParam(paramName: String): Param[Any]
Definition Classes
Params
final def hasDefault[T](param: Param[T]): Boolean
Definition Classes
Params
def hasParam(paramName: String): Boolean
Definition Classes
Params
def hashCode(): Int
Definition Classes
AnyRef → Any
Annotations
@HotSpotIntrinsicCandidate() @native()
val idCol: Param[String]
Id column for row reference
def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
Attributes
protected
Definition Classes
Logging
def initializeLogIfNecessary(isInterpreter: Boolean): Unit
Attributes
protected
Definition Classes
Logging
val inputCol: Param[String]
Input text column for processing
final def isDefined(param: Param[_]): Boolean
Definition Classes
Params
final def isInstanceOf[T0]: Boolean
Definition Classes
Any
final def isSet(param: Param[_]): Boolean
Definition Classes
Params
def isTraceEnabled(): Boolean
Attributes
protected
Definition Classes
Logging
def log: Logger
Attributes
protected
Definition Classes
Logging
def logDebug(msg: => String, throwable: Throwable): Unit
Attributes
protected
Definition Classes
Logging
def logDebug(msg: => String): Unit
Attributes
protected
Definition Classes
Logging
def logError(msg: => String, throwable: Throwable): Unit
Attributes
protected
Definition Classes
Logging
def logError(msg: => String): Unit
Attributes
protected
Definition Classes
Logging
def logInfo(msg: => String, throwable: Throwable): Unit
Attributes
protected
Definition Classes
Logging
def logInfo(msg: => String): Unit
Attributes
protected
Definition Classes
Logging
def logName: String
Attributes
protected
Definition Classes
Logging
def logTrace(msg: => String, throwable: Throwable): Unit
Attributes
protected
Definition Classes
Logging
def logTrace(msg: => String): Unit
Attributes
protected
Definition Classes
Logging
def logWarning(msg: => String, throwable: Throwable): Unit
Attributes
protected
Definition Classes
Logging
def logWarning(msg: => String): Unit
Attributes
protected
Definition Classes
Logging
val metadataCol: Param[String]
Metadata for document column
final def ne(arg0: AnyRef): Boolean
Definition Classes
AnyRef
final def notify(): Unit
Definition Classes
AnyRef
Annotations
@HotSpotIntrinsicCandidate() @native()
final def notifyAll(): Unit
Definition Classes
AnyRef
Annotations
@HotSpotIntrinsicCandidate() @native()
val outputAnnotatorType: AnnotatorType
Output Annotator Type: DOCUMENT
Definition Classes
DocumentAssembler → HasOutputAnnotatorType
final val outputCol: Param[String]
Attributes
protected
Definition Classes
HasOutputAnnotationCol
lazy val params: Array[Param[_]]
Definition Classes
Params
def save(path: String): Unit
Definition Classes
MLWritable
Annotations
@throws("If the input path already exists but overwrite is not enabled.") @Since("1.6.0")
final def set(paramPair: ParamPair[_]): DocumentAssembler.this.type
Attributes
protected
Definition Classes
Params
final def set(param: String, value: Any): DocumentAssembler.this.type
Attributes
protected
Definition Classes
Params
final def set[T](param: Param[T], value: T): DocumentAssembler.this.type
Definition Classes
Params
def setCleanupMode(v: String): DocumentAssembler.this.type
cleanupMode to pre-process text
final def setDefault(paramPairs: ParamPair[_]*): DocumentAssembler.this.type
Attributes
protected
Definition Classes
Params
final def setDefault[T](param: Param[T], value: T): DocumentAssembler.this.type
Attributes
protected[org.apache.spark.ml]
Definition Classes
Params
def setIdCol(value: String): DocumentAssembler.this.type
Id column for row reference
def setInputCol(value: String): DocumentAssembler.this.type
Input text column for processing
def setMetadataCol(value: String): DocumentAssembler.this.type
Metadata for document column
final def setOutputCol(value: String): DocumentAssembler.this.type
Overrides annotation column name when transforming
Definition Classes
HasOutputAnnotationCol
final def synchronized[T0](arg0: => T0): T0
Definition Classes
AnyRef
def toString(): String
Definition Classes
Identifiable → AnyRef → Any
def transform(dataset: Dataset[_]): DataFrame
Definition Classes
DocumentAssembler → Transformer
def transform(dataset: Dataset[_], paramMap: ParamMap): DataFrame
Definition Classes
Transformer
Annotations
@Since("2.0.0")
def transform(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DataFrame
Definition Classes
Transformer
Annotations
@varargs() @Since("2.0.0")
final def transformSchema(schema: StructType): StructType
Requirement for pipeline transformation validation. It is called on fit().
Definition Classes
DocumentAssembler → PipelineStage
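
As a small sketch of what schema validation looks like from the caller's side, transformSchema can be invoked directly on a hand-built input schema, without running any data through the stage. This assumes the returned schema is the input schema plus the configured output column; the one-column input schema is invented for this example.

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.johnsnowlabs.nlp.DocumentAssembler

val assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Hypothetical input schema with a single String column named "text".
val inputSchema = StructType(Seq(StructField("text", StringType, nullable = true)))

// No Dataset is needed: the stage only reasons about column names and types.
val outputSchema = assembler.transformSchema(inputSchema)

println(outputSchema.fieldNames.mkString(", "))
```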
def transformSchema(schema: StructType, logging: Boolean): StructType
Attributes
protected
Definition Classes
PipelineStage
Annotations
@DeveloperApi()
val uid: String
Definition Classes
DocumentAssembler → Identifiable
final def wait(arg0: Long, arg1: Int): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
final def wait(arg0: Long): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException]) @native()
final def wait(): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
def write: MLWriter
Definition Classes
DefaultParamsWritable → MLWritable
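
The write and save members can be used to persist a configured assembler. The sketch below saves one to a temporary directory and reads it back. It assumes the companion object provides load in the usual Spark ML way (as a DefaultParamsReadable), which this page does not spell out; the temporary path and the session settings are likewise illustrative.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import com.johnsnowlabs.nlp.DocumentAssembler

// Assumption: a local session; the ML writer needs an active SparkSession to write metadata.
val spark = SparkSession.builder().appName("assembler-persistence").master("local[*]").getOrCreate()

val assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

// Hypothetical location: a fresh sub-path inside a temporary directory.
val path = Files.createTempDirectory("document-assembler").resolve("stage").toString

// overwrite() avoids the "path already exists" error noted on save.
assembler.write.overwrite().save(path)

// Assumed companion-object loader; restores the uid and parameter values.
val restored = DocumentAssembler.load(path)

println(restored.getCleanupMode)
```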

Deprecated Value Members

def finalize(): Unit
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.Throwable]) @Deprecated
Deprecated
(Since version 9)

Parameters

A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.
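
A short sketch of the setter and getter pattern described above, using only members listed on this page. No SparkSession is needed because nothing is transformed; the column names are placeholders.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler

val assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// cleanupMode has not been set explicitly yet, so the getter returns the default.
val before = assembler.getCleanupMode

// Setters return the instance itself, so they can also be chained.
assembler.setCleanupMode("shrink_full")
val after = assembler.getCleanupMode

// explainParams() (inherited from Params) describes every parameter and its current value.
println(assembler.explainParams())
```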
