class MultiDocumentAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType
Prepares data into a format that Spark NLP can process. This is the entry point for
every Spark NLP pipeline. The MultiDocumentAssembler can read either a String column or an
Array[String] column. Additionally, setCleanupMode can be used to pre-process the text
(Default: disabled). For the possible options, refer to the Parameters section.
For more extended examples of document pre-processing, see the Examples.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.MultiDocumentAssembler

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val multiDocumentAssembler = new MultiDocumentAssembler().setInputCols("text").setOutputCols("document")

val result = multiDocumentAssembler.transform(data)
result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
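The setters listed on this page accept several column names (setInputCols(value: String*) and setOutputCols(value: String*)), so one assembler can prepare more than one text column at a time. The following is a minimal sketch of that usage, not an official example: the column names question, context, document_question and document_context, the sample row, and the local SparkSession settings are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import com.johnsnowlabs.nlp.MultiDocumentAssembler

// Assumption: a local session is created here so the sketch is self-contained.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("MultiDocumentAssemblerMultiColumnSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical two-column input: one question and one context per row.
val qaData = Seq(
  ("What is Spark NLP?", "Spark NLP is an open-source text processing library.")
).toDF("question", "context")

// One output column is named for each input column, in the same order.
val qaAssembler = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val qaResult = qaAssembler.transform(qaData)
qaResult.select("document_question", "document_context").show(false)
```

Each output column should then hold document annotations for its matching input column, which downstream annotators can reference by name.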
Instance Constructors
Type Members
- type AnnotatorType = String
- Definition Classes
- HasOutputAnnotatorType
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def $[T](param: Param[T]): T
- Attributes
- protected
- Definition Classes
- Params
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- val EMPTY_STR: String
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- val cleanupMode: Param[String]
cleanupMode can take the following values:
disabled: keep the original text. Useful if you need to refer back to the source later
inplace: convert newlines and tabs into whitespaces, not stringified ones; don't trim
inplace_full: convert newlines and tabs into whitespaces, including stringified ones; don't trim
shrink: collapse all whitespaces, newlines and tabs to a single whitespace, but not stringified ones; trim
shrink_full: collapse all whitespaces, newlines and tabs to a single whitespace, stringified ones too; trim all
each: convert newlines and tabs to one whitespace each
each_full: convert newlines and tabs, stringified ones too, to one whitespace each
delete_full: remove stringified newlines and tabs (replace with nothing)
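To see how one of these modes is selected, the sketch below passes "shrink" to setCleanupMode and prints the cleaned text next to the raw input. It is only an illustration: the sample string and the local SparkSession settings are assumptions, and no expected output is shown because the exact result depends on the library's cleanup implementation.

```scala
import org.apache.spark.sql.SparkSession
import com.johnsnowlabs.nlp.MultiDocumentAssembler

// Assumption: a local session is created here so the sketch is self-contained.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("MultiDocumentAssemblerCleanupSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical input containing a tab, a newline and repeated spaces.
val rawData = Seq("Spark NLP\tis an open-source\n  text processing   library. ").toDF("text")

val shrinkAssembler = new MultiDocumentAssembler()
  .setInputCols("text")
  .setOutputCols("document")
  .setCleanupMode("shrink") // one of the values listed above

val cleaned = shrinkAssembler.transform(rawData)

// document.result extracts the result field of every annotation in the array.
cleaned.selectExpr("text", "document.result AS cleaned_text").show(false)
```

Swapping "shrink" for another listed value and re-running the same select is a quick way to compare the modes on your own data.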
- final def clear(param: Param[_]): MultiDocumentAssembler.this.type
- Definition Classes
- Params
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @HotSpotIntrinsicCandidate() @native()
- def copy(extra: ParamMap): Transformer
- Definition Classes
- MultiDocumentAssembler → Transformer → PipelineStage → Params
- def copyValues[T <: Params](to: T, extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
- final def defaultCopy[T <: Params](extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- def explainParam(param: Param[_]): String
- Definition Classes
- Params
- def explainParams(): String
- Definition Classes
- Params
- final def extractParamMap(): ParamMap
- Definition Classes
- Params
- final def extractParamMap(extra: ParamMap): ParamMap
- Definition Classes
- Params
- final def get[T](param: Param[T]): Option[T]
- Definition Classes
- Params
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @HotSpotIntrinsicCandidate() @native()
- def getCleanupMode: String
cleanupMode to pre-process text
- final def getDefault[T](param: Param[T]): Option[T]
- Definition Classes
- Params
- def getIdCol: String
Id column for row reference
- def getInputCols: Array[String]
- def getMetadataCol: String
Metadata for document column
- final def getOrDefault[T](param: Param[T]): T
- Definition Classes
- Params
- def getOutputCols: Array[String]
- def getParam(paramName: String): Param[Any]
- Definition Classes
- Params
- final def hasDefault[T](param: Param[T]): Boolean
- Definition Classes
- Params
- def hasParam(paramName: String): Boolean
- Definition Classes
- Params
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @HotSpotIntrinsicCandidate() @native()
- val idCol: Param[String]
Id column for row reference
- def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
- val inputCols: StringArrayParam
Name of input annotation cols
- final def isDefined(param: Param[_]): Boolean
- Definition Classes
- Params
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- final def isSet(param: Param[_]): Boolean
- Definition Classes
- Params
- def isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def log: Logger
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logName: String
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- val metadataCol: Param[String]
Metadata for document column
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @HotSpotIntrinsicCandidate() @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @HotSpotIntrinsicCandidate() @native()
- val outputAnnotatorType: AnnotatorType
Output Annotator Type: DOCUMENT
- Definition Classes
- MultiDocumentAssembler → HasOutputAnnotatorType
- val outputCols: StringArrayParam
- lazy val params: Array[Param[_]]
- Definition Classes
- Params
- def save(path: String): Unit
- Definition Classes
- MLWritable
- Annotations
- @throws("If the input path already exists but overwrite is not enabled.") @Since("1.6.0")
- final def set(paramPair: ParamPair[_]): MultiDocumentAssembler.this.type
- Attributes
- protected
- Definition Classes
- Params
- final def set(param: String, value: Any): MultiDocumentAssembler.this.type
- Attributes
- protected
- Definition Classes
- Params
- final def set[T](param: Param[T], value: T): MultiDocumentAssembler.this.type
- Definition Classes
- Params
- def setCleanupMode(v: String): MultiDocumentAssembler.this.type
cleanupMode to pre-process text
- final def setDefault(paramPairs: ParamPair[_]*): MultiDocumentAssembler.this.type
- Attributes
- protected
- Definition Classes
- Params
- final def setDefault[T](param: Param[T], value: T): MultiDocumentAssembler.this.type
- Attributes
- protected[org.apache.spark.ml]
- Definition Classes
- Params
- def setIdCol(value: String): MultiDocumentAssembler.this.type
Id column for row reference
- def setInputCols(value: String*): MultiDocumentAssembler.this.type
- def setInputCols(value: Array[String]): MultiDocumentAssembler.this.type
- def setMetadataCol(value: String): MultiDocumentAssembler.this.type
Metadata for document column
- def setOutputCols(value: String*): MultiDocumentAssembler.this.type
- def setOutputCols(value: Array[String]): MultiDocumentAssembler.this.type
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- Identifiable → AnyRef → Any
- def transform(dataset: Dataset[_]): DataFrame
- Definition Classes
- MultiDocumentAssembler → Transformer
- def transform(dataset: Dataset[_], paramMap: ParamMap): DataFrame
- Definition Classes
- Transformer
- Annotations
- @Since("2.0.0")
- def transform(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DataFrame
- Definition Classes
- Transformer
- Annotations
- @varargs() @Since("2.0.0")
- final def transformSchema(schema: StructType): StructType
Requirement for pipeline transformation validation. It is called on fit().
- Definition Classes
- MultiDocumentAssembler → PipelineStage
- def transformSchema(schema: StructType, logging: Boolean): StructType
- Attributes
- protected
- Definition Classes
- PipelineStage
- Annotations
- @DeveloperApi()
- val uid: String
- Definition Classes
- MultiDocumentAssembler → Identifiable
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- def write: MLWriter
- Definition Classes
- DefaultParamsWritable → MLWritable
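The write and save members listed above come from Spark ML's MLWritable, so a configured assembler can be persisted like any other pipeline stage. The sketch below assumes a local SparkSession and a temporary directory chosen only for illustration; it uses write.overwrite() so that re-running it does not fail on an existing path.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession
import com.johnsnowlabs.nlp.MultiDocumentAssembler

// Assumption: a local session is created here; the writer needs an active session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("MultiDocumentAssemblerSaveSketch")
  .getOrCreate()

val stageToSave = new MultiDocumentAssembler()
  .setInputCols("text")
  .setOutputCols("document")
  .setCleanupMode("shrink")

// Hypothetical target location inside a fresh temporary directory.
val saveDir = Files.createTempDirectory("multi-document-assembler").resolve("stage").toString

// write returns an MLWriter; overwrite() allows replacing an existing path.
stageToSave.write.overwrite().save(saveDir)

val savedPath = Paths.get(saveDir)
println(s"Saved stage to: $savedPath")
```

Calling save(path) directly behaves the same way, except that it throws if the path already exists and overwrite is not enabled.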
Deprecated Value Members
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable]) @Deprecated
- Deprecated
(Since version 9)
Inherited from HasOutputAnnotatorType
Inherited from DefaultParamsWritable
Inherited from MLWritable
Inherited from Transformer
Inherited from PipelineStage
Inherited from Logging
Inherited from Params
Inherited from Serializable
Inherited from Identifiable
Inherited from AnyRef
Inherited from Any
Parameters
A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.
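As a rough illustration of the setters and getters described here, the sketch below sets a few parameters and reads them back using members listed on this page (getInputCols, getIdCol, getCleanupMode, isSet, explainParams). The column names text, document and id are assumptions; no SparkSession is needed because nothing is transformed.

```scala
import com.johnsnowlabs.nlp.MultiDocumentAssembler

val paramAssembler = new MultiDocumentAssembler()
  .setInputCols("text")
  .setOutputCols("document")
  .setIdCol("id") // hypothetical id column name

// Getters return the values that were set above.
val inputCols: Array[String] = paramAssembler.getInputCols
val idColumn: String = paramAssembler.getIdCol

// cleanupMode was not set explicitly, so the documented default applies.
val cleanup: String = paramAssembler.getCleanupMode

// isSet reports only explicitly set params; metadataCol was never set.
val idColIsSet: Boolean = paramAssembler.isSet(paramAssembler.idCol)
val metadataColIsSet: Boolean = paramAssembler.isSet(paramAssembler.metadataCol)

// explainParams lists every param with its doc string and current value.
println(paramAssembler.explainParams())
```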
Annotator types
Required input and expected output annotator types
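The output annotator type can also be checked at runtime. The sketch below, which assumes a local SparkSession and reuses the sample sentence from the Example, compares the assembler's outputAnnotatorType with the annotatorType field of the annotations it produces.

```scala
import org.apache.spark.sql.SparkSession
import com.johnsnowlabs.nlp.MultiDocumentAssembler

// Assumption: a local session is created here so the sketch is self-contained.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("MultiDocumentAssemblerAnnotatorTypeSketch")
  .getOrCreate()
import spark.implicits._

val typeData = Seq("Spark NLP is an open-source text processing library.").toDF("text")

val typeAssembler = new MultiDocumentAssembler()
  .setInputCols("text")
  .setOutputCols("document")

// Collect the annotatorType of every produced annotation.
val producedTypes: Array[String] = typeAssembler
  .transform(typeData)
  .selectExpr("explode(document.annotatorType) AS annotatorType")
  .as[String]
  .collect()

println(s"Declared: ${typeAssembler.outputAnnotatorType}, produced: ${producedTypes.mkString(", ")}")
```

Downstream annotators that require DOCUMENT input can therefore be pointed at any of this assembler's output columns.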