class Doc2Chunk extends Model[Doc2Chunk] with RawAnnotator[Doc2Chunk]
Converts DOCUMENT type annotations into CHUNK type annotations using the contents of a chunkCol.
The chunk text must be contained within the input DOCUMENT. The chunkCol may be either StringType or
ArrayType[StringType] (enable the latter with setIsArray). Useful for annotators that require a CHUNK
type input.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.{Doc2Chunk, DocumentAssembler}
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")
  .setIsArray(true)

val data = Seq(
  ("Spark NLP is an open-source text processing library for advanced natural language processing.",
    Seq("Spark NLP", "text processing library", "natural language processing"))
).toDF("text", "target")

val pipeline = new Pipeline().setStages(Array(documentAssembler, chunkAssembler)).fit(data)
val result = pipeline.transform(data)

result.selectExpr("chunk.result", "chunk.annotatorType").show(false)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
- See also
Chunk2Doc for converting CHUNK annotations to DOCUMENT
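The containment rule in the class description (chunk text must be contained within the input DOCUMENT), together with the lowerCase and failOnMissing parameters, can be pictured with a small sketch in plain Scala. This is an illustration under stated assumptions, not the Spark NLP implementation: the object ChunkLookupSketch and the helper locateChunk are hypothetical names, and lowercasing both the document and the chunk before searching is only one plausible reading of lowerCase.

```scala
// Hypothetical helper for illustration only; not part of Spark NLP.
object ChunkLookupSketch {

  // Returns the inclusive (begin, end) character offsets of `chunk` in `document`,
  // or None when the chunk is absent and failOnMissing is false.
  def locateChunk(
      document: String,
      chunk: String,
      lowerCase: Boolean = true,
      failOnMissing: Boolean = false): Option[(Int, Int)] = {
    // Assumption: with lowerCase enabled, both sides are lowercased before matching.
    val (haystack, needle) =
      if (lowerCase) (document.toLowerCase, chunk.toLowerCase) else (document, chunk)
    val begin = haystack.indexOf(needle)
    if (begin >= 0) Some((begin, begin + chunk.length - 1))
    else if (failOnMissing)
      throw new NoSuchElementException(s"Chunk '$chunk' not found in document")
    else None
  }
}
```

In this sketch, ChunkLookupSketch.locateChunk("Spark NLP is great", "spark nlp") finds the chunk at offsets (0, 8) because matching ignores case by default, while passing lowerCase = false makes the same call return None.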
Instance Constructors
Type Members
- type AnnotatorType = String
- Definition Classes
- HasOutputAnnotatorType
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def $[T](param: Param[T]): T
- Attributes
- protected
- Definition Classes
- Params
- def $$[T](feature: StructFeature[T]): T
- Attributes
- protected
- Definition Classes
- HasFeatures
- def $$[K, V](feature: MapFeature[K, V]): Map[K, V]
- Attributes
- protected
- Definition Classes
- HasFeatures
- def $$[T](feature: SetFeature[T]): Set[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
- def $$[T](feature: ArrayFeature[T]): Array[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- final def checkSchema(schema: StructType, inputAnnotatorType: String): Boolean
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
- val chunkCol: Param[String]
Column that contains the chunk string. Its text must be part of the DOCUMENT
- final def clear(param: Param[_]): Doc2Chunk.this.type
- Definition Classes
- Params
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @HotSpotIntrinsicCandidate() @native()
- def copy(extra: ParamMap): Doc2Chunk
Requirement for annotator copies
- Definition Classes
- RawAnnotator → Model → Transformer → PipelineStage → Params
- def copyValues[T <: Params](to: T, extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
- final def defaultCopy[T <: Params](extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- def explainParam(param: Param[_]): String
- Definition Classes
- Params
- def explainParams(): String
- Definition Classes
- Params
- def extraValidate(structType: StructType): Boolean
- Attributes
- protected
- Definition Classes
- Doc2Chunk → RawAnnotator
- def extraValidateMsg: AnnotatorType
Override for additional custom schema checks
- Attributes
- protected
- Definition Classes
- Doc2Chunk → RawAnnotator
- final def extractParamMap(): ParamMap
- Definition Classes
- Params
- final def extractParamMap(extra: ParamMap): ParamMap
- Definition Classes
- Params
- val failOnMissing: BooleanParam
Whether to fail the job if a chunk is not found within the document; returns empty otherwise (Default: false)
- val features: ArrayBuffer[Feature[_, _, _]]
- Definition Classes
- HasFeatures
- def get[T](feature: StructFeature[T]): Option[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
- def get[K, V](feature: MapFeature[K, V]): Option[Map[K, V]]
- Attributes
- protected
- Definition Classes
- HasFeatures
- def get[T](feature: SetFeature[T]): Option[Set[T]]
- Attributes
- protected
- Definition Classes
- HasFeatures
- def get[T](feature: ArrayFeature[T]): Option[Array[T]]
- Attributes
- protected
- Definition Classes
- HasFeatures
- final def get[T](param: Param[T]): Option[T]
- Definition Classes
- Params
- def getChunkCol: String
Column that contains the chunk string. Its text must be part of the DOCUMENT
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @HotSpotIntrinsicCandidate() @native()
- final def getDefault[T](param: Param[T]): Option[T]
- Definition Classes
- Params
- def getFailOnMissing: Boolean
Whether to fail the job if a chunk is not found within the document; returns empty otherwise (Default: false)
- def getInputCols: Array[String]
- returns
input annotations columns currently used
- Definition Classes
- HasInputAnnotationCols
- def getIsArray: Boolean
Whether the chunkCol is an array of strings (Default: false)
- def getLowerCase: Boolean
Whether to lowercase text for matching (Default: true)
- final def getOrDefault[T](param: Param[T]): T
- Definition Classes
- Params
- final def getOutputCol: String
Gets the name of the annotation column to be generated
- Definition Classes
- HasOutputAnnotationCol
- def getParam(paramName: String): Param[Any]
- Definition Classes
- Params
- def getStartCol: String
Column that has a reference of where the chunk begins
- def getStartColByTokenIndex: Boolean
Whether startCol is expressed as a whitespace-token index rather than a character index (Default: false)
- final def hasDefault[T](param: Param[T]): Boolean
- Definition Classes
- Params
- def hasParam(paramName: String): Boolean
- Definition Classes
- Params
- def hasParent: Boolean
- Definition Classes
- Model
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @HotSpotIntrinsicCandidate() @native()
- def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
- val inputAnnotatorTypes: Array[String]
Input annotator types: DOCUMENT
- Definition Classes
- Doc2Chunk → HasInputAnnotationCols
- final val inputCols: StringArrayParam
Columns that contain the annotations necessary to run this annotator. AnnotatorType is used both as input and output columns if not specified
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
- val isArray: BooleanParam
Whether the chunkCol is an array of strings (Default: false)
- final def isDefined(param: Param[_]): Boolean
- Definition Classes
- Params
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- final def isSet(param: Param[_]): Boolean
- Definition Classes
- Params
- def isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def log: Logger
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logName: String
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- val lowerCase: BooleanParam
Whether to lowercase text for matching (Default: true)
- def msgHelper(schema: StructType): String
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @HotSpotIntrinsicCandidate() @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @HotSpotIntrinsicCandidate() @native()
- def onWrite(path: String, spark: SparkSession): Unit
- Attributes
- protected
- Definition Classes
- ParamsAndFeaturesWritable
- val optionalInputAnnotatorTypes: Array[String]
- Definition Classes
- HasInputAnnotationCols
- val outputAnnotatorType: AnnotatorType
Output annotator types: CHUNK
- Definition Classes
- Doc2Chunk → HasOutputAnnotatorType
- final val outputCol: Param[String]
- Attributes
- protected
- Definition Classes
- HasOutputAnnotationCol
- lazy val params: Array[Param[_]]
- Definition Classes
- Params
- var parent: Estimator[Doc2Chunk]
- Definition Classes
- Model
- def save(path: String): Unit
- Definition Classes
- MLWritable
- Annotations
- @throws("If the input path already exists but overwrite is not enabled.") @Since("1.6.0")
- def set[T](feature: StructFeature[T], value: T): Doc2Chunk.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
- def set[K, V](feature: MapFeature[K, V], value: Map[K, V]): Doc2Chunk.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
- def set[T](feature: SetFeature[T], value: Set[T]): Doc2Chunk.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
- def set[T](feature: ArrayFeature[T], value: Array[T]): Doc2Chunk.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
- final def set(paramPair: ParamPair[_]): Doc2Chunk.this.type
- Attributes
- protected
- Definition Classes
- Params
- final def set(param: String, value: Any): Doc2Chunk.this.type
- Attributes
- protected
- Definition Classes
- Params
- final def set[T](param: Param[T], value: T): Doc2Chunk.this.type
- Definition Classes
- Params
- def setChunkCol(value: String): Doc2Chunk.this.type
Column that contains the chunk string. Its text must be part of the DOCUMENT
- def setDefault[T](feature: StructFeature[T], value: () => T): Doc2Chunk.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
- def setDefault[K, V](feature: MapFeature[K, V], value: () => Map[K, V]): Doc2Chunk.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
- def setDefault[T](feature: SetFeature[T], value: () => Set[T]): Doc2Chunk.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
- def setDefault[T](feature: ArrayFeature[T], value: () => Array[T]): Doc2Chunk.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
- final def setDefault(paramPairs: ParamPair[_]*): Doc2Chunk.this.type
- Attributes
- protected
- Definition Classes
- Params
- final def setDefault[T](param: Param[T], value: T): Doc2Chunk.this.type
- Attributes
- protected[org.apache.spark.ml]
- Definition Classes
- Params
- def setFailOnMissing(value: Boolean): Doc2Chunk.this.type
Whether to fail the job if a chunk is not found within the document; returns empty otherwise (Default: false)
- final def setInputCols(value: String*): Doc2Chunk.this.type
- Definition Classes
- HasInputAnnotationCols
- def setInputCols(value: Array[String]): Doc2Chunk.this.type
Overrides the required annotator columns if different from the default
- Definition Classes
- HasInputAnnotationCols
- def setIsArray(value: Boolean): Doc2Chunk.this.type
Whether the chunkCol is an array of strings (Default: false)
- def setLowerCase(value: Boolean): Doc2Chunk.this.type
Whether to lowercase text for matching (Default: true)
- final def setOutputCol(value: String): Doc2Chunk.this.type
Overrides annotation column name when transforming
- Definition Classes
- HasOutputAnnotationCol
- def setParent(parent: Estimator[Doc2Chunk]): Doc2Chunk
- Definition Classes
- Model
- def setStartCol(value: String): Doc2Chunk.this.type
Column that has a reference of where the chunk begins
- def setStartColByTokenIndex(value: Boolean): Doc2Chunk.this.type
Whether startCol is expressed as a whitespace-token index rather than a character index (Default: false)
- val startCol: Param[String]
Column that has a reference of where the chunk begins
- val startColByTokenIndex: BooleanParam
Whether startCol is expressed as a whitespace-token index rather than a character index (Default: false)
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- Identifiable → AnyRef → Any
- def tokenIndexToCharIndex(text: String, tokenIndex: Int): Int
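The signature above suggests a conversion from a whitespace-token position to a character offset, which is what startColByTokenIndex would require. A minimal sketch of that arithmetic, assuming tokens are separated by single spaces, follows. The object TokenIndexSketch and the function whitespaceTokenToCharIndex are hypothetical names used for illustration; the library's own method may handle edge cases differently.

```scala
// Hypothetical re-creation for illustration only; not the Spark NLP source.
object TokenIndexSketch {

  // Maps the index of a space-separated token to the character offset where it starts.
  def whitespaceTokenToCharIndex(text: String, tokenIndex: Int): Int = {
    val tokens = text.split(" ")
    require(
      tokenIndex >= 0 && tokenIndex < tokens.length,
      s"tokenIndex $tokenIndex is out of range for ${tokens.length} tokens")
    // Every preceding token contributes its own length plus one separating space.
    tokens.take(tokenIndex).map(_.length + 1).sum
  }
}
```

For the sentence used in the class example, token 5 ("text") starts at character 28, so a start value of 5 with token indexing points at the same place as a start value of 28 with character indexing.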
- def transform(dataset: Dataset[_]): DataFrame
- Definition Classes
- Doc2Chunk → Transformer
- def transform(dataset: Dataset[_], paramMap: ParamMap): DataFrame
- Definition Classes
- Transformer
- Annotations
- @Since("2.0.0")
- def transform(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DataFrame
- Definition Classes
- Transformer
- Annotations
- @varargs() @Since("2.0.0")
- final def transformSchema(schema: StructType): StructType
Requirement for pipeline transformation validation. It is called on fit()
- Definition Classes
- RawAnnotator → PipelineStage
- def transformSchema(schema: StructType, logging: Boolean): StructType
- Attributes
- protected
- Definition Classes
- PipelineStage
- Annotations
- @DeveloperApi()
- val uid: String
- Definition Classes
- Doc2Chunk → Identifiable
- def validate(schema: StructType): Boolean
Takes a Dataset and checks whether all the required annotation types are present.
- schema
to be validated
- returns
True if all the required types are present, else false
- Attributes
- protected
- Definition Classes
- RawAnnotator
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- def wrapColumnMetadata(col: Column): Column
- Attributes
- protected
- Definition Classes
- RawAnnotator
- def write: MLWriter
- Definition Classes
- ParamsAndFeaturesWritable → DefaultParamsWritable → MLWritable
Deprecated Value Members
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable]) @Deprecated
- Deprecated
(Since version 9)
Inherited from RawAnnotator[Doc2Chunk]
Inherited from HasOutputAnnotationCol
Inherited from HasInputAnnotationCols
Inherited from HasOutputAnnotatorType
Inherited from ParamsAndFeaturesWritable
Inherited from HasFeatures
Inherited from DefaultParamsWritable
Inherited from MLWritable
Inherited from Model[Doc2Chunk]
Inherited from Transformer
Inherited from PipelineStage
Inherited from Logging
Inherited from Params
Inherited from Serializable
Inherited from Identifiable
Inherited from AnyRef
Inherited from Any
Parameters
A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.
Annotator types
Required input and expected output annotator types