
org.apache.spark.ml.odkl

ForkedSparkEstimator

Related Docs: object ForkedSparkEstimator | package odkl


class ForkedSparkEstimator[M <: ModelWithSummary[M] with MLWritable, E <: SummarizableEstimator[M] with MLWritable] extends Estimator[M] with SummarizableEstimator[M] with MLWritable

This utility supports evaluating a part of a pipeline in a separate Spark app. There are at least three identified use cases:
  1. A Spark app that needs different settings for ETL and ML.
  2. Support for a larger fork factor in segmented hyperopt (scaling out the driver if it becomes a bottleneck).
  3. Support for parallel XGBoost training (resolves an internal conflict in the Rabit part).

Simple example with linear SGD and Zeppelin in yarn-client mode:

// This estimator will start a new Spark app from an app running in yarn-cluster mode
val secondLevel = new ForkedSparkEstimator[LinearRegressionModel, LinearRegressionSGD](new LinearRegressionSGD().setCacheTrainData(true))
            .setTempPath("tmp/forkedModels")
            // Match only files transferred with the app, re-point to HDFS for a faster start
            .withClassPathPropagation(".*__spark_libs__.*", ".+/" -> "hdfs://my-hadoop-nn/spark/lib/")
            // These files are locally available on all nodes
            .withClassPathPropagation("/opt/.*", "^/" -> "local://")
            // For convenience, propagate configuration when working in non-interactive mode
            .setPropagateConfig(true)
            .setConfOverrides(
                // Enable log aggregation and disable dynamic allocation
                "spark.hadoop.yarn.log-aggregation-enable" -> "true",
                "spark.dynamicAllocation.enabled" -> "false",
                // These files might sneak in when submitted from Zeppelin, suppress them
                "spark.yarn.dist.jars" -> "",
                "spark.yarn.dist.files" -> "",
                "spark.yarn.dist.archives" -> ""
                )
            .setMaster("yarn")
            .setDeployMode("cluster")
            .setSubmitArgs(
                "--num-executors", "1")
            .setName("secondLevel")

// This estimator will start a new Spark app from an interactive Zeppelin session
val firstLevel = new ForkedSparkEstimator[LinearRegressionModel, ForkedSparkEstimator[LinearRegressionModel,LinearRegressionSGD]](secondLevel)
        .setTempPath("tmp/forkedModels")
        // Propagate only odkl-analysis jars, re-point to HDFS for a faster start
        .withClassPathPropagation("/home/.*", ".+/" -> "hdfs://my-hadoop-nn/user/myuser/spark/lib/")
        // Do not propagate the many Zeppelin configs, rely on spark-defaults
        .setPropagateConfig(false)
        .setConfOverrides(
            // Enable log aggregation and disable dynamic allocation
            "spark.hadoop.yarn.log-aggregation-enable" -> "true",
            "spark.dynamicAllocation.enabled" -> "false",
            // This is required to be able to start new spark apps from our app
            "spark.yarn.appMasterEnv.HADOOP_CONF_DIR" -> "/opt/hadoop/etc/hadoop/",
            // This is required to make sure Zeppelin does not fool us into thinking we are a Python app
            "spark.yarn.isPython" -> "false"
             )
        .setMaster("yarn")
        .setDeployMode("cluster")
        .setSubmitArgs(
            "--num-executors", "1")
        .setName("firstLevel")


val doubleForkedPipeline = new Pipeline().setStages(Array(
    new VectorAssembler()
        .setInputCols(Array("first", "second"))
        .setOutputCol("features"),
    firstLevel
    ))
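
The two arguments of withClassPathPropagation in the example above read as a filter regex followed by (pattern -> replacement) rewrites. The following is a minimal sketch of one plausible interpretation, using only the Scala/Java standard library. The names `ClassPathPropagationSketch`, `Rule`, and `propagate` are hypothetical and not part of the odkl API, and the assumption that the filter must match the whole entry and that rewrites are applied with `replaceAll` is not confirmed by this page.

```scala
// Hypothetical sketch: these names are illustrative and NOT part of the odkl API.
object ClassPathPropagationSketch {

  // One propagation rule: a filter regex plus ordered (pattern -> replacement) rewrites.
  final case class Rule(filter: String, transformations: Seq[(String, String)])

  // Assumption: an entry is propagated by the first rule whose filter matches the
  // whole entry; entries matched by no rule are dropped (None).
  def propagate(entry: String, rules: Seq[Rule]): Option[String] =
    rules.find(rule => entry.matches(rule.filter)).map { rule =>
      // Apply each rewrite in order, feeding the result of one into the next.
      rule.transformations.foldLeft(entry) { case (path, (pattern, replacement)) =>
        path.replaceAll(pattern, replacement)
      }
    }
}
```

Under these assumptions, an entry such as `/data/yarn/__spark_libs__/spark-core.jar` (an invented path) passes the first filter from the example, and the greedy `.+/` pattern replaces everything up to the last slash, so only the file name is kept and re-rooted under the HDFS directory.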
Linear Supertypes
MLWritable, SummarizableEstimator[M], Estimator[M], PipelineStage, Logging, Params, Serializable, Serializable, Identifiable, AnyRef, Any

Instance Constructors

  1. new ForkedSparkEstimator(nested: E)

  2. new ForkedSparkEstimator(uid: String)


Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def $[T](param: Param[T]): T

    Attributes
    protected
    Definition Classes
    Params
  4. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. final val classPathPropagations: JacksonParam[Array[ClassPathExpression]]

  7. final def clear(param: Param[_]): ForkedSparkEstimator.this.type

    Definition Classes
    Params
  8. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  9. final val confOverrides: JacksonParam[Map[String, String]]

  10. def copy(extra: ParamMap): ForkedSparkEstimator[M, E]

    Definition Classes
    ForkedSparkEstimator → SummarizableEstimator → Estimator → PipelineStage → Params
  11. def copyValues[T <: Params](to: T, extra: ParamMap): T

    Attributes
    protected
    Definition Classes
    Params
  12. final def defaultCopy[T <: Params](extra: ParamMap): T

    Attributes
    protected
    Definition Classes
    Params
  13. final val deployMode: Param[String]

  14. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  15. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  16. def explainParam(param: Param[_]): String

    Definition Classes
    Params
  17. def explainParams(): String

    Definition Classes
    Params
  18. final def extractParamMap(): ParamMap

    Definition Classes
    Params
  19. final def extractParamMap(extra: ParamMap): ParamMap

    Definition Classes
    Params
  20. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  21. def fit(dataset: Dataset[_]): M

    Definition Classes
    ForkedSparkEstimator → Estimator
  22. def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M]

    Definition Classes
    Estimator
    Annotations
    @Since( "2.0.0" )
  23. def fit(dataset: Dataset[_], paramMap: ParamMap): M

    Definition Classes
    Estimator
    Annotations
    @Since( "2.0.0" )
  24. def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): M

    Definition Classes
    Estimator
    Annotations
    @Since( "2.0.0" ) @varargs()
  25. final def get[T](param: Param[T]): Option[T]

    Definition Classes
    Params
  26. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  27. final def getDefault[T](param: Param[T]): Option[T]

    Definition Classes
    Params
  28. final def getOrDefault[T](param: Param[T]): T

    Definition Classes
    Params
  29. def getParam(paramName: String): Param[Any]

    Definition Classes
    Params
  30. final def hasDefault[T](param: Param[T]): Boolean

    Definition Classes
    Params
  31. def hasParam(paramName: String): Boolean

    Definition Classes
    Params
  32. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  33. def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  34. def initializeLogIfNecessary(isInterpreter: Boolean): Unit

    Attributes
    protected
    Definition Classes
    Logging
  35. final def isDefined(param: Param[_]): Boolean

    Definition Classes
    Params
  36. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  37. final def isSet(param: Param[_]): Boolean

    Definition Classes
    Params
  38. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  39. def log: Logger

    Attributes
    protected
    Definition Classes
    Logging
  40. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  41. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  42. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  43. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  44. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  45. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  46. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  47. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  48. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  49. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  50. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  51. final val mainJar: Param[String]

  52. final val master: Param[String]

  53. final val name: Param[String]

  54. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  55. def nested: E

  56. final def notify(): Unit

    Definition Classes
    AnyRef
  57. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  58. lazy val params: Array[Param[_]]

    Definition Classes
    Params
  59. final val propagateConfig: BooleanParam

  60. def save(path: String): Unit

    Definition Classes
    MLWritable
    Annotations
    @Since( "1.6.0" ) @throws( ... )
  61. final def set(paramPair: ParamPair[_]): ForkedSparkEstimator.this.type

    Attributes
    protected
    Definition Classes
    Params
  62. final def set(param: String, value: Any): ForkedSparkEstimator.this.type

    Attributes
    protected
    Definition Classes
    Params
  63. final def set[T](param: Param[T], value: T): ForkedSparkEstimator.this.type

    Definition Classes
    Params
  64. def setConfOverrides(conf: (String, String)*): ForkedSparkEstimator.this.type

  65. final def setDefault(paramPairs: ParamPair[_]*): ForkedSparkEstimator.this.type

    Attributes
    protected
    Definition Classes
    Params
  66. final def setDefault[T](param: Param[T], value: T): ForkedSparkEstimator.this.type

    Attributes
    protected
    Definition Classes
    Params
  67. def setDeployMode(value: String): ForkedSparkEstimator.this.type

  68. def setMainJar(value: String): ForkedSparkEstimator.this.type

  69. def setMaster(value: String): ForkedSparkEstimator.this.type

  70. def setName(value: String): ForkedSparkEstimator.this.type

  71. def setPropagateConfig(value: Boolean): ForkedSparkEstimator.this.type

  72. def setSubmitArgs(args: String*): ForkedSparkEstimator.this.type

  73. def setSuppressConfigs(name: String*): ForkedSparkEstimator.this.type

  74. def setTempPath(path: String): ForkedSparkEstimator.this.type

  75. final val submitArgs: StringArrayParam

  76. final val suppressConfig: StringArrayParam

  77. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  78. final val tempPath: Param[String]

  79. def toString(): String

    Definition Classes
    Identifiable → AnyRef → Any
  80. def transformSchema(schema: StructType): StructType

    Definition Classes
    ForkedSparkEstimator → PipelineStage
  81. def transformSchema(schema: StructType, logging: Boolean): StructType

    Attributes
    protected
    Definition Classes
    PipelineStage
    Annotations
    @DeveloperApi()
  82. val uid: String

    Definition Classes
    ForkedSparkEstimator → Identifiable
  83. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  84. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  85. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  86. def withClassPathPropagation(filter: String, transformations: (String, String)*): ForkedSparkEstimator.this.type

  87. def write: MLWriter

    Definition Classes
    ForkedSparkEstimator → MLWritable
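
The params master, deployMode, name, propagateConfig, confOverrides, suppressConfig, and submitArgs listed above resemble the inputs of a spark-submit call. As a rough illustration only, the sketch below shows one way such settings could be turned into spark-submit arguments. `SubmitCommandSketch`, `ForkSettings`, `effectiveConf`, and `submitArguments` are hypothetical names, and the merge order (inherit the parent configuration only when propagateConfig is set, drop suppressed keys, then apply overrides) is an assumption rather than the documented behavior of ForkedSparkEstimator. The flags `--master`, `--deploy-mode`, `--name`, and `--conf` are standard spark-submit options.

```scala
// Hypothetical sketch: these names are illustrative and NOT part of the odkl API.
object SubmitCommandSketch {

  // Plain holder mirroring the params listed above.
  final case class ForkSettings(
      master: String,
      deployMode: String,
      name: String,
      propagateConfig: Boolean,
      confOverrides: Map[String, String],
      suppressConfig: Set[String],
      submitArgs: Seq[String])

  // Assumed merge order: optionally inherit the parent configuration, drop the
  // suppressed keys, then let explicit overrides win.
  def effectiveConf(parent: Map[String, String], s: ForkSettings): Map[String, String] = {
    val inherited = if (s.propagateConfig) parent else Map.empty[String, String]
    inherited.filter { case (key, _) => !s.suppressConfig.contains(key) } ++ s.confOverrides
  }

  // Builds the argument list; configuration keys are sorted for a stable order.
  def submitArguments(parent: Map[String, String], s: ForkSettings): Seq[String] = {
    val confArgs = effectiveConf(parent, s).toSeq.sortBy(_._1).flatMap {
      case (key, value) => Seq("--conf", s"$key=$value")
    }
    Seq("--master", s.master, "--deploy-mode", s.deployMode, "--name", s.name) ++
      confArgs ++ s.submitArgs
  }
}
```

With propagateConfig set to false, as for firstLevel in the example, only the explicit overrides would reach the forked app in this sketch, and everything else would come from spark-defaults on the submitting host.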
