Package ch.cern.sparkmeasure

package sparkmeasure

Type Members

  1. class FlightRecorderStageMetrics extends StageInfoRecorderListener

    FlightRecorderStageMetrics - Use Spark Listeners defined in stagemetrics.scala to record task metrics data aggregated at the Stage level, without changing the application code. The resulting data can be saved to a file and/or printed to stdout.

    Use: add the following configuration to spark-submit (or to the Spark Session): --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics

    Additional configuration parameters:
      --conf spark.sparkmeasure.outputFormat=<format>, valid values: java, json, json_to_hadoop; default: "json".
        Note: the java and json serialization formats write to the driver's local filesystem; json_to_hadoop writes JSON-serialized metrics to HDFS or to a Hadoop-compliant filesystem, such as s3a.
      --conf spark.sparkmeasure.outputFilename=<output file>, default: "/tmp/stageMetrics_flightRecorder"
      --conf spark.sparkmeasure.printToStdout=<true|false>, default: false. Set to true to print JSON-serialized metrics to stdout.
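
    A minimal sketch of enabling the stage-level flight recorder from application code instead of spark-submit flags; the application name and output path below are illustrative, the configuration keys are the ones listed above.

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("FlightRecorderStageMetricsExample")  // illustrative application name
        .config("spark.extraListeners", "ch.cern.sparkmeasure.FlightRecorderStageMetrics")
        .config("spark.sparkmeasure.outputFormat", "json")
        .config("spark.sparkmeasure.outputFilename", "/tmp/stageMetrics_flightRecorder")
        .getOrCreate()

      // Run the workload as usual: stage-level metrics are serialized to the
      // output file when the application ends, with no changes to the job code.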

  2. class FlightRecorderTaskMetrics extends TaskInfoRecorderListener

    FlightRecorderTaskMetrics - Use a Spark Listener to record task metrics data and save them to a file.

    Use: add the following configuration to spark-submit (or to the Spark Session): --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderTaskMetrics

    Additional configuration parameters:
      --conf spark.sparkmeasure.outputFormat=<format>, valid values: java, json, json_to_hadoop; default: "json".
        Note: the java and json serialization formats write to the driver's local filesystem; json_to_hadoop writes JSON-serialized metrics to HDFS or to a Hadoop-compliant filesystem, such as s3a.
      --conf spark.sparkmeasure.outputFilename=<output file>, default: "/tmp/taskMetrics_flightRecorder"
      --conf spark.sparkmeasure.printToStdout=<true|false>, default: false. Set to true to print JSON-serialized metrics to stdout.
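
    A minimal sketch of the task-level flight recorder writing JSON to a Hadoop-compliant filesystem and also echoing metrics to stdout; the s3a path is illustrative and assumes the corresponding Hadoop/S3 configuration is already in place.

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .config("spark.extraListeners", "ch.cern.sparkmeasure.FlightRecorderTaskMetrics")
        .config("spark.sparkmeasure.outputFormat", "json_to_hadoop")
        .config("spark.sparkmeasure.outputFilename", "s3a://mybucket/metrics/taskMetrics_flightRecorder")  // illustrative path
        .config("spark.sparkmeasure.printToStdout", "true")  // also print JSON-serialized metrics to stdout
        .getOrCreate()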

  3. class InfluxDBSink extends SparkListener

    InfluxDBSink: writes Spark metrics and application info in near real-time to InfluxDB. Use this mode to monitor the Spark execution workload and to feed Grafana dashboards and analytics of job execution. How to use: attach the InfluxDBSink to a Spark Context using the extra listener infrastructure. Example: --conf spark.extraListeners=ch.cern.sparkmeasure.InfluxDBSink

    Configuration for InfluxDBSink is handled with Spark conf parameters:

    spark.sparkmeasure.influxdbURL, example value: http://mytestInfluxDB:8086
    spark.sparkmeasure.influxdbUsername (can be empty)
    spark.sparkmeasure.influxdbPassword (can be empty)
    spark.sparkmeasure.influxdbName, defaults to "sparkmeasure"
    spark.sparkmeasure.influxdbStagemetrics, boolean, default is false

    This code depends on influxdb-java; you may need to add the dependency: --packages org.influxdb:influxdb-java:2.14

    InfluxDBSinkExtended: provides additional, verbose info on task execution. Use: --conf spark.extraListeners=ch.cern.sparkmeasure.InfluxDBSinkExtended

    InfluxDBSink: the amount of data generated is relatively small in most applications, O(number_of_stages). InfluxDBSinkExtended can generate a large amount of data, O(number_of_tasks); use with care.
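
    A minimal sketch of attaching InfluxDBSink with its configuration parameters; the InfluxDB URL shown is the example value from above and the credentials are illustrative.

      import org.apache.spark.sql.SparkSession

      // Requires the influxdb-java dependency, e.g. --packages org.influxdb:influxdb-java:2.14
      val spark = SparkSession.builder()
        .config("spark.extraListeners", "ch.cern.sparkmeasure.InfluxDBSink")
        .config("spark.sparkmeasure.influxdbURL", "http://mytestInfluxDB:8086")
        .config("spark.sparkmeasure.influxdbUsername", "")          // can be empty
        .config("spark.sparkmeasure.influxdbPassword", "")          // can be empty
        .config("spark.sparkmeasure.influxdbName", "sparkmeasure")  // default database name
        .config("spark.sparkmeasure.influxdbStagemetrics", "true")  // default is false
        .getOrCreate()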

  4. class InfluxDBSinkExtended extends InfluxDBSink

    InfluxDBSinkExtended extends the basic InfluxDBSink functionality with a verbose dump of task metrics and task info into InfluxDB. Note: this can generate a large amount of data, O(number_of_tasks). For configuration parameters and usage, see InfluxDBSink.

  5. case class PushGateway(serverIPnPort: String, metricsJob: String) extends Product with Serializable

    serverIPnPort: String with the Prometheus Pushgateway host:port; metricsJob: the job name.
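
    A minimal sketch of constructing a PushGateway from the case-class signature above; the host, port and job name are illustrative.

      val pushGateway = ch.cern.sparkmeasure.PushGateway(
        serverIPnPort = "myPushGatewayHost:9091",  // illustrative Prometheus Pushgateway host:port
        metricsJob = "spark_metrics_job"           // illustrative job name
      )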

  6. case class StageAccumulablesInfo(jobId: Int, stageId: Int, submissionTime: Long, completionTime: Long, accId: Long, name: String, value: Long) extends Product with Serializable

  7. class StageInfoRecorderListener extends SparkListener

  8. case class StageMetrics(sparkSession: SparkSession) extends Product with Serializable

  9. case class StageVals(jobId: Int, jobGroup: String, stageId: Int, name: String, submissionTime: Long, completionTime: Long, stageDuration: Long, numTasks: Int, executorRunTime: Long, executorCpuTime: Long, executorDeserializeTime: Long, executorDeserializeCpuTime: Long, resultSerializationTime: Long, jvmGCTime: Long, resultSize: Long, numUpdatedBlockStatuses: Int, diskBytesSpilled: Long, memoryBytesSpilled: Long, peakExecutionMemory: Long, recordsRead: Long, bytesRead: Long, recordsWritten: Long, bytesWritten: Long, shuffleFetchWaitTime: Long, shuffleTotalBytesRead: Long, shuffleTotalBlocksFetched: Long, shuffleLocalBlocksFetched: Long, shuffleRemoteBlocksFetched: Long, shuffleWriteTime: Long, shuffleBytesWritten: Long, shuffleRecordsWritten: Long) extends Product with Serializable

    Stage Metrics: collects and aggregates metrics at the end of each stage. Task Metrics: collects data at task granularity.

    Example usage for stage metrics:

      val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
      stageMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show)

    The tool is based on using Spark Listeners as the data source, collecting metrics in a ListBuffer of a case class that encapsulates Spark task metrics. The ListBuffer is then transformed into a DataFrame for ease of reporting and analysis.
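
    A minimal sketch of explicit instrumentation around a code block, as an alternative to the runAndMeasure example above; the begin(), end() and printReport() methods are assumptions about the StageMetrics API, not shown on this page.

      val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
      stageMetrics.begin()        // assumed: start collecting stage metrics
      spark.sql("select count(*) from range(1000) cross join range(1000)").show()
      stageMetrics.end()          // assumed: stop collecting
      stageMetrics.printReport()  // assumed: print the aggregated stage metrics report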

  10. case class TaskAccumulablesInfo(jobId: Int, stageId: Int, taskId: Long, submissionTime: Long, finishTime: Long, accId: Long, name: String, value: Long) extends Product with Serializable

  11. class TaskInfoRecorderListener extends SparkListener

  12. case class TaskMetrics(sparkSession: SparkSession, gatherAccumulables: Boolean = false) extends Product with Serializable

  13. case class TaskVals(jobId: Int, jobGroup: String, stageId: Int, index: Long, launchTime: Long, finishTime: Long, duration: Long, schedulerDelay: Long, executorId: String, host: String, taskLocality: Int, speculative: Boolean, gettingResultTime: Long, successful: Boolean, executorRunTime: Long, executorCpuTime: Long, executorDeserializeTime: Long, executorDeserializeCpuTime: Long, resultSerializationTime: Long, jvmGCTime: Long, resultSize: Long, numUpdatedBlockStatuses: Int, diskBytesSpilled: Long, memoryBytesSpilled: Long, peakExecutionMemory: Long, recordsRead: Long, bytesRead: Long, recordsWritten: Long, bytesWritten: Long, shuffleFetchWaitTime: Long, shuffleTotalBytesRead: Long, shuffleTotalBlocksFetched: Long, shuffleLocalBlocksFetched: Long, shuffleRemoteBlocksFetched: Long, shuffleWriteTime: Long, shuffleBytesWritten: Long, shuffleRecordsWritten: Long) extends Product with Serializable

    Stage Metrics: collects and aggregates metrics at the end of each stage. Task Metrics: collects data at task granularity.

    Example usage for task metrics:

      val taskMetrics = ch.cern.sparkmeasure.TaskMetrics(spark)
      taskMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show)

    The tool is based on using Spark Listeners as the data source, collecting metrics in a ListBuffer of a case class that encapsulates Spark task metrics. The ListBuffer is then transformed into a DataFrame for ease of reporting and analysis.
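
    A minimal sketch of turning the collected task metrics into a DataFrame for analysis, as described above; the createTaskMetricsDF method name is an assumption about the TaskMetrics API, not shown on this page.

      val taskMetrics = ch.cern.sparkmeasure.TaskMetrics(spark)
      taskMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000)").show)
      val df = taskMetrics.createTaskMetricsDF()  // assumed: returns one row per task (TaskVals fields)
      df.orderBy(org.apache.spark.sql.functions.desc("duration")).show(10)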

Value Members

  1. object IOUtils

    The object IOUtils contains helper code for the sparkMeasure package. The methods readSerializedStageMetrics and readSerializedTaskMetrics are used to read data serialized into files by the "flight recorder" mode. Two serialization modes are currently supported: Java serialization and JSON serialization with the Jackson library.
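
    A minimal sketch of reading back metrics saved by the flight recorder; the file paths are the default outputFilename values and the returned collections are assumed to contain the StageVals / TaskVals case classes. Which read method applies may depend on the serialization format chosen.

      // Read stage-level metrics written by FlightRecorderStageMetrics
      val stageVals = ch.cern.sparkmeasure.IOUtils.readSerializedStageMetrics("/tmp/stageMetrics_flightRecorder")

      // Read task-level metrics written by FlightRecorderTaskMetrics
      val taskVals = ch.cern.sparkmeasure.IOUtils.readSerializedTaskMetrics("/tmp/taskMetrics_flightRecorder")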

  2. object Utils

    The object Utils contains helper code for the sparkMeasure package. The methods formatDuration and formatBytes are used for printing stage metrics reports.
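
    A minimal sketch of using the formatting helpers; the signatures (a Long input, a String result) are assumptions.

      val durationStr = ch.cern.sparkmeasure.Utils.formatDuration(92500L)    // assumed: milliseconds to a human-readable string
      val bytesStr    = ch.cern.sparkmeasure.Utils.formatBytes(1073741824L)  // assumed: bytes to a human-readable string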
