Package org.apache.spark.sql.execution.datasources

package datasources

Type Members

  1. case class AnalyzeCreateTableAsSelect(sparkSession: SparkSession) extends Rule[LogicalPlan] with Product with Serializable

    Analyzes the query in CREATE TABLE AS SELECT (CTAS). After analysis, PreWriteCheck can also detect cases that are not allowed.

  2. class CaseInsensitiveMap extends Map[String, String] with Serializable

    Builds a map in which keys are case insensitive
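
    For example, data source options are wrapped in this map, so option keys match regardless of case. A minimal user-level sketch, assuming the CSV source reads its options through this map (the path is illustrative):

      import org.apache.spark.sql.SparkSession

      object CaseInsensitiveOptionsExample {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("options").master("local[*]").getOrCreate()

          // Because options are stored case-insensitively, "HEADER" and "header"
          // refer to the same CSV option.
          val df = spark.read
            .format("csv")
            .option("HEADER", "true")   // equivalent to .option("header", "true")
            .load("/tmp/example.csv")   // illustrative path

          df.printSchema()
          spark.stop()
        }
      }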

  3. case class CreateTableUsing(tableIdent: TableIdentifier, userSpecifiedSchema: Option[StructType], provider: String, temporary: Boolean, options: Map[String, String], partitionColumns: Array[String], bucketSpec: Option[BucketSpec], allowExisting: Boolean, managedIfNoPath: Boolean) extends LeafNode with Command with Product with Serializable

    Used to represent the operation of creating a table using a data source.

    allowExisting

    If true, we will do nothing when the table already exists. If false, an exception will be thrown.

  4. case class CreateTableUsingAsSelect(tableIdent: TableIdentifier, provider: String, partitionColumns: Array[String], bucketSpec: Option[BucketSpec], mode: SaveMode, options: Map[String, String], query: LogicalPlan) extends LeafNode with Command with Product with Serializable

    A node used to support CTAS statements and saveAsTable for the data source API.
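
    A user-level sketch of two operations that are planned through this node, assuming a parquet provider and a catalog that supports data source tables (table and view names are illustrative):

      import org.apache.spark.sql.{SaveMode, SparkSession}

      object CtasExample {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("ctas").master("local[*]").getOrCreate()
          import spark.implicits._

          Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("src")

          // CREATE TABLE AS SELECT expressed in SQL ...
          spark.sql("CREATE TABLE ctas_table USING parquet AS SELECT id, name FROM src")

          // ... and the equivalent saveAsTable call through the DataFrame writer API.
          spark.table("src").write.format("parquet").mode(SaveMode.ErrorIfExists).saveAsTable("saved_table")

          spark.stop()
        }
      }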

  5. case class CreateTempViewUsing(tableIdent: TableIdentifier, userSpecifiedSchema: Option[StructType], replace: Boolean, provider: String, options: Map[String, String]) extends LeafNode with RunnableCommand with Product with Serializable

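
    A minimal sketch of the DDL this command backs, assuming the json provider (view name and path are illustrative):

      import org.apache.spark.sql.SparkSession

      object TempViewUsingExample {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("temp-view").master("local[*]").getOrCreate()

          // CREATE TEMPORARY VIEW ... USING ... is planned as a CreateTempViewUsing command.
          spark.sql(
            """CREATE TEMPORARY VIEW people_view
              |USING json
              |OPTIONS (path '/tmp/people.json')""".stripMargin)

          spark.sql("SELECT * FROM people_view").show()
          spark.stop()
        }
      }
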
  6. case class DataSource(sparkSession: SparkSession, className: String, paths: Seq[String] = Nil, userSpecifiedSchema: Option[StructType] = None, partitionColumns: Seq[String] = Seq.empty, bucketSpec: Option[BucketSpec] = None, options: Map[String, String] = Map.empty) extends Logging with Product with Serializable

    The main class responsible for representing a pluggable Data Source in Spark SQL. In addition to acting as the canonical set of parameters that can describe a Data Source, this class is used to resolve a description to a concrete implementation that can be used in a query plan (either batch or streaming) or to write out data using an external library.

    From an end user's perspective a DataSource description can be created explicitly using org.apache.spark.sql.DataFrameReader or CREATE TABLE USING DDL. Additionally, this class is used when resolving a description from a metastore to a concrete implementation.

    Many of the arguments to this class are optional, though depending on the specific API being used these optional arguments might be filled in during resolution using either inference or external metadata. For example, when reading a partitioned table from a file system, partition columns will be inferred from the directory layout even if they are not specified.

    paths

    A list of file system paths that hold data. These will be globbed and qualified before use. This option only works when reading from a FileFormat.

    userSpecifiedSchema

    An optional specification of the schema of the data. When present we skip attempting to infer the schema.

    partitionColumns

    A list of column names that the relation is partitioned by. When this list is empty, the relation is unpartitioned.

    bucketSpec

    An optional specification for bucketing (hash-partitioning) of the data.
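
    A user-level sketch of the two ways a DataSource description is typically created, through the DataFrameReader API and through CREATE TABLE USING DDL (paths and names are illustrative):

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.types.StructType

      object DataSourceExample {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("datasource").master("local[*]").getOrCreate()

          // Explicit description through DataFrameReader; supplying a schema
          // corresponds to userSpecifiedSchema and skips schema inference.
          val people = spark.read
            .format("json")
            .schema(new StructType().add("name", "string").add("age", "int"))
            .load("/tmp/people.json")
          people.printSchema()

          // The same description expressed as CREATE TABLE USING DDL; here the
          // schema is inferred from the data.
          spark.sql(
            """CREATE TABLE people
              |USING json
              |OPTIONS (path '/tmp/people.json')""".stripMargin)

          spark.stop()
        }
      }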

  7. case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] with Product with Serializable

    Replaces generic operations with specific variants that are designed to work with Spark SQL Data Sources.

  8. trait FileCatalog extends AnyRef

    An interface for objects capable of enumerating the files that comprise a relation as well as the partitioning characteristics of those files.

  9. trait FileFormat extends AnyRef

    Used to read and write data stored in files to/from the InternalRow format.

  10. case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends spark.Partition with Product with Serializable

    A collection of files that should be read as a single task, possibly from multiple partitioned directories.

    TODO: This currently does not take locality information about the files into account.

  11. class FileScanRDD extends RDD[InternalRow]

  12. class FindDataSourceTable extends Rule[LogicalPlan]

    Replaces a SimpleCatalogRelation with a data source table if its table properties contain data source information.

  13. class HadoopFileLinesReader extends Iterator[Text] with Closeable

    An adaptor from a PartitionedFile to an Iterator of Text, which iterates over all of the lines in that file.

  14. case class HadoopFsRelation(location: FileCatalog, partitionSchema: StructType, dataSchema: StructType, bucketSpec: Option[BucketSpec], fileFormat: FileFormat, options: Map[String, String])(sparkSession: SparkSession) extends BaseRelation with FileRelation with Product with Serializable

    Acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed.

    location

    A FileCatalog that can enumerate the locations of all the files that comprise this relation.

    partitionSchema

    The schema of the columns (if any) that are used to partition the relation

    dataSchema

    The schema of any remaining columns. Note that if any partition columns are present in the actual data files as well, they are preserved.

    bucketSpec

    Describes the bucketing (hash-partitioning of the files by some column values).

    fileFormat

    A file format that can be used to read and write the data in files.

    options

    Configuration used when reading / writing data.

  15. case class InsertIntoDataSourceCommand(logicalRelation: LogicalRelation, query: LogicalPlan, overwrite: Boolean) extends LeafNode with RunnableCommand with Product with Serializable

    Inserts the results of query into a relation that extends InsertableRelation.

  16. case class InsertIntoHadoopFsRelationCommand(outputPath: Path, partitionColumns: Seq[Attribute], bucketSpec: Option[BucketSpec], fileFormat: FileFormat, refreshFunction: () ⇒ Unit, options: Map[String, String], query: LogicalPlan, mode: SaveMode) extends LeafNode with RunnableCommand with Product with Serializable

    A command for writing data to a HadoopFsRelation. Supports both overwriting and appending. Writing to dynamic partitions is also supported. Each InsertIntoHadoopFsRelationCommand issues a single write job, and owns a UUID that identifies this job. Each concrete implementation of HadoopFsRelation should use this UUID together with the task id to generate a unique file path for each task output file. This UUID is passed to the executor side via a property named spark.sql.sources.writeJobUUID.

    Different writer containers, DefaultWriterContainer and DynamicPartitionWriterContainer, are used to write to normal tables and to tables with dynamic partitions, respectively.

    The basic workflow of this command is:

    1. Driver side setup, including output committer initialization and data source specific preparation work for the write job to be issued.
    2. Issue a write job consisting of one or more executor side tasks, each of which writes all rows within an RDD partition.
    3. If no exception is thrown in a task, commit that task, otherwise abort it. If any exception is thrown during task commitment, also abort that task.
    4. If all tasks are committed, commit the job, otherwise abort the job. If any exception is thrown during job commitment, also abort the job.
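
    A user-level sketch of a write that is planned as this command, assuming the parquet FileFormat (paths and column names are illustrative):

      import org.apache.spark.sql.{SaveMode, SparkSession}

      object HadoopFsWriteExample {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("fs-write").master("local[*]").getOrCreate()
          import spark.implicits._

          val df = Seq((1, "2016-01-01", "a"), (2, "2016-01-02", "b")).toDF("id", "dt", "value")

          // Writing to a file-based source issues a single write job; partitionBy
          // produces dynamic partition directories such as dt=2016-01-01/.
          df.write
            .mode(SaveMode.Append)
            .partitionBy("dt")
            .parquet("/tmp/events")

          spark.stop()
        }
      }
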
  17. class ListingFileCatalog extends PartitioningAwareFileCatalog

    A FileCatalog that generates the list of files to process by recursively listing all the files present in paths.

  18. case class LogicalRelation(relation: BaseRelation, expectedOutputAttributes: Option[Seq[Attribute]] = None, metastoreTableIdentifier: Option[TableIdentifier] = None) extends LeafNode with MultiInstanceRelation with Product with Serializable

    Used to link a BaseRelation into a logical query plan.

    Note that sometimes we need to use LogicalRelation to replace an existing leaf node without changing the output attributes' IDs. The expectedOutputAttributes parameter is used for this purpose. See https://issues.apache.org/jira/browse/SPARK-10741 for more details.

  19. abstract class OutputWriter extends AnyRef

    ::Experimental:: OutputWriter is used together with HadoopFsRelation for persisting rows to the underlying file system. Subclasses of OutputWriter must provide a zero-argument constructor. An OutputWriter instance is created and initialized when a new output file is opened on the executor side. This instance is used to persist rows to this single output file.

    Annotations: @Experimental()
    Since: 1.4.0

  20. abstract class OutputWriterFactory extends Serializable

    ::Experimental:: A factory that produces OutputWriters. A new OutputWriterFactory is created on the driver side for each write job issued when writing to a HadoopFsRelation, and then gets serialized to the executor side to create actual OutputWriters on the fly.

    Annotations: @Experimental()
    Since: 1.4.0

  21. case class Partition(values: InternalRow, files: Seq[FileStatus]) extends Product with Serializable

    A collection of data files from a partitioned relation, along with the partition values in the form of an InternalRow.

  22. case class PartitionDirectory(values: InternalRow, path: Path) extends Product with Serializable

    Holds a directory in a partitioned collection of files as well as the partition values in the form of a Row. Before scanning, the files at path need to be enumerated.

  23. case class PartitionSpec(partitionColumns: StructType, partitions: Seq[PartitionDirectory]) extends Product with Serializable

  24. case class PartitionedFile(partitionValues: InternalRow, filePath: String, start: Long, length: Long, locations: Array[String] = Array.empty) extends Product with Serializable

    A single file that should be read, along with partition column values that need to be prepended to each row. The reading should start at the first valid record found after start.
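
    A minimal sketch, e.g. in the spark-shell, of constructing a PartitionedFile directly from the signature above (the path, offsets and partition value are illustrative):

      import org.apache.spark.sql.catalyst.InternalRow
      import org.apache.spark.sql.execution.datasources.PartitionedFile
      import org.apache.spark.unsafe.types.UTF8String

      // One split of a data file under the partition directory dt=2016-01-01/.
      // The partition value is prepended to every row read from this split.
      val file = PartitionedFile(
        partitionValues = InternalRow(UTF8String.fromString("2016-01-01")),
        filePath = "/tmp/events/dt=2016-01-01/part-00000.parquet",
        start = 0L,
        length = 4 * 1024 * 1024L,
        locations = Array("host1", "host2"))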

  25. abstract class PartitioningAwareFileCatalog extends FileCatalog with Logging

    An abstract class that represents FileCatalogs that are aware of partitioned tables. It provides the necessary methods to parse partition data based on a set of files.

  26. case class PreWriteCheck(conf: SQLConf, catalog: SessionCatalog) extends (LogicalPlan) ⇒ Unit with Product with Serializable

    A rule to do various checks before inserting into or writing to a data source table.

  27. case class PreprocessTableInsertion(conf: SQLConf) extends Rule[LogicalPlan] with Product with Serializable

    Preprocesses the InsertIntoTable plan. Throws an exception if the number of columns does not match, or if the specified partition columns differ from the existing partition columns in the target table. It also performs data type casting and field renaming, to make sure that the columns to be inserted have the correct data type and that the fields have the correct names.
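
    A user-level sketch of the casting this rule applies during insertion preprocessing (table name and values are illustrative):

      import org.apache.spark.sql.SparkSession

      object InsertPreprocessExample {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("insert").master("local[*]").getOrCreate()

          spark.sql("CREATE TABLE target (id INT, name STRING) USING parquet")

          // Columns from the SELECT are matched against the target schema; the
          // string literal '42' is cast to INT before the write is planned.
          spark.sql("INSERT INTO TABLE target SELECT '42', 'alice'")

          spark.table("target").show()
          spark.stop()
        }
      }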

  28. class RecordReaderIterator[T] extends Iterator[T] with Closeable

    An adaptor from a Hadoop RecordReader to an Iterator over the values returned.

    Note that this returns Objects instead of InternalRow because we rely on erasure to pass column batches by pretending they are rows.

  29. case class RefreshResource(path: String) extends LeafNode with RunnableCommand with Product with Serializable

  30. case class RefreshTable(tableIdent: TableIdentifier) extends LeafNode with RunnableCommand with Product with Serializable

  31. class ResolveDataSource extends Rule[LogicalPlan]

    Tries to replace UnresolvedRelations with relations resolved through DataSource.

  32. abstract class TextBasedFileFormat extends FileFormat

    The base class for file formats that are based on text files.

Value Members

  1. object BucketingUtils

  2. object DataSourceStrategy extends Strategy with Logging

    A Strategy for planning scans over data sources defined using the sources API.

  3. object FileSourceStrategy extends Strategy with Logging

    A strategy for planning scans over collections of files that might be partitioned or bucketed by user specified columns.

    At a high level planning occurs in several phases:

    • Split filters by when they need to be evaluated.
    • Prune the schema of the data requested based on any projections present. Today this pruning is only done on top level columns, but formats should support pruning of nested columns as well.
    • Construct a reader function by passing filters and the schema into the FileFormat.
    • Using partition pruning predicates, enumerate the list of files that should be read.
    • Split the files into tasks and construct a FileScanRDD.
    • Add any projection or filters that must be evaluated after the scan.

    Files are assigned to tasks using the following algorithm:

    • If the table is bucketed, group files by bucket id into the correct number of partitions.
    • If the table is not bucketed or bucketing is turned off:
      • If any file is larger than the threshold, split it into pieces based on that threshold.
      • Sort the files by decreasing file size.
      • Assign the ordered files to buckets using the following algorithm: if the current partition is under the threshold with the addition of the next file, add it; if not, open a new bucket and add it. Proceed to the next file. A sketch of this assignment appears after this list.
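
    A simplified Scala sketch of the non-bucketed file-to-task assignment described above; the flat (path, size) representation and the maxSplitBytes parameter are illustrative simplifications:

      import scala.collection.mutable.ArrayBuffer

      // Each (path, sizeInBytes) pair stands in for a file split produced by the
      // "split large files by the threshold" step.
      def assignFilesToTasks(files: Seq[(String, Long)], maxSplitBytes: Long): Seq[Seq[(String, Long)]] = {
        val tasks = ArrayBuffer.empty[Seq[(String, Long)]]
        var current = ArrayBuffer.empty[(String, Long)]
        var currentSize = 0L

        def closeTask(): Unit = {
          if (current.nonEmpty) {
            tasks += current.toSeq
            current = ArrayBuffer.empty[(String, Long)]
            currentSize = 0L
          }
        }

        // Sort files by decreasing size, then pack greedily: open a new task
        // whenever adding the next file would exceed the threshold.
        files.sortBy { case (_, size) => -size }.foreach { case (path, size) =>
          if (currentSize + size > maxSplitBytes) {
            closeTask()
          }
          current += ((path, size))
          currentSize += size
        }
        closeTask()
        tasks.toSeq
      }
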
  4. object HadoopFsRelation extends Logging with Serializable

    Helper methods for gathering metadata from HDFS.

  5. object PartitionDirectory extends Serializable

  6. object PartitionSpec extends Serializable

  7. object PartitioningUtils

  8. package csv

  9. package jdbc

  10. package json

  11. package parquet

  12. package text
