Package org.apache.spark.sql.execution.datasources

package datasources

Type Members

  1. class CaseInsensitiveMap extends Map[String, String] with Serializable

    Builds a map in which keys are case insensitive
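
    Example (a minimal sketch of the idea only; the wrapper class below is hypothetical and is not the actual implementation, which also implements the full Map contract and Serializable):

      // Hypothetical wrapper that normalizes keys to lower case.
      class LowerCaseKeyMap(original: Map[String, String]) {
        private val normalized = original.map { case (k, v) => k.toLowerCase -> v }
        def get(key: String): Option[String] = normalized.get(key.toLowerCase)
      }

      val opts = new LowerCaseKeyMap(Map("Header" -> "true"))
      assert(opts.get("header") == Some("true")) // lookups ignore key case
      assert(opts.get("HEADER") == Some("true"))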

  2. case class CreateTableUsing(tableIdent: TableIdentifier, userSpecifiedSchema: Option[StructType], provider: String, temporary: Boolean, options: Map[String, String], partitionColumns: Array[String], bucketSpec: Option[BucketSpec], allowExisting: Boolean, managedIfNoPath: Boolean) extends LogicalPlan with Command with Product with Serializable

    Used to represent the operation of creating a table using a data source.

    allowExisting

    If it is true, we will do nothing when the table already exists. If it is false, an exception will be thrown.
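
    Example (for illustration only; the table name and path are hypothetical, and spark is assumed to be an existing SparkSession). DDL like the following is parsed into a CreateTableUsing node:

      spark.sql(
        """CREATE TABLE events
          |USING parquet
          |OPTIONS (path '/data/events')""".stripMargin)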

  3. case class CreateTableUsingAsSelect(tableIdent: TableIdentifier, provider: String, partitionColumns: Array[String], bucketSpec: Option[BucketSpec], mode: SaveMode, options: Map[String, String], child: LogicalPlan) extends UnaryNode with Product with Serializable

    A node used to support CTAS statements and saveAsTable for the data source API. This node is a logical.UnaryNode instead of a logical.Command so that the analyzer can analyze the logical plan that will be used to populate the table, which lets PreWriteCheck detect cases that are not allowed.
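
    Example (a rough sketch; table names are hypothetical, and spark / df are assumed to be an existing SparkSession and DataFrame). Both forms below go through this node:

      // CTAS via SQL
      spark.sql("CREATE TABLE events_copy USING parquet AS SELECT * FROM events")

      // saveAsTable via the DataFrameWriter API
      df.write.format("parquet").mode("overwrite").saveAsTable("events_copy")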

  4. case class CreateTempViewUsing(tableIdent: TableIdentifier, userSpecifiedSchema: Option[StructType], replace: Boolean, provider: String, options: Map[String, String]) extends LogicalPlan with RunnableCommand with Product with Serializable

  5. case class DataSource(sparkSession: SparkSession, className: String, paths: Seq[String] = Nil, userSpecifiedSchema: Option[StructType] = None, partitionColumns: Seq[String] = Seq.empty, bucketSpec: Option[BucketSpec] = None, options: Map[String, String] = Map.empty) extends Logging with Product with Serializable

    The main class responsible for representing a pluggable Data Source in Spark SQL. In addition to acting as the canonical set of parameters that can describe a Data Source, this class is used to resolve a description to a concrete implementation that can be used in a query plan (either batch or streaming) or to write out data using an external library.

    From an end user's perspective a DataSource description can be created explicitly using org.apache.spark.sql.DataFrameReader or CREATE TABLE USING DDL. Additionally, this class is used when resolving a description from a metastore to a concrete implementation.

    Many of the arguments to this class are optional, though depending on the specific API being used these optional arguments might be filled in during resolution using either inference or external metadata. For example, when reading a partitioned table from a file system, partition columns will be inferred from the directory layout even if they are not specified.

    paths

    A list of file system paths that hold data. These will be globbed and qualified before use. This option only works when reading from a FileFormat.

    userSpecifiedSchema

    An optional specification of the schema of the data. When present, schema inference is skipped.

    partitionColumns

    A list of column names that the relation is partitioned by. When this list is empty, the relation is unpartitioned.

    bucketSpec

    An optional specification for bucketing (hash-partitioning) of the data.
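
    Example (a sketch of how this surfaces through the public DataFrameReader API; spark is assumed to be an existing SparkSession and the paths are hypothetical):

      import org.apache.spark.sql.types._

      // Explicit schema: userSpecifiedSchema is set, so inference is skipped.
      val schema = StructType(Seq(
        StructField("id", LongType),
        StructField("name", StringType)))

      val withSchema = spark.read
        .format("json")       // className: resolved to a concrete implementation
        .schema(schema)       // userSpecifiedSchema
        .load("/data/people") // paths

      // No schema given: the schema (and any partition columns encoded in the
      // directory layout) is inferred during resolution.
      val inferred = spark.read.format("json").load("/data/people")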

  6. trait FileCatalog extends AnyRef

    An interface for objects capable of enumerating the files that comprise a relation as well as the partitioning characteristics of those files.

  7. trait FileFormat extends AnyRef

    Used to read and write data stored in files to/from the InternalRow format.
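
    Example (the built-in implementations live in the csv, json, parquet and text subpackages and are normally selected by name through the reader/writer API; spark is assumed to be an existing SparkSession and the paths are hypothetical):

      // Each format name below maps to a FileFormat implementation.
      val csv = spark.read.format("csv").option("header", "true").load("/data/in.csv")
      val text = spark.read.format("text").load("/data/in.txt")

      // Writing goes through the same FileFormat machinery.
      csv.write.format("parquet").save("/data/out")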

  8. case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends spark.Partition with Product with Serializable

    A collection of files that should be read as a single task possibly from multiple partitioned directories.

    TODO: This currently does not take locality information about the files into account.
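
    Example (a sketch built directly from the constructors shown above; the file path and split sizes are hypothetical):

      import org.apache.spark.sql.catalyst.InternalRow
      import org.apache.spark.sql.execution.datasources.{FilePartition, PartitionedFile}

      // Two splits of the same file, with no partition column values to prepend.
      val splits = Seq(
        PartitionedFile(InternalRow.empty, "/data/t/part-00000.json", 0L, 64 * 1024 * 1024L),
        PartitionedFile(InternalRow.empty, "/data/t/part-00000.json", 64 * 1024 * 1024L, 64 * 1024 * 1024L))

      // All of these splits are read by one task.
      val partition = FilePartition(index = 0, files = splits)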

  9. class FileScanRDD extends RDD[InternalRow]

  10. class HadoopFileLinesReader extends Iterator[Text]

    An adaptor from a PartitionedFile to an Iterator of Text, yielding all of the lines in that file.

  11. case class HadoopFsRelation(sparkSession: SparkSession, location: FileCatalog, partitionSchema: StructType, dataSchema: StructType, bucketSpec: Option[BucketSpec], fileFormat: FileFormat, options: Map[String, String]) extends BaseRelation with FileRelation with Product with Serializable

    Acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed.

    location

    A FileCatalog that can enumerate the locations of all the files that comprise this relation.

    partitionSchema

    The schema of the columns (if any) that are used to partition the relation

    dataSchema

    The schema of any remaining columns. Note that if any partition columns are present in the actual data files as well, they are preserved.

    bucketSpec

    Describes the bucketing (hash-partitioning of the files by some column values).

    fileFormat

    A file format that can be used to read and write the data in files.

    options

    Configuration used when reading / writing data.

  12. class ListingFileCatalog extends PartitioningAwareFileCatalog

    A FileCatalog that generates the list of files to process by recursively listing all the files present in paths.

  13. case class LogicalRelation(relation: BaseRelation, expectedOutputAttributes: Option[Seq[Attribute]] = None, metastoreTableIdentifier: Option[TableIdentifier] = None) extends LeafNode with MultiInstanceRelation with Product with Serializable

    Used to link a BaseRelation into a logical query plan.

    Note that sometimes we need to use LogicalRelation to replace an existing leaf node without changing the output attributes' IDs. The expectedOutputAttributes parameter is used for this purpose. See https://issues.apache.org/jira/browse/SPARK-10741 for more details.
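
    Example (a minimal sketch; the relation class below is hypothetical and only illustrates wrapping a BaseRelation, and spark is assumed to be an existing SparkSession):

      import org.apache.spark.sql.SQLContext
      import org.apache.spark.sql.execution.datasources.LogicalRelation
      import org.apache.spark.sql.sources.BaseRelation
      import org.apache.spark.sql.types.{LongType, StructField, StructType}

      // Hypothetical relation: it only exposes a schema.
      class IdOnlyRelation(override val sqlContext: SQLContext) extends BaseRelation {
        override def schema: StructType = StructType(Seq(StructField("id", LongType)))
      }

      // Link the relation into a logical query plan as a leaf node.
      val plan = LogicalRelation(new IdOnlyRelation(spark.sqlContext))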

  14. abstract class OutputWriter extends AnyRef

    ::Experimental:: OutputWriter is used together with HadoopFsRelation for persisting rows to the underlying file system. Subclasses of OutputWriter must provide a zero-argument constructor. An OutputWriter instance is created and initialized when a new output file is opened on executor side. This instance is used to persist rows to this single output file.

    Annotations: @Experimental()

    Since: 1.4.0
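
    Example (a skeleton subclass only; write(row: Row) and close() are assumptions about the abstract contract, since the members are not listed here, and the hard-coded local path is purely illustrative):

      import java.io.{BufferedWriter, FileWriter}
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.execution.datasources.OutputWriter

      // Hypothetical writer with the zero-argument constructor the contract requires.
      class LineOutputWriter extends OutputWriter {
        private val out = new BufferedWriter(new FileWriter("/tmp/part-output.txt"))

        // Assumed abstract member: persist a single row to the open file.
        override def write(row: Row): Unit = out.write(row.mkString(",") + "\n")

        // Assumed abstract member: called once when the output file is finished.
        override def close(): Unit = out.close()
      }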

  15. abstract class OutputWriterFactory extends Serializable

    ::Experimental:: A factory that produces OutputWriters. A new OutputWriterFactory is created on driver side for each write job issued when writing to a HadoopFsRelation, and then gets serialized to executor side to create actual OutputWriters on the fly.

    Annotations: @Experimental()

    Since: 1.4.0

  16. case class Partition(values: InternalRow, files: Seq[FileStatus]) extends Product with Serializable

    A collection of data files from a partitioned relation, along with the partition values in the form of an InternalRow.

  17. case class PartitionedFile(partitionValues: InternalRow, filePath: String, start: Long, length: Long, locations: Array[String] = Array.empty) extends Product with Serializable

    A single file that should be read, along with partition column values that need to be prepended to each row. The reading should start at the first valid record found after start.

  18. abstract class PartitioningAwareFileCatalog extends FileCatalog with Logging

    An abstract class that represents FileCatalogs that are aware of partitioned tables. It provides the necessary methods to parse partition data based on a set of files.
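
    Example (a sketch of the partition-aware behaviour as seen from the public API; spark and df are assumed to be an existing SparkSession and a DataFrame with a year column, and the path is hypothetical):

      // Writes files under .../year=2015/..., .../year=2016/..., and so on.
      df.write.partitionBy("year").parquet("/data/events")

      // On read, the partition column `year` is discovered from the directory
      // names and can be used for filtering (partition pruning).
      val events = spark.read.parquet("/data/events")
      events.where("year = 2016").show()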

  19. class RecordReaderIterator[T] extends Iterator[T]

    An adaptor from a Hadoop RecordReader to an Iterator over the values returned.

    Note that this returns Objects instead of InternalRow because we rely on erasure to pass column batches by pretending they are rows.

  20. case class RefreshResource(path: String) extends LogicalPlan with RunnableCommand with Product with Serializable

  21. case class RefreshTable(tableIdent: TableIdentifier) extends LogicalPlan with RunnableCommand with Product with Serializable

  22. abstract class TextBasedFileFormat extends FileFormat

    The base class for file formats that are based on text files.

  23. case class WriteRelation(sparkSession: SparkSession, dataSchema: StructType, path: String, prepareJobForWrite: (Job) ⇒ OutputWriterFactory, bucketSpec: Option[BucketSpec]) extends Product with Serializable

    A container for all the details required when writing to a table.

Value Members

  1. object PartitionDirectory extends Serializable

  2. package csv

  3. package jdbc

  4. package json

  5. package parquet

  6. package text
