A FileIndex for a metastore catalog table.
Create a table and optionally insert some data into it.
Create or replace a local/global temporary view with given data source.
The main class responsible for representing a pluggable Data Source in Spark SQL.
The main class responsible for representing a pluggable Data Source in Spark SQL. In addition to acting as the canonical set of parameters that can describe a Data Source, this class is used to resolve a description to a concrete implementation that can be used in a query plan (either batch or streaming) or to write out data using an external library.
From an end user's perspective a DataSource description can be created explicitly using org.apache.spark.sql.DataFrameReader or CREATE TABLE USING DDL. Additionally, this class is used when resolving a description from a metastore to a concrete implementation.
Many of the arguments to this class are optional, though depending on the specific API being used these optional arguments might be filled in during resolution using either inference or external metadata. For example, when reading a partitioned table from a file system, partition columns will be inferred from the directory layout even if they are not specified.
A list of file system paths that hold data. These will be globbed before and qualified. This option only works when reading from a FileFormat.
An optional specification of the schema of the data. When present we skip attempting to infer the schema.
A list of column names that the relation is partitioned by. This list is
generally empty during the read path, unless this DataSource is managed
by Hive. In these cases, during resolveRelation
, we will call
getOrInferFileFormatSchema
for file based DataSources to infer the
partitioning. In other cases, if this list is empty, then this table
is unpartitioned.
An optional specification for bucketing (hash-partitioning) of the data.
Optional catalog table reference that can be used to push down operations over the datasource to the catalog service.
Replaces generic operations with specific variants that are designed to work with Spark SQL Data Sources.
Replaces generic operations with specific variants that are designed to work with Spark SQL Data Sources.
Note that, this rule must be run after PreprocessTableCreation
and
PreprocessTableInsertion
.
A Strategy for planning scans over data sources defined using the sources API.
Used to read and write data stored in files to/from the InternalRow format.
An interface for objects capable of enumerating the root paths of a relation as well as the partitions of a relation subject to some pruning expressions.
A collection of file blocks that should be read as a single task (possibly from multiple partitioned directories).
An RDD that scans a list of file partitions.
A cache of the leaf files of partition directories.
A cache of the leaf files of partition directories. We cache these files in order to speed up iterated queries over the same set of partitions. Otherwise, each query would have to hit remote storage in order to gather file statistics for physical planning.
Each resolved catalog table has its own FileStatusCache. When the backing relation for the table is refreshed via refreshTable() or refreshByPath(), this cache will be invalidated.
Replaces CatalogRelation with data source table if its table provider is not hive.
An adaptor from a PartitionedFile to an Iterator of Text, which are all of the lines in that file.
Acts as a container for all of the metadata required to read from a datasource.
Acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed.
A FileIndex that can enumerate the locations of all the files that comprise this relation.
The schema of the columns (if any) that are used to partition the relation
The schema of any remaining columns. Note that if any partition columns are present in the actual data files as well, they are preserved.
Describes the bucketing (hash-partitioning of the files by some column values).
A file format that can be used to read and write the data in files.
Configuration used when reading / writing data.
A FileIndex that generates the list of files to process by recursively listing all the
files present in paths
.
Inserts the results of query
in to a relation that extends InsertableRelation.
A command for writing data to a HadoopFsRelation.
A command for writing data to a HadoopFsRelation. Supports both overwriting and appending. Writing to dynamic partitions is also supported.
partial partitioning spec for write. This defines the scope of partition overwrites: when the spec is empty, all partitions are overwritten. When it covers a prefix of the partition keys, only partitions matching the prefix are overwritten.
If true, only write if the partition does not exist. Only valid for static partitions.
Used to link a BaseRelation in to a logical query plan.
OutputWriter is used together with HadoopFsRelation for persisting rows to the underlying file system.
OutputWriter is used together with HadoopFsRelation for persisting rows to the underlying file system. Subclasses of OutputWriter must provide a zero-argument constructor. An OutputWriter instance is created and initialized when a new output file is opened on executor side. This instance is used to persist rows to this single output file.
A factory that produces OutputWriters.
A factory that produces OutputWriters. A new OutputWriterFactory is created on driver side for each write job issued when writing to a HadoopFsRelation, and then gets serialized to executor side to create actual OutputWriters on the fly.
A collection of data files from a partitioned relation, along with the partition values in the form of an InternalRow.
Holds a directory in a partitioned collection of files as well as the partition values in the form of a Row.
Holds a directory in a partitioned collection of files as well as the partition values
in the form of a Row. Before scanning, the files at path
need to be enumerated.
A part (i.e.
A part (i.e. "block") of a single file that should be read, along with partition column values that need to be prepended to each row.
value of partition columns to be prepended to each row.
path of the file to read
the beginning offset (in bytes) of the block.
number of bytes to read.
locality information (list of nodes that have the data).
An abstract class that represents FileIndexs that are aware of partitioned tables.
An abstract class that represents FileIndexs that are aware of partitioned tables. It provides the necessary methods to parse partition data based on a set of files.
Preprocess CreateTable, to do some normalization and checking.
Preprocess the InsertIntoTable plan.
Preprocess the InsertIntoTable plan. Throws exception if the number of columns mismatch, or specified partition columns are different from the existing partition columns in the target table. It also does data type casting and field renaming, to make sure that the columns to be inserted have the correct data type and fields have the correct names.
An adaptor from a Hadoop RecordReader to an Iterator over the values returned.
An adaptor from a Hadoop RecordReader to an Iterator over the values returned.
Note that this returns Objects instead of InternalRow because we rely on erasure to pass column batches by pretending they are rows.
Try to replaces UnresolvedRelations if the plan is for direct query on files.
A variant of HadoopMapReduceCommitProtocol that allows specifying the actual Hadoop output committer using an option specified in SQLConf.
Saves the results of query
in to a data source.
Saves the results of query
in to a data source.
Note that this command is different from InsertIntoDataSourceCommand. This command will call
CreatableRelationProvider.createRelation
to write out the data, while
InsertIntoDataSourceCommand calls InsertableRelation.insert
. Ideally these 2 data source
interfaces should do the same thing, but as we've already published these 2 interfaces and the
implementations may have different logic, we have to keep these 2 different commands.
The base class file format that is based on text file.
A helper object for writing FileFormat data out to a location.
A strategy for planning scans over collections of files that might be partitioned or bucketed by user specified columns.
A strategy for planning scans over collections of files that might be partitioned or bucketed by user specified columns.
At a high level planning occurs in several phases:
Files are assigned into tasks using the following algorithm:
Use FileStatusCache.getOrCreate() to construct a globally shared file status cache.
A rule to check whether the functions are supported only when Hive support is enabled
A non-caching implementation used when partition file status caching is disabled.
A rule to do various checks before inserting into or writing to a data source table.
Create a table and optionally insert some data into it. Note that this plan is unresolved and has to be replaced by the concrete implementations during analysis.
the metadata of the table to be created.
the data writing mode
an optional logical plan representing data to write into the created table.