Overrides QueryExecution with a special debug workflow.
Analyzes the given table in the current database to generate statistics, which will be used in query optimizations.
Right now, it only supports Hive tables and it only updates the size of a Hive table in the Hive metastore.
1.2.0
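A minimal usage sketch, assuming a spark-shell-style SparkContext named `sc` and a placeholder Hive table name, and assuming the `analyze(tableName)` method described here is exposed on HiveContext:

```scala
import org.apache.spark.sql.hive.HiveContext

// Assumes an existing SparkContext `sc`; `my_table` is a placeholder name.
val hiveCtx = new HiveContext(sc)

// Computes statistics for the table; per the note above, this currently
// only records the table's total size in the Hive metastore.
hiveCtx.analyze("my_table")
```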
Sets up the system initially or after a RESET command.
When true, a table created by a Hive CTAS statement (no USING clause) will be converted to a data source table, using the data source set by spark.sql.sources.default. The table in the CTAS statement will be converted when it meets any of the following conditions:
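A hedged sketch of enabling this conversion at runtime; `hiveCtx` is assumed to be an existing HiveContext, and the table names are placeholders:

```scala
// Enable conversion of plain Hive CTAS (no USING clause) into a
// data source table backed by spark.sql.sources.default.
hiveCtx.setConf("spark.sql.hive.convertCTAS", "true")

// With the flag on, a qualifying CTAS may now produce a data source
// table rather than a Hive SerDe table.
hiveCtx.sql("CREATE TABLE copy_of_src AS SELECT * FROM src")
```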
When true, enables an experimental feature where metastore tables that use the parquet SerDe are automatically converted to use the Spark SQL parquet table scan, instead of the Hive SerDe.
When true, also tries to merge possibly different but compatible Parquet schemas in different Parquet data files.
This configuration is only effective when "spark.sql.hive.convertMetastoreParquet" is true.
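A short sketch of turning both flags on together, since the schema-merging flag is only consulted when the conversion flag is set; `hiveCtx` is again an assumed HiveContext:

```scala
// Convert metastore Parquet tables to Spark SQL's native Parquet scan.
hiveCtx.setConf("spark.sql.hive.convertMetastoreParquet", "true")

// Merge different-but-compatible Parquet schemas across data files;
// ignored unless the conversion flag above is true.
hiveCtx.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "true")
```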
The copy of the Hive client that is used for execution. Currently this must always be Hive 13, as this is the version of Hive that is packaged with Spark SQL. This copy of the client is used for execution-related tasks like registering temporary functions or ensuring that the ThreadLocal SessionState is correctly populated. This copy of Hive is *not* used for storing persistent metadata, and only points to a dummy metastore in a temporary directory.
The location of the Hive source code.
The location of the compiled Hive distribution.
A comma-separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with; for example, Hive UDFs that are declared in a prefix that would typically be shared (i.e. org.apache.spark.*).
The location of the jars that should be used to instantiate the HiveMetastoreClient. This property can be one of three options:
A comma-separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared is the JDBC drivers that are needed to talk to the metastore. Other classes that need to be shared are those that interact with classes that are already shared, for example, custom appenders that are used by log4j.
The version of the Hive client that will be used to communicate with the metastore. Note that this does not necessarily need to be the same version of Hive that is used internally by Spark SQL for execution.
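The four metastore-client settings above are normally fixed before the HiveContext is created. A sketch of wiring them through SparkConf; all values shown are illustrative assumptions, not defaults:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Version of the Hive client used to talk to the metastore.
  .set("spark.sql.hive.metastore.version", "0.13.1")
  // Where to find that client's jars; "builtin" uses the Hive that is
  // packaged with Spark SQL.
  .set("spark.sql.hive.metastore.jars", "builtin")
  // Prefixes loaded by the classloader shared between Spark SQL and Hive,
  // e.g. the JDBC driver needed to reach the metastore database.
  .set("spark.sql.hive.metastore.sharedPrefixes", "com.mysql.jdbc")
  // Prefixes reloaded for each Hive version Spark SQL communicates with
  // (the prefix here is purely hypothetical).
  .set("spark.sql.hive.metastore.barrierPrefixes", "com.example.hive.udfs")
```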
The copy of the Hive client that is used to retrieve metadata from the Hive MetaStore. The version of the Hive client that is used here must match the metastore that is configured in the hive-site.xml file.
Records the UDFs present when the server starts, so we can delete ones that are created by tests.
Invalidates and refreshes all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache.
1.3.0
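A minimal usage sketch; `hiveCtx` and the table name are assumptions, as above:

```scala
// Data for `my_table` was rewritten outside Spark SQL (for example, files
// were added directly to its storage location), so drop cached metadata:
hiveCtx.refreshTable("my_table")
```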
Resets the test instance by deleting any tables that have been created. TODO: also clear out UDFs, views, etc.
A list of test tables and the DDL required to initialize them. A test table is loaded on demand when a query is run against it.
(Since version 1.3.0) Use createDataFrame
(Since version 1.4.0) Use read.jdbc()
(Since version 1.4.0) Use read.json()
(Since version 1.4.0) Use read.format(source).schema(schema).options(options).load()
(Since version 1.4.0) Use read.format(source).options(options).load()
(Since version 1.4.0) Use read.format(source).load(path)
(Since version 1.4.0) Use read.load(path)
(Since version 1.4.0) Use read.parquet()
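For illustration, a before/after sketch of migrating a few of the deprecated calls above to the DataFrameReader API; the paths and `hiveCtx` are placeholders:

```scala
// Deprecated since 1.4.0:
val oldJson    = hiveCtx.jsonFile("people.json")
val oldParquet = hiveCtx.parquetFile("data.parquet")

// Replacements via the unified read interface:
val newJson    = hiveCtx.read.json("people.json")
val newParquet = hiveCtx.read.parquet("data.parquet")
```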
A locally running test instance of Spark's Hive execution engine.
Data from testTables will be automatically loaded whenever a query is run over those tables. Calling reset will delete all tables and other state in the database, leaving the database in a "clean" state.
TestHive is the singleton object version of this class because instantiating multiple copies of the Hive metastore seems to lead to weird non-deterministic failures. Therefore, the execution of test cases that rely on TestHive must be serialized.
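A sketch of typical serialized usage in a test, assuming the predefined `src` test table that TestHive ships with:

```scala
import org.apache.spark.sql.hive.test.TestHive

// Querying `src` triggers its on-demand load from testTables.
val rows = TestHive.sql("SELECT key, value FROM src LIMIT 10").collect()

// Return the metastore to a "clean" state before the next test runs.
TestHive.reset()
```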