package bucketing
Value Members
- object CoalesceBucketsInJoin extends Rule[SparkPlan]
This rule coalesces one side of the SortMergeJoin and ShuffledHashJoin if the following conditions are met:
- Two bucketed tables are joined.
- Join keys match with output partition expressions on their respective sides.
- The larger bucket number is divisible by the smaller bucket number.
- COALESCE_BUCKETS_IN_JOIN_ENABLED is set to true.
- The ratio of the larger bucket number to the smaller is less than or equal to the value set in COALESCE_BUCKETS_IN_JOIN_MAX_BUCKET_RATIO.
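The numeric part of these conditions can be sketched in plain Scala. This is a toy model, not Spark's implementation: `CoalesceCheck`, `isDivisible`, and `coalescedBuckets` are hypothetical names, the `enabled` and `maxRatio` parameters stand in for the two configuration entries, and treating the ratio limit as inclusive is an assumption of the sketch. The join-key condition is not modelled.

```scala
// Toy model of the bucket-count conditions only; not Spark's implementation.
object CoalesceCheck {
  /** True when the counts differ and the larger is a multiple of the smaller. */
  def isDivisible(a: Int, b: Int): Boolean = {
    val small = math.min(a, b)
    val large = math.max(a, b)
    a != b && small > 0 && large % small == 0
  }

  /** Returns the bucket count after coalescing, or None if the conditions fail. */
  def coalescedBuckets(
      numLeft: Int,
      numRight: Int,
      enabled: Boolean, // stands in for COALESCE_BUCKETS_IN_JOIN_ENABLED
      maxRatio: Int     // stands in for COALESCE_BUCKETS_IN_JOIN_MAX_BUCKET_RATIO
  ): Option[Int] = {
    val small = math.min(numLeft, numRight)
    val large = math.max(numLeft, numRight)
    // Assumption: the ratio limit is inclusive.
    // isDivisible is checked first, so small is known to be positive before dividing.
    if (enabled && isDivisible(numLeft, numRight) && large / small <= maxRatio) Some(small)
    else None
  }
}
```

In this sketch, the side with more buckets is the one that would be coalesced down to the smaller count, which is why the smaller count is returned.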
- object DisableUnnecessaryBucketedScan extends Rule[SparkPlan]
Disables unnecessary bucketed table scans based on the actual physical query plan.
NOTE: this rule is designed to be applied right after EnsureRequirements, where all ShuffleExchangeExec and SortExec have been added to the plan properly.
When BUCKETING_ENABLED and AUTO_BUCKETED_SCAN_ENABLED are both set to true, the rule walks the query plan to find where a bucketed table scan is unnecessary, and disables the bucketed scan if either of the following holds:
1. The sub-plan from the root to the bucketed table scan does not contain a hasInterestingPartition operator.
2. The sub-plan from the nearest downstream hasInterestingPartition operator to the bucketed table scan contains only isAllowedUnaryExecNode operators and at least one Exchange.
Examples:
1. no hasInterestingPartition operator:
               Project
                  |
                Filter
                  |
            Scan(t1: i, j)
 (bucketed on column j, DISABLE bucketed scan)

2. join:
        SortMergeJoin(t1.i = t2.j)
           /            \
       Sort(i)        Sort(j)
         /               \
     Shuffle(i)       Scan(t2: i, j)
       /         (bucketed on column j, enable bucketed scan)
  Scan(t1: i, j)
(bucketed on column j, DISABLE bucketed scan)

3. aggregate:
        HashAggregate(i, ..., Final)
                     |
                 Shuffle(i)
                     |
        HashAggregate(i, ..., Partial)
                     |
                   Filter
                     |
                 Scan(t1: i, j)
 (bucketed on column j, DISABLE bucketed scan)
The idea of hasInterestingPartition is inspired by the notion of "interesting order" in the paper "Access Path Selection in a Relational Database Management System" (https://dl.acm.org/doi/10.1145/582095.582099).
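The two disabling conditions can be illustrated with a small stand-alone traversal. This is a minimal sketch over a toy plan tree, not the rule's actual code: `Node`, `Scan`, `Op`, and `BucketedScanCheck.decide` are hypothetical names, and the `interesting`, `allowedUnary`, and `exchange` flags are set by hand to stand in for hasInterestingPartition, isAllowedUnaryExecNode, and Exchange. Which operators carry which flag in the usage below is an assumption made to mirror the join example.

```scala
// Toy plan nodes; the three flags stand in for the checks named in the text.
sealed trait Node { def children: Seq[Node] }
case class Scan(table: String, bucketed: Boolean) extends Node {
  val children: Seq[Node] = Nil
}
case class Op(
    name: String,
    children: Seq[Node],
    interesting: Boolean = false,  // stands in for hasInterestingPartition
    allowedUnary: Boolean = false, // stands in for isAllowedUnaryExecNode
    exchange: Boolean = false      // stands in for Exchange
) extends Node

object BucketedScanCheck {
  /** Maps each scanned table to whether its bucketed scan would be disabled. */
  def decide(
      node: Node,
      sawInteresting: Boolean = false, // an interesting operator exists above
      sawExchange: Boolean = false,    // an exchange exists since that operator
      onlyAllowed: Boolean = true      // only allowed unary operators since then
  ): Map[String, Boolean] = node match {
    case Scan(table, bucketed) =>
      // Condition 1: no interesting operator above.
      // Condition 2: only allowed operators plus at least one exchange.
      val disable = bucketed && (!sawInteresting || (sawExchange && onlyAllowed))
      Map(table -> disable)
    case op: Op if op.interesting =>
      // This operator becomes the nearest interesting one; reset the tracking.
      op.children.flatMap(decide(_, true, false, true)).toMap
    case op: Op if op.exchange =>
      op.children.flatMap(decide(_, sawInteresting, true, onlyAllowed)).toMap
    case op: Op =>
      op.children
        .flatMap(decide(_, sawInteresting, sawExchange, onlyAllowed && op.allowedUnary))
        .toMap
  }
}

object BucketedScanDemo {
  // Mirrors the join example: t1 sits under a shuffle, t2 does not.
  val joinPlan: Node = Op("SortMergeJoin", Seq(
    Op("Sort", Seq(Op("Shuffle", Seq(Scan("t1", bucketed = true)), exchange = true)),
      allowedUnary = true),
    Op("Sort", Seq(Scan("t2", bucketed = true)), allowedUnary = true)
  ), interesting = true)
}
```

Calling `BucketedScanCheck.decide(BucketedScanDemo.joinPlan)` marks `t1` as disabled and leaves `t2` enabled, matching the join example.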
- object ExtractJoinWithBuckets
An extractor that extracts SortMergeJoinExec and ShuffledHashJoin where both sides of the join are bucketed tables, each side consists of only the scan operation, and the numbers of buckets are not equal but divisible.
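For readers unfamiliar with Scala extractor objects, the shape of such an extractor can be sketched with an `unapply` method over toy plan types. Everything here is hypothetical and for illustration only: `Plan`, `BucketedScan`, `OtherOp`, `ToyJoin`, and `ToyExtractJoinWithBuckets` are not Spark classes, and the sketch assumes that "only the scan operation" means each join side is a bare scan.

```scala
// Toy plan types; hypothetical stand-ins, not Spark classes.
sealed trait Plan
case class BucketedScan(table: String, numBuckets: Int) extends Plan
case class OtherOp(child: Plan) extends Plan
case class ToyJoin(left: Plan, right: Plan) extends Plan

object ToyExtractJoinWithBuckets {
  // Matches only joins whose two sides are bare bucketed scans with unequal
  // bucket counts where the larger count is a multiple of the smaller.
  def unapply(plan: Plan): Option[(ToyJoin, Int, Int)] = plan match {
    case j @ ToyJoin(BucketedScan(_, l), BucketedScan(_, r))
        if l != r && l > 0 && r > 0 && math.max(l, r) % math.min(l, r) == 0 =>
      Some((j, l, r))
    case _ => None
  }
}

object ExtractorDemo {
  // Using the extractor in a pattern match binds the join and both bucket counts.
  def describe(plan: Plan): String = plan match {
    case ToyExtractJoinWithBuckets(_, l, r) => s"coalesce to ${math.min(l, r)} buckets"
    case _ => "no match"
  }
}
```

Returning the join together with both bucket counts lets a caller apply further checks, such as a ratio limit, in a pattern guard without recomputing them.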