org.apache.spark.sql.catalyst.plans.logical.statsEstimation
Estimate output size and number of rows after a join operator, and update output column stats.
The number of rows of A inner join B on A.k1 = B.k1 is estimated by this basic formula:

    T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1))

where T(X) is the number of rows of X and V(X.k) is the number of distinct values of column k in X. The underlying assumption for this formula is that each value of the smaller domain is included in the larger domain.

Generally, an inner join with multiple join keys can also be estimated based on the above formula:

    T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), V(B.k2)) * ... * max(V(A.kn), V(B.kn)))

However, the denominator can become very large and reduce the result excessively, so we use a conservative strategy and take only the largest max(V(A.ki), V(B.ki)) as the denominator.
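The arithmetic of the two denominators can be illustrated with a minimal sketch. It is not Spark's implementation: the function names `estimate_inner_join_rows` and `estimate_with_full_product` and the input shape (a list of `(V(A.ki), V(B.ki))` pairs) are hypothetical, and the sketch ignores everything else a real estimator would need, such as null counts, value ranges, and column stats updates.

```python
from math import prod
from typing import List, Sequence, Tuple


def _per_key_denominators(key_ndvs: Sequence[Tuple[int, int]]) -> List[int]:
    """Return max(V(A.ki), V(B.ki)) for every join key (hypothetical helper)."""
    if not key_ndvs:
        raise ValueError("at least one join key is required")
    per_key = [max(v_a, v_b) for v_a, v_b in key_ndvs]
    if min(per_key) <= 0:
        raise ValueError("distinct-value counts must be positive")
    return per_key


def estimate_inner_join_rows(
    rows_a: int, rows_b: int, key_ndvs: Sequence[Tuple[int, int]]
) -> float:
    """Conservative estimate: divide by only the largest per-key maximum."""
    return rows_a * rows_b / max(_per_key_denominators(key_ndvs))


def estimate_with_full_product(
    rows_a: int, rows_b: int, key_ndvs: Sequence[Tuple[int, int]]
) -> float:
    """Multi-key formula: divide by the product of all per-key maxima."""
    return rows_a * rows_b / prod(_per_key_denominators(key_ndvs))


# Assumed example statistics: T(A) = 1000, T(B) = 500, two join keys.
keys = [(100, 50), (10, 20)]  # (V(A.k1), V(B.k1)), (V(A.k2), V(B.k2))
conservative = estimate_inner_join_rows(1000, 500, keys)
full_product = estimate_with_full_product(1000, 500, keys)
print(conservative, full_product)
```

With these assumed statistics the per-key maxima are 100 and 20, so the conservative estimate divides by 100 while the full product divides by 2000. The gap between the two grows with every additional join key, which is the shrinkage the conservative strategy is meant to avoid.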