Description
For this query:
select d.d_year, count(*) cnt FROM store_sales, date_dim d, customer c WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = d.d_date_sk group by d.d_year
Current optimized plan is
== Optimized Logical Plan ==
Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) AS cnt#425L]
Project [d_year#147]
Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && (c_first_shipto_date_sk#106 = d_date_sk#141)))
Project [d_date_sk#141,d_year#147,ss_customer_sk#283]
Join Inner, None
Project [ss_customer_sk#283]
Relation[] ParquetRelation[store_sales]
Project [d_date_sk#141,d_year#147]
Relation[] ParquetRelation[date_dim]
Project [c_customer_sk#101,c_first_shipto_date_sk#106]
Relation[] ParquetRelation[customer]
It will join store_sales and date_dim together without any condition, the condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed to it because the bad order of joins.
The optimizer should re-order the joins, join date_dim after customer, then it can pushed down the condition correctly.
The plan should be
Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) AS cnt#425L]
Project [d_year#147]
Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141))
Project [c_first_shipto_date_sk#106]
Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101))
Project [ss_customer_sk#283]
Relation[store_sales]
Project [c_first_shipto_date_sk#106,c_customer_sk#101]
Relation[customer]
Project [d_year#147,d_date_sk#141]
Relation[date_dim]
Attachments
Issue Links
- causes
-
SPARK-26352 Join reordering should not change the order of output attributes
- Resolved
- links to