Details
-
Bug
-
Status: Patch Available
-
Blocker
-
Resolution: Unresolved
-
3.1.2
-
None
-
None
-
None
Description
A map join on a column (which is neither involved in bucketing and partition) causes data loss.
Steps to reproduce:
Env: [hive-dev-box|https://github.com/kgyrtkirk/hive-dev-box] hive 3.1.2.
Create tables:
CREATE TABLE `testj2`( `id` int, `bn` string, `cn` string, `ad` map<string,int>, `mi` array<int>) PARTITIONED BY ( `br` string) CLUSTERED BY ( bn) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES ( 'bucketing_version'='2'); CREATE TABLE `testj1`( `id` int, `can` string, `cn` string, `ad` map<string,int>, `av` boolean, `mi` array<int>) PARTITIONED BY ( `brand` string) CLUSTERED BY ( can) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES ( 'bucketing_version'='2');
insert some data in both:
insert into testj1 values (100, 'mes_1', 'customer_1', map('city1', 560077), false, array(5, 10), 'brand_1'), (101, 'mes_2', 'customer_2', map('city2', 560078), true, array(10, 20), 'brand_2'), (102, 'mes_3', 'customer_3', map('city3', 560079), false, array(15, 30), 'brand_3'), (103, 'mes_4', 'customer_4', map('city4', 560080), true, array(20, 40), 'brand_4'), (104, 'mes_5', 'customer_5', map('city5', 560081), false, array(25, 50), 'brand_5'); insert into table testj2 values (100, 'tv_0', 'customer_0', map('city0', 560076),array(0, 0, 0), 'tv'), (101, 'tv_1', 'customer_1', map('city1', 560077),array(20, 25, 30), 'tv'), (102, 'tv_2', 'customer_2', map('city2', 560078),array(40, 50, 60), 'tv'), (103, 'tv_3', 'customer_3', map('city3', 560079),array(60, 75, 90), 'tv'), (104, 'tv_4', 'customer_4', map('city4', 560080),array(80, 100, 120), 'tv');
Do a join between them:
select t1.id, t1.can, t1.cn, t2.bn,t2.ad, t2.br FROM testj1 t1 JOIN testj2 t2 on (t1.id = t2.id) order by t1.id;
Observed results:
In the plan, I can see a map join. Disabling it gives the correct result.