Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
The way Hive currently generates plan for MERGE statement can lead to wrong results with null safe equality.
To illustrate consider the following reproducer
create table ttarget(s string, j int, flag string) stored as orc tblproperties("transactional"="true"); truncate table ttarget; insert into ttarget values('not_null', 1, 'dont udpate'), (null,2, 'update'); create table tsource (i int); insert into tsource values(null),(2);
Let's say you have the following MERGE statement
explain merge into ttarget using tsource on i<=>j when matched THEN UPDATE set flag='updated' when not matched THEN INSERT VALUES('new', 1999, 'true');
With this MERGE ONLY ONE row should match in target which should be updated. But currently due to the plan hive generate it will end up matching both rows.
This is because MERGE statement is rewritten into RIGHT OUTER JOIN + FILTER corresponding to all branches.
The part of the plan generated by hive for this statement consist of:
Map 2 Map Operator Tree: TableScan alias: tsource Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Right Outer Join 0 to 1 keys: 0 j (type: int) 1 i (type: int) nullSafes: [true] outputColumnNames: _col0, _col1, _col5, _col6 input vertices: 0 Map 1 Statistics: Num rows: 1 Data size: 206 Basic stats: COMPLETE Column stats: NONE HybridGraceHashJoin: true Filter Operator predicate: (_col6 IS NOT DISTINCT FROM _col1) (type: boolean) Statistics: Num rows: 1 Data size: 206 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col5 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>), _col0 (type: string), _col1 (type: int) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 1 Data size: 206 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>) sort order: + Map-reduce partition columns: UDFToInteger(_col0) (type: int) Statistics: Num rows: 1 Data size: 206 Basic stats: COMPLETE Column stats: NONE value expressions: _col1 (type: string), _col2 (type: int)
Result after JOIN will be :
select s,j,i from ttarget right outer join tsource on i<=>j ; NULL NULL NULL NULL NULL 2
On this resultset predicate (_col6 IS NOT DISTINCT FROM _col1) will be true for both resulting into both rows matching.