[HIVE-21481] MERGE correctness issues with null safe equality - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Query Planning
Labels:
None

Target Version/s:

4.0.0

Description

The way Hive currently generates plan for MERGE statement can lead to wrong results with null safe equality.

To illustrate consider the following reproducer

create table ttarget(s string, j int, flag string) stored as orc tblproperties("transactional"="true");
truncate table ttarget;
insert into ttarget values('not_null', 1, 'dont udpate'), (null,2, 'update');

create table tsource (i int);
insert into tsource values(null),(2);

Let's say you have the following MERGE statement

explain merge into ttarget using tsource on i<=>j
 when matched THEN
 	UPDATE set flag='updated'
 when not matched THEN
  	INSERT VALUES('new', 1999, 'true');

With this MERGE ONLY ONE row should match in target which should be updated. But currently due to the plan hive generate it will end up matching both rows.

This is because MERGE statement is rewritten into RIGHT OUTER JOIN + FILTER corresponding to all branches.

The part of the plan generated by hive for this statement consist of:

Map 2
            Map Operator Tree:
                TableScan
                  alias: tsource
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Map Join Operator
                    condition map:
                         Right Outer Join 0 to 1
                    keys:
                      0 j (type: int)
                      1 i (type: int)
                    nullSafes: [true]
                    outputColumnNames: _col0, _col1, _col5, _col6
                    input vertices:
                      0 Map 1
                    Statistics: Num rows: 1 Data size: 206 Basic stats: COMPLETE Column stats: NONE
                    HybridGraceHashJoin: true
                    Filter Operator
                      predicate: (_col6 IS NOT DISTINCT FROM _col1) (type: boolean)
                      Statistics: Num rows: 1 Data size: 206 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col5 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>), _col0 (type: string), _col1 (type: int)
                        outputColumnNames: _col0, _col1, _col2
                        Statistics: Num rows: 1 Data size: 206 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                          sort order: +
                          Map-reduce partition columns: UDFToInteger(_col0) (type: int)
                          Statistics: Num rows: 1 Data size: 206 Basic stats: COMPLETE Column stats: NONE
                          value expressions: _col1 (type: string), _col2 (type: int)

Result after JOIN will be :

select s,j,i from ttarget right outer join tsource on i<=>j ;
NULL	NULL	NULL
NULL	NULL	2

On this resultset predicate (_col6 IS NOT DISTINCT FROM _col1) will be true for both resulting into both rows matching.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Vineet Garg

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 19/Mar/19 23:04

Updated:: 19/Mar/19 23:04