Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-21481

MERGE correctness issues with null safe equality

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Query Planning
    • None

    Description

      The way Hive currently generates plan for MERGE statement can lead to wrong results with null safe equality.

      To illustrate consider the following reproducer

      create table ttarget(s string, j int, flag string) stored as orc tblproperties("transactional"="true");
      truncate table ttarget;
      insert into ttarget values('not_null', 1, 'dont udpate'), (null,2, 'update');
      
      create table tsource (i int);
      insert into tsource values(null),(2);
      

      Let's say you have the following MERGE statement

      explain merge into ttarget using tsource on i<=>j
       when matched THEN
       	UPDATE set flag='updated'
       when not matched THEN
        	INSERT VALUES('new', 1999, 'true');
      

      With this MERGE ONLY ONE row should match in target which should be updated. But currently due to the plan hive generate it will end up matching both rows.

      This is because MERGE statement is rewritten into RIGHT OUTER JOIN + FILTER corresponding to all branches.

      The part of the plan generated by hive for this statement consist of:

      Map 2
                  Map Operator Tree:
                      TableScan
                        alias: tsource
                        Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                        Map Join Operator
                          condition map:
                               Right Outer Join 0 to 1
                          keys:
                            0 j (type: int)
                            1 i (type: int)
                          nullSafes: [true]
                          outputColumnNames: _col0, _col1, _col5, _col6
                          input vertices:
                            0 Map 1
                          Statistics: Num rows: 1 Data size: 206 Basic stats: COMPLETE Column stats: NONE
                          HybridGraceHashJoin: true
                          Filter Operator
                            predicate: (_col6 IS NOT DISTINCT FROM _col1) (type: boolean)
                            Statistics: Num rows: 1 Data size: 206 Basic stats: COMPLETE Column stats: NONE
                            Select Operator
                              expressions: _col5 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>), _col0 (type: string), _col1 (type: int)
                              outputColumnNames: _col0, _col1, _col2
                              Statistics: Num rows: 1 Data size: 206 Basic stats: COMPLETE Column stats: NONE
                              Reduce Output Operator
                                key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                                sort order: +
                                Map-reduce partition columns: UDFToInteger(_col0) (type: int)
                                Statistics: Num rows: 1 Data size: 206 Basic stats: COMPLETE Column stats: NONE
                                value expressions: _col1 (type: string), _col2 (type: int)
      

      Result after JOIN will be :

      select s,j,i from ttarget right outer join tsource on i<=>j ;
      NULL	NULL	NULL
      NULL	NULL	2
      

      On this resultset predicate (_col6 IS NOT DISTINCT FROM _col1) will be true for both resulting into both rows matching.

      Attachments

        Activity

          People

            Unassigned Unassigned
            vgarg Vineet Garg
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: