Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-12465

Hive might produce wrong results when (outer) joins are merged

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Blocker
    • Resolution: Fixed
    • 1.3.0, 2.0.0
    • 1.3.0, 2.0.0
    • None
    • None

    Description

      Consider the following query:

      select * from
        (select * from tab where tab.key = 0)a
      full outer join
        (select * from tab_part where tab_part.key = 98)b
      join
        tab_part c
      on a.key = b.key and b.key = c.key;
      

      Hive should execute the full outer join operation (without ON clause) and then the join operation (ON a.key = b.key and b.key = c.key). Instead, it merges both joins, generating the following plan:

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-1
          Map Reduce
            Map Operator Tree:
                TableScan
                  alias: tab
                  filterExpr: (key = 0) (type: boolean)
                  Statistics: Num rows: 242 Data size: 22748 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (key = 0) (type: boolean)
                    Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: 0 (type: int), value (type: string), ds (type: string)
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: string), _col2 (type: string)
                TableScan
                  alias: tab_part
                  filterExpr: (key = 98) (type: boolean)
                  Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (key = 98) (type: boolean)
                    Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: 98 (type: int), value (type: string), ds (type: string)
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: string), _col2 (type: string)
                TableScan
                  alias: c
                  Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: key (type: int)
                    sort order: +
                    Map-reduce partition columns: key (type: int)
                    Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats: NONE
                    value expressions: value (type: string), ds (type: string)
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Outer Join 0 to 1
                     Inner Join 1 to 2
                keys:
                  0 _col0 (type: int)
                  1 _col0 (type: int)
                  2 key (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
                Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
            Processor Tree:
              ListSink
      

      That plan is equivalent to the following query, which is different than the original one:

      select * from
        (select * from tab where tab.key = 0)a
      full outer join
        (select * from tab_part where tab_part.key = 98)b
      on a.key = b.key
      join
        tab_part c
      on b.key = c.key;
      

      It seems to be a problem in the recognition of join operations that can be merged into a single multijoin operator.

      Attachments

        1. HIVE-12465.01.patch
          30 kB
          jcamachorodriguez
        2. HIVE-12465.02.patch
          59 kB
          jcamachorodriguez
        3. HIVE-12465.patch
          2 kB
          jcamachorodriguez
        4. HIVE-12465-branch1.patch
          3 kB
          jcamachorodriguez

        Issue Links

          Activity

            People

              jcamacho Jesús Camacho Rodríguez
              jcamacho Jesús Camacho Rodríguez
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: