Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-10841

[WHERE col is not null] does not work sometimes for queries with many JOIN statements

Log workAgile BoardRank to TopRank to BottomBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      The result from the following SELECT query is 3 rows but it should be 1 row.
      I checked it in MySQL - it returned 1 row.

      To reproduce the issue in Hive
      1. prepare tables

      drop table if exists L;
      drop table if exists LA;
      drop table if exists FR;
      drop table if exists A;
      drop table if exists PI;
      drop table if exists acct;
      
      create table L as select 4436 id;
      create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
      create table FR as select 4436 loan_id;
      create table A as select 4748 id;
      create table PI as select 4415 id;
      
      create table acct as select 4748 aid, 10 acc_n, 122 brn;
      insert into table acct values(4748, null, null);
      insert into table acct values(4748, null, null);
      

      2. run SELECT query

      select
        acct.ACC_N,
        acct.brn
      FROM L
      JOIN LA ON L.id = LA.loan_id
      JOIN FR ON L.id = FR.loan_id
      JOIN A ON LA.aid = A.id
      JOIN PI ON PI.id = LA.pi_id
      JOIN acct ON A.id = acct.aid
      WHERE
        L.id = 4436
        and acct.brn is not null;
      

      the result is 3 rows

      10	122
      NULL	NULL
      NULL	NULL
      

      but it should be 1 row

      10	122
      

      2.1 "explain select ..." output for hive-1.3.0 MR

      STAGE DEPENDENCIES:
        Stage-12 is a root stage
        Stage-9 depends on stages: Stage-12
        Stage-0 depends on stages: Stage-9
      
      STAGE PLANS:
        Stage: Stage-12
          Map Reduce Local Work
            Alias -> Map Local Tables:
              a 
                Fetch Operator
                  limit: -1
              acct 
                Fetch Operator
                  limit: -1
              fr 
                Fetch Operator
                  limit: -1
              l 
                Fetch Operator
                  limit: -1
              pi 
                Fetch Operator
                  limit: -1
            Alias -> Map Local Operator Tree:
              a 
                TableScan
                  alias: a
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      keys:
                        0 _col5 (type: int)
                        1 id (type: int)
                        2 aid (type: int)
              acct 
                TableScan
                  alias: acct
                  Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: aid is not null (type: boolean)
                    Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      keys:
                        0 _col5 (type: int)
                        1 id (type: int)
                        2 aid (type: int)
              fr 
                TableScan
                  alias: fr
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (loan_id = 4436) (type: boolean)
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      keys:
                        0 4436 (type: int)
                        1 4436 (type: int)
                        2 4436 (type: int)
              l 
                TableScan
                  alias: l
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (id = 4436) (type: boolean)
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      keys:
                        0 4436 (type: int)
                        1 4436 (type: int)
                        2 4436 (type: int)
              pi 
                TableScan
                  alias: pi
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      keys:
                        0 _col6 (type: int)
                        1 id (type: int)
      
        Stage: Stage-9
          Map Reduce
            Map Operator Tree:
                TableScan
                  alias: la
                  Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (((loan_id is not null and aid is not null) and pi_id is not null) and (loan_id = 4436)) (type: boolean)
                    Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                           Inner Join 0 to 2
                      keys:
                        0 4436 (type: int)
                        1 4436 (type: int)
                        2 4436 (type: int)
                      outputColumnNames: _col5, _col6
                      Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                             Inner Join 1 to 2
                        keys:
                          0 _col5 (type: int)
                          1 id (type: int)
                          2 aid (type: int)
                        outputColumnNames: _col6, _col19, _col20
                        Statistics: Num rows: 4 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          keys:
                            0 _col6 (type: int)
                            1 id (type: int)
                          outputColumnNames: _col19, _col20
                          Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: _col19 (type: int), _col20 (type: int)
                            outputColumnNames: _col0, _col1
                            Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column stats: NONE
                            File Output Operator
                              compressed: false
                              Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column stats: NONE
                              table:
                                  input format: org.apache.hadoop.mapred.TextInputFormat
                                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
            Processor Tree:
              ListSink
      
      Time taken: 0.57 seconds, Fetched: 142 row(s)
      

      2.2. "explain select..." output for hive-0.13.1 Tez

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 is a root stage
      
      STAGE PLANS:
        Stage: Stage-1
          Tez
            Edges:
              Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE)
              Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Map 9 (SIMPLE_EDGE)
              Reducer 6 <- Map 5 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE), Map 8 (SIMPLE_EDGE)
            DagName: lcapp_20150528111717_06c57a5b-8dc6-4ce9-bce7-b9e0a7818fe4:1
            Vertices:
              Map 1 
                  Map Operator Tree:
                      TableScan
                        alias: acct
                        Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: aid (type: int)
                          sort order: +
                          Map-reduce partition columns: aid (type: int)
                          Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                          value expressions: acc_n (type: int), brn (type: int)
              Map 4 
                  Map Operator Tree:
                      TableScan
                        alias: a
                        Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: id (type: int)
                          sort order: +
                          Map-reduce partition columns: id (type: int)
                          Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
              Map 5 
                  Map Operator Tree:
                      TableScan
                        alias: la
                        Statistics: Num rows: 28 Data size: 347 Basic stats: COMPLETE Column stats: NONE
                        Filter Operator
                          predicate: (loan_id = 4436) (type: boolean)
                          Statistics: Num rows: 14 Data size: 173 Basic stats: COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: loan_id (type: int)
                            sort order: +
                            Map-reduce partition columns: loan_id (type: int)
                            Statistics: Num rows: 14 Data size: 173 Basic stats: COMPLETE Column stats: NONE
                            value expressions: aid (type: int), pi_id (type: int)
              Map 7 
                  Map Operator Tree:
                      TableScan
                        alias: fr
                        Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
                        Filter Operator
                          predicate: (loan_id = 4436) (type: boolean)
                          Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: loan_id (type: int)
                            sort order: +
                            Map-reduce partition columns: loan_id (type: int)
                            Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Map 8 
                  Map Operator Tree:
                      TableScan
                        alias: l
                        Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
                        Filter Operator
                          predicate: (id = 4436) (type: boolean)
                          Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: id (type: int)
                            sort order: +
                            Map-reduce partition columns: id (type: int)
                            Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Map 9 
                  Map Operator Tree:
                      TableScan
                        alias: pi
                        Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: id (type: int)
                          sort order: +
                          Map-reduce partition columns: id (type: int)
                          Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
              Reducer 2 
                  Reduce Operator Tree:
                    Join Operator
                      condition map:
                           Inner Join 0 to 1
                           Inner Join 1 to 2
                      condition expressions:
                        0 {VALUE._col2}
                        1 
                        2 {VALUE._col1} {VALUE._col2}
                      outputColumnNames: _col2, _col15, _col16
                      Statistics: Num rows: 110 Data size: 448 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col2 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col2 (type: int)
                        Statistics: Num rows: 110 Data size: 448 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col15 (type: int), _col16 (type: int)
              Reducer 3 
                  Reduce Operator Tree:
                    Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {VALUE._col1} {VALUE._col2}
                        1 
                      outputColumnNames: _col1, _col2
                      Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col1 (type: int), _col2 (type: int)
                        outputColumnNames: _col0, _col1
                        Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              Reducer 6 
                  Reduce Operator Tree:
                    Join Operator
                      condition map:
                           Inner Join 0 to 1
                           Inner Join 0 to 2
                      condition expressions:
                        0 
                        1 {VALUE._col1} {VALUE._col2}
                        2 
                      outputColumnNames: _col4, _col5
                      Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col4 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col4 (type: int)
                        Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col5 (type: int)
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
      
      Time taken: 1.377 seconds, Fetched: 146 row(s)
      

      3. The workaround is to put "acct.brn is not null" to join condition

      select
        acct.ACC_N,
        acct.brn
      FROM L
      JOIN LA ON L.id = LA.loan_id
      JOIN FR ON L.id = FR.loan_id
      JOIN A ON LA.aid = A.id
      JOIN PI ON PI.id = LA.pi_id
      JOIN acct ON A.id = acct.aid and acct.brn is not null
      WHERE
        L.id = 4436;
      
      OK
      10	122
      Time taken: 23.479 seconds, Fetched: 1 row(s)
      

      I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations have the issue

      Attachments

        1. HIVE-10841.03.patch
          16 kB
          jcamachorodriguez
        2. HIVE-10841.1.patch
          3 kB
          Laljo John Pullokkaran
        3. HIVE-10841.2.patch
          2 kB
          Laljo John Pullokkaran
        4. HIVE-10841.patch
          0.7 kB
          Laljo John Pullokkaran

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            jpullokkaran Laljo John Pullokkaran Assign to me
            apivovarov Alexander Pivovarov
            Votes:
            0 Vote for this issue
            Watchers:
            12 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment