HIVE-24579: Incorrect Result For Groupby With Limit



    Description

      create table test(id int);
      explain extended select id,count(*) from test group by id limit 10;
      

      There is an unexpected TopN on the map-side Reduce Output Operator, which causes incorrect results: the top-N hash can drop rows for a grouping key before the reduce-side aggregation merges its partial counts, so the counts returned under the limit can be too low.

      STAGE PLANS:
        Stage: Stage-1
          Tez
            DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
            Edges:
              Reducer 2 <- Map 1 (SIMPLE_EDGE)
            DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
            Vertices:
              Map 1 
                  Map Operator Tree:
                      TableScan
                        alias: test
                        Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                        GatherStats: false
                        Select Operator
                          expressions: id (type: int)
                          outputColumnNames: id
                          Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                          Group By Operator
                            aggregations: count()
                            keys: id (type: int)
                            mode: hash
                            outputColumnNames: _col0, _col1
                            Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              null sort order: a
                              sort order: +
                              Map-reduce partition columns: _col0 (type: int)
                              Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                              tag: -1
                              TopN: 10
                              TopN Hash Memory Usage: 0.1
                              value expressions: _col1 (type: bigint)
                              auto parallelism: true
                  Execution mode: vectorized
                  Path -> Alias:
                    file:/user/hive/warehouse/test [test]
                  Path -> Partition:
                    file:/user/hive/warehouse/test 
                      Partition
                        base file name: test
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        properties:
                          COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
                          bucket_count -1
                          bucketing_version 2
                          column.name.delimiter ,
                          columns id
                          columns.comments 
                          columns.types int
                          file.inputformat org.apache.hadoop.mapred.TextInputFormat
                          file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          location file:/user/hive/warehouse/test
                          name default.test
                          numFiles 0
                          numRows 0
                          rawDataSize 0
                          serialization.ddl struct test { i32 id}
                          serialization.format 1
                          serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          totalSize 0
                          transient_lastDdlTime 1609730190
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          properties:
                            COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
                            bucket_count -1
                            bucketing_version 2
                            column.name.delimiter ,
                            columns id
                            columns.comments 
                            columns.types int
                            file.inputformat org.apache.hadoop.mapred.TextInputFormat
                            file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                            location file:/user/hive/warehouse/test
                            name default.test
                            numFiles 0
                            numRows 0
                            rawDataSize 0
                            serialization.ddl struct test { i32 id}
                            serialization.format 1
                            serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                            totalSize 0
                            transient_lastDdlTime 1609730190
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          name: default.test
                        name: default.test
                  Truncated Path -> Alias:
                    /test [test]
              Reducer 2 
                  Execution mode: vectorized
                  Needs Tagging: false
                  Reduce Operator Tree:
                    Group By Operator
                      aggregations: count(VALUE._col0)
                      keys: KEY._col0 (type: int)
                      mode: mergepartial
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                      Limit
                        Number of rows: 10
                        Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          GlobalTableId: 0
                          directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002
                          NumFilesPerFileSink: 1
                          Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                          Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002/
                          table:
                              input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                              properties:
                                columns _col0,_col1
                                columns.types int:bigint
                                escape.delim \
                                hive.serialization.extend.additional.nesting.levels true
                                serialization.escape.crlf true
                                serialization.format 1
                                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          TotalFiles: 1
                          GatherStats: false
                          MultiFileSpray: false
      
        Stage: Stage-0
          Fetch Operator
            limit: 10
            Processor Tree:
              ListSink
      
      Time taken: 0.102 seconds, Fetched: 143 row(s)
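
      The following is a minimal sketch (data values invented for illustration, not taken from this report) of how the map-side TopN can surface as a wrong count, plus a possible workaround. It assumes the "TopN: 10" shown on the Reduce Output Operator above comes from the limit-pushdown optimization controlled by hive.limit.pushdown.memory.usage:

      -- Hypothetical data; whether the undercount actually reproduces depends on the
      -- number of mappers and map-side hash flushes.
      insert into test values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(1),(12);

      -- With the plan above, a grouping key can be dropped by the TopN hash in one
      -- mapper but kept in another, so count(*) for that key can come back too low.
      select id, count(*) from test group by id limit 10;

      -- Possible workaround (assumption: a non-positive value disables limit pushdown):
      set hive.limit.pushdown.memory.usage=0.0;

      -- Re-running explain should no longer show "TopN: 10" on the Reduce Output Operator.
      explain extended select id, count(*) from test group by id limit 10;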
      
      

       


People

    Assignee: Krisztian Kasa (kkasa)
    Reporter: Nemon Lou (nemon)

Time Tracking

    Original Estimate: Not Specified
    Remaining Estimate: 0h
    Time Spent: 1h 40m