HIVE-24579: Incorrect Result For Groupby With Limit



    Description

      create table test(id int);
      explain extended select id,count(*) from test group by id limit 10;
      

      There is an unexpected TopN on the map-side Reduce Output Operator, which causes incorrect results: the top-N hash can drop rows for a grouping key before the reduce-side aggregation merges its partial counts, so the counts returned under the limit can be too low.

      STAGE PLANS:
        Stage: Stage-1
          Tez
            DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
            Edges:
              Reducer 2 <- Map 1 (SIMPLE_EDGE)
            DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
            Vertices:
              Map 1 
                  Map Operator Tree:
                      TableScan
                        alias: test
                        Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                        GatherStats: false
                        Select Operator
                          expressions: id (type: int)
                          outputColumnNames: id
                          Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                          Group By Operator
                            aggregations: count()
                            keys: id (type: int)
                            mode: hash
                            outputColumnNames: _col0, _col1
                            Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              null sort order: a
                              sort order: +
                              Map-reduce partition columns: _col0 (type: int)
                              Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                              tag: -1
                              TopN: 10
                              TopN Hash Memory Usage: 0.1
                              value expressions: _col1 (type: bigint)
                              auto parallelism: true
                  Execution mode: vectorized
                  Path -> Alias:
                    file:/user/hive/warehouse/test [test]
                  Path -> Partition:
                    file:/user/hive/warehouse/test 
                      Partition
                        base file name: test
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        properties:
                          COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
                          bucket_count -1
                          bucketing_version 2
                          column.name.delimiter ,
                          columns id
                          columns.comments 
                          columns.types int
                          file.inputformat org.apache.hadoop.mapred.TextInputFormat
                          file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          location file:/user/hive/warehouse/test
                          name default.test
                          numFiles 0
                          numRows 0
                          rawDataSize 0
                          serialization.ddl struct test { i32 id}
                          serialization.format 1
                          serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          totalSize 0
                          transient_lastDdlTime 1609730190
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          properties:
                            COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
                            bucket_count -1
                            bucketing_version 2
                            column.name.delimiter ,
                            columns id
                            columns.comments 
                            columns.types int
                            file.inputformat org.apache.hadoop.mapred.TextInputFormat
                            file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                            location file:/user/hive/warehouse/test
                            name default.test
                            numFiles 0
                            numRows 0
                            rawDataSize 0
                            serialization.ddl struct test { i32 id}
                            serialization.format 1
                            serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                            totalSize 0
                            transient_lastDdlTime 1609730190
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          name: default.test
                        name: default.test
                  Truncated Path -> Alias:
                    /test [test]
              Reducer 2 
                  Execution mode: vectorized
                  Needs Tagging: false
                  Reduce Operator Tree:
                    Group By Operator
                      aggregations: count(VALUE._col0)
                      keys: KEY._col0 (type: int)
                      mode: mergepartial
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                      Limit
                        Number of rows: 10
                        Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          GlobalTableId: 0
                          directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002
                          NumFilesPerFileSink: 1
                          Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                          Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002/
                          table:
                              input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                              properties:
                                columns _col0,_col1
                                columns.types int:bigint
                                escape.delim \
                                hive.serialization.extend.additional.nesting.levels true
                                serialization.escape.crlf true
                                serialization.format 1
                                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          TotalFiles: 1
                          GatherStats: false
                          MultiFileSpray: false
      
        Stage: Stage-0
          Fetch Operator
            limit: 10
            Processor Tree:
              ListSink
      
      Time taken: 0.102 seconds, Fetched: 143 row(s)
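
      The following is a minimal sketch (data values invented for illustration, not taken from this report) of how the map-side TopN can surface as a wrong count, plus a possible workaround. It assumes the "TopN: 10" shown on the Reduce Output Operator above comes from the limit-pushdown optimization controlled by hive.limit.pushdown.memory.usage:

      -- Hypothetical data; whether the undercount actually reproduces depends on the
      -- number of mappers and map-side hash flushes.
      insert into test values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(1),(12);

      -- With the plan above, a grouping key can be dropped by the TopN hash in one
      -- mapper but kept in another, so count(*) for that key can come back too low.
      select id, count(*) from test group by id limit 10;

      -- Possible workaround (assumption: a non-positive value disables limit pushdown):
      set hive.limit.pushdown.memory.usage=0.0;

      -- Re-running explain should no longer show "TopN: 10" on the Reduce Output Operator.
      explain extended select id, count(*) from test group by id limit 10;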
      
      

       


People

    Assignee: Krisztian Kasa (kkasa)
    Reporter: Nemon Lou (nemon)

Time Tracking

    Original Estimate: Not Specified
    Remaining Estimate: 0h
    Time Spent: 1h 40m