Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-24245

Vectorized PTF with count and distinct over partition producing incorrect results.

    XMLWordPrintableJSON

Details

    Description

      Vectorized PTF for count and distinct over partition is broken. It produces incorrect results.
      Below is the test case.

      CREATE TABLE bigd781b_new (
        id int,
        txt1 string,
        txt2 string,
        cda_date int,
        cda_job_name varchar(12));
      
      INSERT INTO bigd781b_new VALUES 
        (1,'2010005759','7164335675012038',20200528,'load1'),
        (2,'2010005759','7164335675012038',20200528,'load2');
      

      Running below query produces incorrect results

      SELECT
          txt1,
          txt2,
          count(distinct txt1) over(partition by txt1) as n,
          count(distinct txt2) over(partition by txt2) as m
      FROM bigd781b_new
      

      as below.

      +-------------+-------------------+----+----+
      |    txt1     |       txt2        | n  | m  |
      +-------------+-------------------+----+----+
      | 2010005759  | 7164335675012038  | 2  | 2  |
      | 2010005759  | 7164335675012038  | 2  | 2  |
      +-------------+-------------------+----+----+
      

      While the correct output would be

      +-------------+-------------------+----+----+
      |    txt1     |       txt2        | n  | m  |
      +-------------+-------------------+----+----+
      | 2010005759  | 7164335675012038  | 1  | 1  |
      | 2010005759  | 7164335675012038  | 1  | 1  |
      +-------------+-------------------+----+----+
      

      The problem does not appear after setting below property
      set hive.vectorized.execution.ptf.enabled=false;

      Attachments

        Issue Links

          Activity

            People

              abstractdog László Bodor
              chiran54321 Chiran Ravani
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h
                  2h