Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
3.1.0, 3.1.2
Description
Vectorized PTF for count and distinct over partition is broken. It produces incorrect results.
Below is the test case.
CREATE TABLE bigd781b_new ( id int, txt1 string, txt2 string, cda_date int, cda_job_name varchar(12)); INSERT INTO bigd781b_new VALUES (1,'2010005759','7164335675012038',20200528,'load1'), (2,'2010005759','7164335675012038',20200528,'load2');
Running below query produces incorrect results
SELECT txt1, txt2, count(distinct txt1) over(partition by txt1) as n, count(distinct txt2) over(partition by txt2) as m FROM bigd781b_new
as below.
+-------------+-------------------+----+----+ | txt1 | txt2 | n | m | +-------------+-------------------+----+----+ | 2010005759 | 7164335675012038 | 2 | 2 | | 2010005759 | 7164335675012038 | 2 | 2 | +-------------+-------------------+----+----+
While the correct output would be
+-------------+-------------------+----+----+ | txt1 | txt2 | n | m | +-------------+-------------------+----+----+ | 2010005759 | 7164335675012038 | 1 | 1 | | 2010005759 | 7164335675012038 | 1 | 1 | +-------------+-------------------+----+----+
The problem does not appear after setting below property
set hive.vectorized.execution.ptf.enabled=false;
Attachments
Issue Links
- links to