Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-21930

WINDOW COUNT DISTINCT return wrong value with PARTITION BY

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.1.0
    • None
    • PTF-Windowing
    • Beeline version 3.1.0.3.0.1.0-187 by Apache Hive

    Description

      count(distinct a) over (partiton by b) return wring result. For example (T is CTE here):

      select p, day, ts
      , row_number() OVER (PARTITION BY phone ORDER BY ts ASC) as line_number
      , count(1) OVER (PARTITION BY phone ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lines
      , count(distinct day) OVER (PARTITION BY phone ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as days
      FROM T

       WINDOW specification doesn't affect on results: same wrong with and without window.

      count(1) and count(distinct day) return the same result. Count distinct is wrong.

       

      I've add size(collect_set(day) OVER (PARTITION BY phone)) as days2 and count(distinct return correct result.

      Following query return non-empty result:

      select A.*, B.days, B. from (
      select p, day, ts
      , row_number() OVER (PARTITION BY phone ORDER BY ts ASC) as line_number
      , count(1) OVER (PARTITION BY p ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lines
      , count(distinct day) OVER (PARTITION BY phone ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as days
      , size(collect_set(day) OVER (PARTITION BY phone)) as days2
      , dense_rank() over (partition by phone order by day) + dense_rank() over (partition by phone order by day desc) - 1 as days3
      FROM T ) as A 
      join (
      select p, day, ts
      , row_number() OVER (PARTITION BY phone ORDER BY ts ASC) as line_number
      , count(1) OVER (PARTITION BY phone ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lines
      , count(distinct day) OVER (PARTITION BY phone ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as days
      FROM T
      ) as B on A.p=B.p and A.line_number=B.line_number
      where A.days!=B.days
      order by A.p, A.line_number
      

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            Selitsky Igor
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: