Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 2.3.0
-
None
-
None
Description
According to the math behind the PERCENT_RANK() value, a row group containing a single row yields a division by zero error and therefore Impala returns NaN in this case. In this example, there is only 1 'Reptile' row and that line in the result set has a NaN:
select name, kind, percent_rank() over (partition by kind order by kilos) from animals; +-----------------------+----------+--------------------------+ | name | kind | percent_rank() OVER(...) | +-----------------------+----------+--------------------------+ | Mouse | Mammal | 0 | | Housecat | Mammal | 0.2 | | Horse | Mammal | 0.4 | | Polar bear | Mammal | 0.6 | | Giraffe | Mammal | 0.8 | | Elephant | Mammal | 1 | | Komodo dragon | Reptile | NaN | | Owl | Bird | 0 | | California Condor | Bird | 0.25 | | Andean Condor | Bird | 0.25 | | Condor | Bird | 0.25 | | Ostrich | Bird | 1 | | Fire-breathing dragon | Mythical | 0 | | Unicorn | Mythical | 0 | +-----------------------+----------+--------------------------+
Most RDBMS examples on the web use data with multiple rows per group, so they don't illustrate what's supposed to happen in this case. But I suspect the right return value in this case might be zero.
Cf. http://docs.aws.amazon.com/redshift/latest/dg/r_WF_PERCENT_RANK.html
"The first row in any set has a PERCENT_RANK of 0."
That statement suggests the first row (and any tied rows I presume) could be special-cased and no calculation performed.
Cf. http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions109.htm
The analytic example doesn't show all of the relevant output, but there are two DEPARTMENT_IDs in the output that have only one relevant row, and in both those cases the PERCENT_RANK result is shown as 0.
I didn't find any PERCENT_RANK() in MySQL to try. I don't have a PostgreSQL instance handy.