[IMPALA-2457] PERCENT_RANK() returns NaN for row group with 1 row - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 2.3.0
Fix Version/s: Impala 2.3.0
Component/s: None
Labels:
None

Description

According to the math behind the PERCENT_RANK() value, a row group containing a single row yields a division by zero error and therefore Impala returns NaN in this case. In this example, there is only 1 'Reptile' row and that line in the result set has a NaN:

select name, kind, percent_rank() over (partition by kind order by kilos) from animals;
+-----------------------+----------+--------------------------+
| name                  | kind     | percent_rank() OVER(...) |
+-----------------------+----------+--------------------------+
| Mouse                 | Mammal   | 0                        |
| Housecat              | Mammal   | 0.2                      |
| Horse                 | Mammal   | 0.4                      |
| Polar bear            | Mammal   | 0.6                      |
| Giraffe               | Mammal   | 0.8                      |
| Elephant              | Mammal   | 1                        |
| Komodo dragon         | Reptile  | NaN                      |
| Owl                   | Bird     | 0                        |
| California Condor     | Bird     | 0.25                     |
| Andean Condor         | Bird     | 0.25                     |
| Condor                | Bird     | 0.25                     |
| Ostrich               | Bird     | 1                        |
| Fire-breathing dragon | Mythical | 0                        |
| Unicorn               | Mythical | 0                        |
+-----------------------+----------+--------------------------+

Most RDBMS examples on the web use data with multiple rows per group, so they don't illustrate what's supposed to happen in this case. But I suspect the right return value in this case might be zero.

Cf. http://docs.aws.amazon.com/redshift/latest/dg/r_WF_PERCENT_RANK.html
"The first row in any set has a PERCENT_RANK of 0."

That statement suggests the first row (and any tied rows I presume) could be special-cased and no calculation performed.

Cf. http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions109.htm

The analytic example doesn't show all of the relevant output, but there are two DEPARTMENT_IDs in the output that have only one relevant row, and in both those cases the PERCENT_RANK result is shown as 0.

I didn't find any PERCENT_RANK() in MySQL to try. I don't have a PostgreSQL instance handy.

Attachments

Activity

People

Assignee:: Sailesh Mukil

Reporter:: John Russell

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 01/Oct/15 00:21

Updated:: 07/Oct/15 00:07

Resolved:: 07/Oct/15 00:07