Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2457

PERCENT_RANK() returns NaN for row group with 1 row

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Impala 2.3.0
    • Impala 2.3.0
    • None
    • None

    Description

      According to the math behind the PERCENT_RANK() value, a row group containing a single row yields a division by zero error and therefore Impala returns NaN in this case. In this example, there is only 1 'Reptile' row and that line in the result set has a NaN:

      select name, kind, percent_rank() over (partition by kind order by kilos) from animals;
      +-----------------------+----------+--------------------------+
      | name                  | kind     | percent_rank() OVER(...) |
      +-----------------------+----------+--------------------------+
      | Mouse                 | Mammal   | 0                        |
      | Housecat              | Mammal   | 0.2                      |
      | Horse                 | Mammal   | 0.4                      |
      | Polar bear            | Mammal   | 0.6                      |
      | Giraffe               | Mammal   | 0.8                      |
      | Elephant              | Mammal   | 1                        |
      | Komodo dragon         | Reptile  | NaN                      |
      | Owl                   | Bird     | 0                        |
      | California Condor     | Bird     | 0.25                     |
      | Andean Condor         | Bird     | 0.25                     |
      | Condor                | Bird     | 0.25                     |
      | Ostrich               | Bird     | 1                        |
      | Fire-breathing dragon | Mythical | 0                        |
      | Unicorn               | Mythical | 0                        |
      +-----------------------+----------+--------------------------+
      

      Most RDBMS examples on the web use data with multiple rows per group, so they don't illustrate what's supposed to happen in this case. But I suspect the right return value in this case might be zero.

      Cf. http://docs.aws.amazon.com/redshift/latest/dg/r_WF_PERCENT_RANK.html
      "The first row in any set has a PERCENT_RANK of 0."

      That statement suggests the first row (and any tied rows I presume) could be special-cased and no calculation performed.

      Cf. http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions109.htm

      The analytic example doesn't show all of the relevant output, but there are two DEPARTMENT_IDs in the output that have only one relevant row, and in both those cases the PERCENT_RANK result is shown as 0.

      I didn't find any PERCENT_RANK() in MySQL to try. I don't have a PostgreSQL instance handy.

      Attachments

        Activity

          People

            sailesh Sailesh Mukil
            jrussell John Russell
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: