Hive / HIVE-20260

NDV of a column shouldn't be scaled when row count is changed by filter on another column

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0.0-alpha-1
    • Component/s: Statistics
    • Labels: None

    Description

      HIVE-17465 introduced progressive scaling of row counts in the presence of multiple filters. HIVE-19500 improved on that by also scaling column stats (NDV) in such scenarios. However, the scaling should take into account which column each filter expression uses, rather than being applied for every filter. For example,
      given the filter a = 1 and b = 2, the NDV of column b should not be scaled down by the row count change caused by a = 1.
      Another way to say this is that the NDV of a particular column should be updated at the end of the row count computation for that operator.
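
The difference between the two behaviours can be illustrated with a toy model. The sketch below is not Hive's implementation: the function name `estimate`, its parameters, the uniform 1/NDV selectivity for equality filters, and the statistics used are all assumptions made for illustration.

```python
def estimate(row_count, ndv, filter_columns, scale_between_filters):
    """Toy estimator for a conjunction of `column = constant` filters.

    row_count: input rows; ndv: dict of column -> distinct-value count;
    filter_columns: one column name per equality conjunct, in order.
    Assumes a uniform selectivity of 1 / NDV for each equality filter.
    Returns (estimated_rows, updated_ndv).
    """
    ndv = {col: float(n) for col, n in ndv.items()}  # work on a copy
    rows = float(row_count)
    for col in filter_columns:
        new_rows = rows / ndv[col]
        if scale_between_filters:
            # Problematic variant: every other column's NDV shrinks with
            # the row count, even though this filter never touched it.
            ratio = new_rows / rows
            for other in ndv:
                if other != col:
                    ndv[other] = max(1.0, ndv[other] * ratio)
        ndv[col] = 1.0  # an equality filter leaves a single value
        rows = new_rows
    # Update at the end of the operator: an NDV cannot exceed the row
    # count, and never drops below one.
    ndv = {col: max(1.0, min(n, rows)) for col, n in ndv.items()}
    return rows, ndv


# Hypothetical statistics for the filter `a = 1 and b = 2`.
stats = {"a": 10, "b": 20, "c": 50}
scaled_rows, _ = estimate(1000, stats, ["a", "b"], scale_between_filters=True)
deferred_rows, deferred_ndv = estimate(1000, stats, ["a", "b"], scale_between_filters=False)
```

With these made-up numbers, scaling between filters yields about 50 rows, because the shrunken NDV of b makes `b = 2` look far less selective than it is. Deferring the NDV update to the end of the operator yields 5 rows, which is 1000 / (10 * 20).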

      Here are the possible cases where our estimates can be accurate (or close to it):

      case 1 - (d_year = 2001 and d_moy = 1)
      case 2 - (d_year = 2001 and d_year IN (2001, 2002))
      case 3 - (d_year = 2001 and d_moy = 1 and d_dom = 1)
      case 4 - (d_date IN ('1999-01-02', '1999-01-02'))
      case 5 - (d_date = '1999-01-01')
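
One possible reading of these cases, sketched under stated assumptions: conjuncts on the same column (case 2) restrict that column once rather than twice, and a repeated literal in an IN list (case 4) counts once. The function `conjunction_selectivity`, the uniform-distribution and column-independence assumptions, and the NDV values are hypothetical and are not taken from the Hive code base.

```python
def conjunction_selectivity(ndv, conjuncts):
    """Selectivity of AND-ed equality / IN predicates, one factor per column.

    ndv: dict of column -> distinct-value count.
    conjuncts: list of (column, values); `col = v` is written (col, [v]).
    Assumes uniformly distributed values and independent columns.
    """
    allowed = {}
    for col, values in conjuncts:
        vals = set(values)  # a repeated IN-list literal counts once
        # Conjuncts on the same column intersect their value sets.
        allowed[col] = allowed[col] & vals if col in allowed else vals
    selectivity = 1.0
    for col, vals in allowed.items():
        selectivity *= min(1.0, len(vals) / ndv[col])
    return selectivity


# Hypothetical NDVs, chosen only for illustration.
ndv = {"d_year": 200, "d_moy": 12, "d_dom": 31, "d_date": 73000}
case1 = conjunction_selectivity(ndv, [("d_year", [2001]), ("d_moy", [1])])
case2 = conjunction_selectivity(ndv, [("d_year", [2001]), ("d_year", [2001, 2002])])
case4 = conjunction_selectivity(ndv, [("d_date", ["1999-01-02", "1999-01-02"])])
```

In this model case 2 has the same selectivity as `d_year = 2001` alone, and case 4 has the same selectivity as a single equality on d_date.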
      

      Attachments

        1. HIVE-20260.01wip03.patch
          90 kB
          Zoltan Haindrich
        2. HIVE-20260.01wip02.patch
          87 kB
          Zoltan Haindrich
        3. HIVE-20260.01wip01.patch
          66 kB
          Zoltan Haindrich
        4. HIVE-20260.01.patch
          443 kB
          Zoltan Haindrich
        5. HIVE-20260.01.patch
          443 kB
          Zoltan Haindrich

            People

              Assignee: kgyrtkirk Zoltan Haindrich
              Reporter: ashutoshc Ashutosh Chauhan
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved: