Hive / HIVE-22993

Include Bloom Filter in Column Statistics to Better Estimate nDV

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: CBO, Statistics
    • Labels: None

    Description

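      When an INSERT adds rows to a table that already contains data, Hive cannot tell whether the new values are already included in the column's number of distinct values (nDV), because the statistics record only the count and not the distinct values themselves.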

      create table test_mm(`id` int, `my_dt` date);
      
      insert into test_mm values (1, "2018-10-01"), (2, "2018-10-01"), (3, "2018-10-01"),
      (4, "2017-10-01"), (5, "2017-10-01"), (6, "2017-10-01"),
      (7, "2010-10-01"), (8, "2010-10-01"), (9, "2010-10-01"),
      (10, "1998-10-01"), (11, "1998-10-01"), (12, "1998-10-01");
      
      DESCRIBE FORMATTED test_mm my_dt;
      -- distinct_count: 4
      
      insert into test_mm values (13, "2030-10-01"), (14, "2030-10-01"), (15, "2030-10-01");
      
      DESCRIBE FORMATTED test_mm my_dt;
      -- distinct_count: 4 (unchanged, although the column now holds 5 distinct dates)
      

      The first INSERT runs against an empty table, so every distinct value in the inserted rows can be counted directly in the statistics. For the second INSERT, however, Hive cannot tell whether "2030-10-01" is already among the counted values, so distinct_count is left unchanged. With a Bloom filter stored in the column statistics, the second INSERT could determine that "2030-10-01" has not been seen before and update distinct_count accordingly. A Bloom filter never returns a false negative, so a "not present" answer is definitive.
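
      The idea can be sketched outside of Hive as follows. This is a minimal illustration, not Hive's implementation: the BloomFilter and ColumnStats classes, the record_insert method, and the sizing defaults are all hypothetical names and values chosen for the example. Each inserted value is checked against the filter, and distinct_count is incremented only when the filter reports the value as definitely new.

```python
import hashlib
import math


class BloomFilter:
    """A small Bloom filter over string values (illustrative only)."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(
            8, math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value: str):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def might_contain(self, value: str) -> bool:
        # False means "definitely never added"; True means "possibly added".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))

    def add(self, value: str) -> None:
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)


class ColumnStats:
    """Hypothetical column statistics: an nDV counter plus a Bloom filter."""

    def __init__(self, expected_items: int = 1000):
        self.distinct_count = 0
        self.seen = BloomFilter(expected_items)

    def record_insert(self, values) -> None:
        # Count a value only when the filter says it has definitely not been seen.
        for value in values:
            if not self.seen.might_contain(value):
                self.seen.add(value)
                self.distinct_count += 1
```

      Feeding the twelve dates from the first INSERT into record_insert and then the three "2030-10-01" rows from the second would let the counter move past 4, which the current statistics cannot do. A false positive can make the filter report a new value as already seen, so a counter maintained this way can undercount but never overcount.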

            People

              Assignee: Unassigned
              Reporter: David Mollitor (belugabehr)
              Votes: 0
              Watchers: 2
