Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-20574

Column statistics give erraneous numDistinct

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Not A Problem
    • 2.3.2
    • None
    • Metastore, Statistics
    • Amazon EMR (BigTop based) from emr-5.9.0 to emr-5.16.0.

    Description

      1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) using some tool (aws cli, hdfs command or anything)
        - S3: s3://www.smartdatahub.io/data/test.parquet
        - HTTP: http://www.smartdatahub.io/data/test.parquet
        - or the attachmen

      eg. with aws cli, wget/curl/distcp can also be used

      {{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet . }}

      {{ hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}hdfs dfs -put test.parquet }}

       hdfs:///tmp/testi_parquet/test.parquet

      2) Create table default.testi_parquet2 on top of that using the schema provided

      CREATE TABLE `default.testi_parquet2`(
      {{   `rakennustu` int, }}
      {{   `kohdenimi` string, }}
      {{   `tekstisuun` int, }}
      {{   `tekstikoko` float, }}
      {{   `tekstifont` string, }}
      {{   `buix_bid` int, }}
      {{   `paivitetty` string, }}
      {{   `datanomist` string, }}
      {{   `geom_geojson` string, }}
      {{   `geom` binary, }}
      {{   `extractdate` string)}}
      {{ ROW FORMAT SERDE }}
      {{   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }}
      {{ STORED AS INPUTFORMAT }}
      {{   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }}
      {{ OUTPUTFORMAT }}
      {{   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}}
      {{ LOCATION}}
      {{   'hdfs:///tmp/testi_parquet/';}}
      {{ {{ – CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE FROM STEP 1}}}}

      3) To collect the values showing you the actual reality of the data: Query the distinct count, min and max of column "tekstisuun"

      {{ SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM default.testi_parquet2; }}

      and note them  (min 0, max 0, distinct 1)
      4) Compute statistics for the table using

      ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;

      5) See erroneous statistics entry for numDistincts: Query the statistics by using "

      DESCRIBE FORMATTED default.testi_parquet2 tekstisuun

      " and note the ERRANEOUS numDistincts value: 2

      Attachments

        Activity

          People

            ajayjadhav Ajay Jadhav
            kivismik Mikko Kivistö
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: