[HIVE-20574] Column statistics give erraneous numDistinct - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.3.2
Fix Version/s: None
Component/s: Metastore, Statistics
Labels:
- Statistics
- statsCollection
Environment:

Amazon EMR (BigTop based) from emr-5.9.0 to emr-5.16.0.

Description

1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) using some tool (aws cli, hdfs command or anything)
- S3: s3://www.smartdatahub.io/data/test.parquet
- HTTP: http://www.smartdatahub.io/data/test.parquet
- or the attachmen

eg. with aws cli, wget/curl/distcp can also be used

{{ hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}hdfs dfs -put test.parquet }}

hdfs:///tmp/testi_parquet/test.parquet

2) Create table default.testi_parquet2 on top of that using the schema provided

CREATE TABLE `default.testi_parquet2`(
{{ `rakennustu` int, }}
{{ `kohdenimi` string, }}
{{ `tekstisuun` int, }}
{{ `tekstikoko` float, }}
{{ `tekstifont` string, }}
{{ `buix_bid` int, }}
{{ `paivitetty` string, }}
{{ `datanomist` string, }}
{{ `geom_geojson` string, }}
{{ `geom` binary, }}
{{ `extractdate` string)}}
{{ ROW FORMAT SERDE }}
{{ 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }}
{{ STORED AS INPUTFORMAT }}
{{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }}
{{ OUTPUTFORMAT }}
{{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}}
{{ LOCATION}}
{{ 'hdfs:///tmp/testi_parquet/';}}
{{ {{ – CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE FROM STEP 1}}}}

3) To collect the values showing you the actual reality of the data: Query the distinct count, min and max of column "tekstisuun"

{{ SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM default.testi_parquet2; }}

and note them (min 0, max 0, distinct 1)
4) Compute statistics for the table using

ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;

5) See erroneous statistics entry for numDistincts: Query the statistics by using "

DESCRIBE FORMATTED default.testi_parquet2 tekstisuun

" and note the ERRANEOUS numDistincts value: 2

Attachments

Activity

People

Assignee:: Ajay Jadhav

Reporter:: Mikko Kivistö

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 17/Sep/18 08:25

Updated:: 24/Sep/18 17:12

Resolved:: 24/Sep/18 17:12