[SPARK-22536] VectorizedParquetRecordReader doesn't use Parquet's dictionary filtering feature - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed
- filter2
- parquet
- predicate
- pushdown
Environment:

Spark 2.2.0

Description

The VectorizedParquetRecordReader currently uses only statistics filtering and does not make use of Parquet's dictionary filtering. Dictionary filtering would be very useful for string/binary columns that have low cardinality.

Some relevant code paths:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L367-L387 When vectorizedReader is enabled, the code will use VectorizedParquetRecordReader, which uses SpecificParquetRecordReaderBase below
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L109 This is where the row group filtering is being performed. It calls the method below
https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java#L64-L70

The RowGroupFilter constructor reached from Spark's VectorizedParquetRecordReader is deprecated, and it hard-codes the filter levels to FilterLevel.STATISTICS only.

  @Deprecated
  private RowGroupFilter(List<BlockMetaData> blocks, MessageType schema) {
    this.blocks = checkNotNull(blocks, "blocks");
    this.schema = checkNotNull(schema, "schema");
    this.levels = Collections.singletonList(FilterLevel.STATISTICS);
    this.reader = null;
  }

Compare this to org.apache.parquet.hadoop.ParquetRecordReader.initialize(), which uses the second RowGroupFilter constructor that allows it to set the FilterLevel. Relevant code here:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordReader.java#L166-L182
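
To illustrate why the choice of filter level matters, here is a minimal, self-contained sketch. It is a toy model, not the parquet-mr API: the RowGroupPruningSketch class, its Level enum, the RowGroup type, and the mightMatch method are all hypothetical names made up for this example. It assumes a single string column with an equality predicate and a column chunk that is fully dictionary-encoded.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.Set;

// Toy model only: these types are hypothetical and are NOT the parquet-mr API.
class RowGroupPruningSketch {

  // Stand-in for the idea of a filter level; chosen by the caller, not hard-coded.
  enum Level { STATISTICS, DICTIONARY }

  // Stand-in for one row group's metadata for a single string column.
  static final class RowGroup {
    final String min;
    final String max;
    final Set<String> dictionary; // distinct values of the column chunk

    RowGroup(String min, String max, String... dictionary) {
      this.min = min;
      this.max = max;
      this.dictionary = new HashSet<>(Arrays.asList(dictionary));
    }
  }

  // Returns true if the row group may contain rows where column == value,
  // consulting only the levels the caller enabled.
  static boolean mightMatch(RowGroup rg, String value, Set<Level> levels) {
    if (levels.contains(Level.STATISTICS)
        && (value.compareTo(rg.min) < 0 || value.compareTo(rg.max) > 0)) {
      return false; // value lies outside [min, max]
    }
    if (levels.contains(Level.DICTIONARY) && !rg.dictionary.contains(value)) {
      return false; // value is not among the distinct values
    }
    return true;
  }

  public static void main(String[] args) {
    // Low-cardinality column: only three distinct values in this row group.
    RowGroup rg = new RowGroup("DE", "US", "DE", "JP", "US");

    // "FR" falls inside [DE, US], so min/max statistics cannot rule it out.
    System.out.println(mightMatch(rg, "FR", EnumSet.of(Level.STATISTICS))); // true

    // With the dictionary level enabled, the row group can be skipped.
    System.out.println(
        mightMatch(rg, "FR", EnumSet.of(Level.STATISTICS, Level.DICTIONARY))); // false
  }
}
```

In this model, a reader that only ever passes Level.STATISTICS corresponds to the hard-coded behaviour described above, while a reader that passes both levels corresponds to a caller that supplies its own level list.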

People

Assignee: Unassigned

Reporter: Ivan Gozali

Votes: 1

Watchers: 8

Dates

Created: 16/Nov/17 01:18

Updated: 29/Sep/19 12:14

Resolved: 21/May/19 04:12