Description
The VectorizedParquetRecordReader currently uses only statistics filtering and does not make use of Parquet's dictionary filtering. Dictionary filtering would be very useful for string/binary columns that have low cardinality.
Some relevant code paths:
- https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L367-L387 When vectorizedReader is enabled, the code will use VectorizedParquetRecordReader, which uses SpecificParquetRecordReaderBase below
- https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L109 This is where the row group filtering is being performed. It calls the method below
- https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java#L64-L70
The RowGroupFilter constructor that Spark's VectorizedParquetRecordReader ends up using is deprecated, and it hard-codes the filter levels to FilterLevel.STATISTICS only:
    @Deprecated
    private RowGroupFilter(List<BlockMetaData> blocks, MessageType schema) {
      this.blocks = checkNotNull(blocks, "blocks");
      this.schema = checkNotNull(schema, "schema");
      this.levels = Collections.singletonList(FilterLevel.STATISTICS);
      this.reader = null;
    }
Compare this to org.apache.parquet.hadoop.ParquetRecordReader.initialize(), which uses the second RowGroupFilter constructor that allows it to set the FilterLevel. Relevant code here:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordReader.java#L166-L182
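To illustrate why the extra filter level matters for low-cardinality string columns, here is a minimal, self-contained sketch. It does not use the parquet-mr API: the `Level` enum, the `RowGroup` class, and the `canDrop` and `keep` methods are hypothetical stand-ins that model an equality predicate checked against per-row-group min/max statistics and, optionally, the set of dictionary values. It also assumes every page of the column chunk is dictionary-encoded, which a dictionary check presumably has to verify before it can safely drop a row group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictionaryFilterSketch {
    // Local stand-in for the idea behind parquet-mr's FilterLevel (not the real enum).
    enum Level { STATISTICS, DICTIONARY }

    // Hypothetical, simplified metadata for one string column in one row group.
    static final class RowGroup {
        final String name;
        final String min;
        final String max;
        final Set<String> dictionary; // assumed: the column chunk is fully dictionary-encoded

        RowGroup(String name, String min, String max, String... dictionary) {
            this.name = name;
            this.min = min;
            this.max = max;
            this.dictionary = new HashSet<>(Arrays.asList(dictionary));
        }
    }

    // Returns true when the row group cannot contain rows matching `column == value`.
    static boolean canDrop(RowGroup rg, String value, Set<Level> levels) {
        if (levels.contains(Level.STATISTICS)
                && (value.compareTo(rg.min) < 0 || value.compareTo(rg.max) > 0)) {
            return true; // value lies outside [min, max]
        }
        if (levels.contains(Level.DICTIONARY) && !rg.dictionary.contains(value)) {
            return true; // value never occurs in this row group's dictionary
        }
        return false;
    }

    // Names of the row groups that still have to be read for `column == value`.
    static List<String> keep(List<RowGroup> groups, String value, Set<Level> levels) {
        List<String> kept = new ArrayList<>();
        for (RowGroup rg : groups) {
            if (!canDrop(rg, value, levels)) {
                kept.add(rg.name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<RowGroup> groups = Arrays.asList(
            new RowGroup("rg0", "DE", "US", "DE", "FR", "US"),
            new RowGroup("rg1", "AT", "ZA", "AT", "ZA"),
            new RowGroup("rg2", "AT", "CH", "AT", "CH"));

        // Statistics only: rg1 survives because "FR" falls inside [AT, ZA].
        System.out.println(keep(groups, "FR", EnumSet.of(Level.STATISTICS)));  // [rg0, rg1]
        // Statistics plus dictionary: rg1 is dropped because "FR" is not in its dictionary.
        System.out.println(keep(groups, "FR", EnumSet.allOf(Level.class)));    // [rg0]
    }
}
```

In this toy model the wide min/max range of rg1 defeats statistics filtering, while the dictionary check still eliminates it. This is the situation the issue describes for low-cardinality string/binary columns.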