Spark / SPARK-22536

VectorizedParquetRecordReader doesn't use Parquet's dictionary filtering feature


Details

    Description

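      The VectorizedParquetRecordReader currently uses only statistics filtering and does not use Parquet's dictionary filtering. Dictionary filtering would be very useful for low-cardinality string/binary columns: when a column chunk is dictionary-encoded, a row group can be skipped if the filter value is absent from the dictionary, even when min/max statistics cannot rule it out.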

      Some relevant code paths:

      The RowGroupFilter constructor used in Spark's VectorizedParquetRecordReader is deprecated and hard-codes the filter level to FilterLevel.STATISTICS only:

        @Deprecated
        private RowGroupFilter(List<BlockMetaData> blocks, MessageType schema) {
          this.blocks = checkNotNull(blocks, "blocks");
          this.schema = checkNotNull(schema, "schema");
          // Only statistics filtering is enabled; dictionary filtering is never applied.
          this.levels = Collections.singletonList(FilterLevel.STATISTICS);
          this.reader = null;
        }

      Compare this to org.apache.parquet.hadoop.ParquetRecordReader.initialize(), which uses the second RowGroupFilter constructor that allows it to set the FilterLevel. Relevant code here:
      https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordReader.java#L166-L182
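
To illustrate why the filter level matters, here is a minimal, self-contained sketch in plain Java. It is a toy model, not Parquet's implementation: the class `DictionaryFilterSketch`, the `RowGroup` type, and the helpers `statsMightMatch`, `dictionaryMightMatch`, and `sampleRowGroups` are hypothetical names invented for this example. Only the level names STATISTICS and DICTIONARY mirror Parquet's FilterLevel. It assumes a single string column, an equality predicate, and row groups whose column chunk is fully dictionary-encoded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictionaryFilterSketch {
    // Level names mirror Parquet's FilterLevel; everything else is a toy model.
    enum FilterLevel { STATISTICS, DICTIONARY }

    // Hypothetical stand-in for one row group's metadata for a single string column.
    static final class RowGroup {
        final String min;
        final String max;
        final Set<String> dictionary; // distinct values; assumes full dictionary encoding

        RowGroup(String min, String max, String... dictionary) {
            this.min = min;
            this.max = max;
            this.dictionary = new HashSet<>(Arrays.asList(dictionary));
        }
    }

    // Statistics check: the value can only be present if it lies within [min, max].
    static boolean statsMightMatch(RowGroup rg, String value) {
        return value.compareTo(rg.min) >= 0 && value.compareTo(rg.max) <= 0;
    }

    // Dictionary check: the value can only be present if the dictionary contains it.
    static boolean dictionaryMightMatch(RowGroup rg, String value) {
        return rg.dictionary.contains(value);
    }

    // Keeps the row groups that every enabled level says might contain the value.
    static List<RowGroup> filterRowGroups(Set<FilterLevel> levels, String value,
                                          List<RowGroup> rowGroups) {
        List<RowGroup> kept = new ArrayList<>();
        for (RowGroup rg : rowGroups) {
            boolean keep = true;
            if (levels.contains(FilterLevel.STATISTICS)) {
                keep = statsMightMatch(rg, value);
            }
            if (keep && levels.contains(FilterLevel.DICTIONARY)) {
                keep = dictionaryMightMatch(rg, value);
            }
            if (keep) {
                kept.add(rg);
            }
        }
        return kept;
    }

    // Made-up sample data: three row groups of a low-cardinality string column.
    static List<RowGroup> sampleRowGroups() {
        return Arrays.asList(
            new RowGroup("apple", "zebra", "apple", "zebra"),          // range covers "mango", dictionary does not
            new RowGroup("kiwi", "peach", "kiwi", "mango", "peach"),   // really contains "mango"
            new RowGroup("apricot", "banana", "apricot", "banana"));   // range excludes "mango"
    }

    public static void main(String[] args) {
        List<RowGroup> rowGroups = sampleRowGroups();
        int statsOnly = filterRowGroups(
            EnumSet.of(FilterLevel.STATISTICS), "mango", rowGroups).size();
        int withDictionary = filterRowGroups(
            EnumSet.of(FilterLevel.STATISTICS, FilterLevel.DICTIONARY), "mango", rowGroups).size();
        System.out.println("row groups read with STATISTICS only: " + statsOnly);
        System.out.println("row groups read with STATISTICS + DICTIONARY: " + withDictionary);
    }
}
```

With this sample data, statistics filtering alone still reads two of the three row groups for the predicate `value = "mango"`, while adding the dictionary level leaves only the one row group that actually contains the value.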

          People

            Assignee: Unassigned
            Reporter: Ivan Gozali (igozali)