Spark / SPARK-42256

SPIP: Lazy Materialization for Parquet Read Performance Improvement


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Spark SQL filter operations are a common workload for selecting specific rows from persisted data. The current Spark implementation materializes all read values (i.e., decompresses and decodes them) into memory before applying the filters. As a result, the filters may discard many of the materialized values, wasting the computation spent on them. Evaluating the filters first and lazily materializing only the values that survive them avoids this waste and improves read performance. Lazy materialization is already employed by other distributed SQL engines such as Velox and Presto/Trino, but this approach has not yet been brought to Spark's Parquet reader.
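      The idea can be illustrated with a minimal, self-contained sketch (not Spark's actual reader API; all names here are hypothetical): an eager scan decodes every column before filtering, while a lazy scan decodes only the filter column, evaluates the predicate, and then materializes the remaining columns solely for the surviving row ids.

```python
def decode(encoded_column, row_ids=None):
    """Stand-in for expensive decompression/decoding of a Parquet column chunk.

    Decodes only the requested row ids; decoding everything is the
    wasted work that lazy materialization avoids.
    """
    if row_ids is None:
        row_ids = range(len(encoded_column))
    return {i: encoded_column[i] for i in row_ids}

def eager_scan(columns, filter_col, predicate):
    # Current behavior: materialize every column fully, then filter.
    decoded = {name: decode(col) for name, col in columns.items()}
    keep = [i for i, v in decoded[filter_col].items() if predicate(v)]
    return [{name: decoded[name][i] for name in columns} for i in keep]

def lazy_scan(columns, filter_col, predicate):
    # Proposed behavior: decode only the filter column first, evaluate the
    # predicate, then materialize the surviving rows of the other columns.
    filter_vals = decode(columns[filter_col])
    keep = [i for i, v in filter_vals.items() if predicate(v)]
    others = {name: decode(col, keep)
              for name, col in columns.items() if name != filter_col}
    return [{filter_col: filter_vals[i],
             **{name: others[name][i] for name in others}} for i in keep]

# Both scans return the same rows; the lazy one decodes "payload" for
# only the rows whose "id" passes the predicate.
columns = {"id": [1, 2, 3, 4], "payload": ["a", "b", "c", "d"]}
assert eager_scan(columns, "id", lambda v: v > 2) == \
       lazy_scan(columns, "id", lambda v: v > 2)
```

      With a selective predicate, the savings scale with the fraction of rows filtered out: the non-filter columns are decoded for the matching rows only.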

      SPIP: https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME


            People

              Assignee: Unassigned
              Reporter: Kazuyuki Tanimura
              Shepherd: L. C. Hsieh
              Votes: 1
              Watchers: 21
