Spark / SPARK-42256

SPIP: Lazy Materialization for Parquet Read Performance Improvement


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Spark SQL filter operations are a common workload for selecting specific rows from persisted data. The current Spark implementation materializes all read values (i.e., decompresses and decodes them) into memory before applying the filters. As a result, the filters may discard many of the materialized values, wasting the computation spent on them. Evaluating the filters first and lazily materializing only the values that survive them avoids this waste and improves read performance. Lazy materialization is already employed by other distributed SQL engines such as Velox and Presto/Trino, but this approach has not yet been brought to Spark's Parquet reader.
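      The idea can be illustrated with a minimal, self-contained sketch (not Spark's actual reader API; all names here are hypothetical): an eager scan decodes every column before filtering, while a lazy scan decodes only the filter column, evaluates the predicate, and then materializes the remaining columns solely for the surviving row ids.

```python
def decode(encoded_column, row_ids=None):
    """Stand-in for expensive decompression/decoding of a Parquet column chunk.

    Decodes only the requested row ids; decoding everything is the
    wasted work that lazy materialization avoids.
    """
    if row_ids is None:
        row_ids = range(len(encoded_column))
    return {i: encoded_column[i] for i in row_ids}

def eager_scan(columns, filter_col, predicate):
    # Current behavior: materialize every column fully, then filter.
    decoded = {name: decode(col) for name, col in columns.items()}
    keep = [i for i, v in decoded[filter_col].items() if predicate(v)]
    return [{name: decoded[name][i] for name in columns} for i in keep]

def lazy_scan(columns, filter_col, predicate):
    # Proposed behavior: decode only the filter column first, evaluate the
    # predicate, then materialize the surviving rows of the other columns.
    filter_vals = decode(columns[filter_col])
    keep = [i for i, v in filter_vals.items() if predicate(v)]
    others = {name: decode(col, keep)
              for name, col in columns.items() if name != filter_col}
    return [{filter_col: filter_vals[i],
             **{name: others[name][i] for name in others}} for i in keep]

# Both scans return the same rows; the lazy one decodes "payload" for
# only the rows whose "id" passes the predicate.
columns = {"id": [1, 2, 3, 4], "payload": ["a", "b", "c", "d"]}
assert eager_scan(columns, "id", lambda v: v > 2) == \
       lazy_scan(columns, "id", lambda v: v > 2)
```

      With a selective predicate, the savings scale with the fraction of rows filtered out: the non-filter columns are decoded for the matching rows only.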

      SPIP: https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME


            People

              Assignee: Unassigned
              Reporter: Kazuyuki Tanimura
              Shepherd: L. C. Hsieh
              Votes: 1
              Watchers: 21
