[SPARK-36529] Decouple CPU with IO work in vectorized Parquet reader - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

Currently, the vectorized Parquet reader appears to do almost everything sequentially:
1. Read the row group through the file system API (possibly from remote storage such as S3).
2. Allocate buffers and copy the row group bytes into them.
3. Decompress the data pages.
4. In Spark, decode the columns being read one by one.
5. Read the next row group and repeat from step 1.

Much can be done to decouple the IO-intensive work from the CPU-intensive work. In addition, we could parallelize row group loading and column decoding so that a Spark task utilizes all the cores available to it.
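
To make the idea concrete, here is a minimal sketch of one possible shape for the decoupling: the next row group is fetched on an IO thread while the current one is decoded, and the columns of a row group are decoded on a separate pool. This is not the Spark or parquet-mr implementation. Python is used only for brevity, and every name in it (`make_row_groups`, `fetch_row_group`, `decode_column`, `read_pipelined`) is hypothetical, as is the toy "file" made of zlib-compressed 8-byte integers.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def make_row_groups(num_groups, columns, rows_per_group):
    """Build a toy in-memory 'file': a list of row groups, each mapping
    column name -> zlib-compressed 8-byte little-endian integers.
    (Hypothetical stand-in for real Parquet data.)"""
    groups = []
    for g in range(num_groups):
        group = {}
        for c, col in enumerate(columns):
            values = [g * rows_per_group + r + c for r in range(rows_per_group)]
            raw = b"".join(v.to_bytes(8, "little") for v in values)
            group[col] = zlib.compress(raw)
        groups.append(group)
    return groups


def fetch_row_group(storage, index):
    """IO-bound part (steps 1-2): stands in for a remote read, e.g. from S3."""
    return storage[index]


def decode_column(compressed):
    """CPU-bound part (steps 3-4): decompress, then decode the integers."""
    raw = zlib.decompress(compressed)
    return [int.from_bytes(raw[i:i + 8], "little") for i in range(0, len(raw), 8)]


def read_pipelined(storage, columns, decode_threads=4):
    """Yield decoded row groups while the next group's read is in flight."""
    if not storage:
        return
    with ThreadPoolExecutor(max_workers=1) as io_pool, \
            ThreadPoolExecutor(max_workers=decode_threads) as cpu_pool:
        pending = io_pool.submit(fetch_row_group, storage, 0)
        for index in range(len(storage)):
            group = pending.result()  # wait for the current row group
            if index + 1 < len(storage):
                # Start the next read before decoding the current group.
                pending = io_pool.submit(fetch_row_group, storage, index + 1)
            # Decode the columns of this row group on the CPU pool.
            decoded = list(cpu_pool.map(decode_column, (group[c] for c in columns)))
            yield dict(zip(columns, decoded))


if __name__ == "__main__":
    cols = ["a", "b"]
    for row_group in read_pipelined(make_row_groups(3, cols, 4), cols):
        print(row_group)
```

In CPython the pure-Python decode loop is serialized by the GIL, so the sketch shows the structure rather than the speedup; a JVM implementation would get real parallelism from the decode pool. The prefetch depth of one row group is also an assumption, chosen to bound memory use.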

Issue Links

is related to: SPARK-35743 Improve Parquet vectorized reader (Resolved)

People

Assignee: Unassigned

Reporter: Chao Sun

Votes: 0

Watchers: 12

Dates

Created: 16/Aug/21 18:26

Updated: 06/Jan/23 22:27