[ARROW-13797] [C++] Implement column projection pushdown to ORC reader in Datasets API - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 6.0.0
Component/s: C++
Labels:

External issue URL:
https://github.com/apache/arrow/issues/29423

Description

~~ARROW-13572~~ (https://github.com/apache/arrow/pull/10991) added basic support for the ORC file format in the Datasets API, but the reader still reads all columns regardless of the projection requested in the ScanOptions. Since ORC is a columnar format that supports reading only specific fields, we can optimize this step by pushing the column selection down to the ORC reader.

The tricky part is converting the field names of the Arrow schema to the corresponding column indices in the ORC schema. Currently, this logic lives in the Python bindings (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59), so it needs to be moved to C++.
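To illustrate the name-to-index conversion, here is a minimal sketch in pure Python. It is not the pyarrow or Arrow C++ implementation. The schema model (dicts for structs, one-element lists for list types, strings for primitives) and the function names `traverse`, `schema_to_indices`, and `columns_to_indices` are hypothetical, chosen only for this example. It relies on the ORC convention that column ids are assigned by a pre-order walk of the type tree, with the root struct as column 0.

```python
from itertools import count

# Hypothetical, simplified schema model (not the Arrow or ORC API):
#   dict           -> struct, mapping field name to field type
#   [item_type]    -> list with a single item type
#   any string     -> primitive type
SCHEMA = {
    "id": "int64",
    "point": {"x": "double", "y": "double"},
    "tags": ["string"],
    "name": "string",
}


def traverse(typ, counter):
    """Yield (path, column_id) pairs, consuming ids in pre-order."""
    if isinstance(typ, dict):
        for field_name, field_type in typ.items():
            column_id = next(counter)  # id of the field itself
            yield (field_name,), column_id
            for sub_path, sub_id in traverse(field_type, counter):
                yield (field_name,) + sub_path, sub_id
    elif isinstance(typ, list):
        # The list's item occupies one column id but has no name of its
        # own, so the id is consumed without being exposed.
        next(counter)
        yield from traverse(typ[0], counter)
    # Primitive types have no children, so nothing more is yielded.


def schema_to_indices(schema):
    """Map dotted field paths to column ids (root struct is column 0)."""
    return {".".join(path): cid for path, cid in traverse(schema, count(1))}


def columns_to_indices(schema, columns):
    """Translate requested field paths into column ids, in request order."""
    mapping = schema_to_indices(schema)
    missing = [c for c in columns if c not in mapping]
    if missing:
        raise KeyError(f"unknown field(s): {missing}")
    return [mapping[c] for c in columns]


if __name__ == "__main__":
    print(schema_to_indices(SCHEMA))
    print(columns_to_indices(SCHEMA, ["name", "point.x"]))
```

In this sketch `name` maps to column 7 rather than 6, because the unnamed item of the `tags` list consumes an id of its own. Nested types like this are why a plain positional index into the top-level Arrow schema is not enough.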

Issue Links

links to

GitHub Pull Request #11372

People

Assignee: Joris Van den Bossche

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 3

Dates

Created: 30/Aug/21 16:11

Updated: 11/Jan/23 08:35

Resolved: 11/Oct/21 15:48

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 10m