[BEAM-14540] Native implementation for serialized Rows to/from Arrow - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: P2
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: sdk-py-core
Labels: None

Description

With https://s.apache.org/batched-dofns (BEAM-14213), we want to encourage users to develop pipelines that process Arrow data within the Python SDK. However, communicating batches of data across SDKs, or from an SDK to a runner, is left as future work, so Arrow data processed in the SDK must be converted to and from Rows for transmission over the Fn API. The current ideal Python execution therefore looks like:

1. Read row-oriented data over the Fn API and deserialize it with SchemaCoder
2. Buffer rows and construct an Arrow RecordBatch/Table object
3. Perform user computation(s)
4. Explode the output RecordBatch/Table into rows
5. Serialize rows with SchemaCoder and write them out over the Fn API

Note that steps (1, 2) and (4, 5) will exist in every stage of the user's pipeline, and they will also exist when Python transforms (e.g. the DataFrame read_csv) are used from other SDKs. We should improve performance on this hot path by providing a native (cythonized) implementation of (1, 2) and (4, 5).
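
To make the shape of the hot path concrete, here is a minimal pure-Python sketch of steps 2 through 4. It is an illustration under stated assumptions, not the proposed implementation: plain Python lists stand in for Arrow arrays, rows are assumed to be already-decoded tuples (the SchemaCoder work in steps 1 and 5 is omitted), and the names rows_to_columns, columns_to_rows, process_batched and double_x are hypothetical. The per-element loops in the two pivot functions are the kind of work a native implementation would aim to replace.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Sequence, Tuple

# Assumption: a decoded row is a plain tuple, and a "batch" is a dict of
# column name -> list of values (a stand-in for an Arrow RecordBatch).
Row = Tuple[Any, ...]
Columns = Dict[str, List[Any]]


def rows_to_columns(rows: Iterable[Row], field_names: Sequence[str]) -> Columns:
    """Step 2: pivot buffered rows into one list per field."""
    columns: Columns = {name: [] for name in field_names}
    for row in rows:
        if len(row) != len(field_names):
            raise ValueError("row width does not match the schema")
        for name, value in zip(field_names, row):
            columns[name].append(value)
    return columns


def columns_to_rows(columns: Columns) -> Iterator[Row]:
    """Step 4: explode a columnar batch back into row tuples."""
    return zip(*columns.values())


def process_batched(
    rows: Iterable[Row],
    field_names: Sequence[str],
    batch_size: int,
    fn: Callable[[Columns], Columns],
) -> Iterator[Row]:
    """Buffer rows, apply a columnar function (step 3), and emit rows again."""
    buffer: List[Row] = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == batch_size:
            yield from columns_to_rows(fn(rows_to_columns(buffer, field_names)))
            buffer = []
    if buffer:  # flush the final, possibly smaller, batch
        yield from columns_to_rows(fn(rows_to_columns(buffer, field_names)))


def double_x(columns: Columns) -> Columns:
    """Hypothetical user computation: double every value in column 'x'."""
    return {**columns, "x": [value * 2 for value in columns["x"]]}


if __name__ == "__main__":
    decoded_rows = [(1, "a"), (2, "b"), (3, "c")]
    for out_row in process_batched(decoded_rows, ["x", "name"], 2, double_x):
        print(out_row)
```

In this sketch every value is touched once on the way into the batch and once on the way out, for every stage, which is why the conversion cost is paid regardless of how cheap the user computation is.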

Issue Links

relates to: BEAM-14213 Add support for Batched DoFns in the Python SDK (Open)

People

Assignee: Unassigned

Reporter: Brian Hulette

Votes: 0

Watchers: 2

Dates

Created: 31/May/22 23:10

Updated: 05/Jun/22 01:10