Spark / SPARK-35801 SPIP: Row-level operations in Data Source V2 / SPARK-43775

DataSource V2: Allow representing updates as deletes and inserts


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5.0
    • Fix Version/s: 3.5.0
    • Component/s: SQL
    • Labels: None

    Description

      It may be beneficial for delta-based data source implementations to represent updates as deletes and inserts. Specifically, doing so can help properly distribute and order records on write. Remember that delete records have only row ID and metadata attributes set, update records have data, row ID, and metadata attributes set, and insert records have only data attributes set.
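The attribute shapes of the three record kinds can be modeled as follows. This is an illustrative Python sketch, not Spark's actual API; the field names are assumptions chosen for the example.

```python
# Illustrative model (not Spark's API) of which attribute groups are set
# on each record kind in a delta-based row-level operation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeltaRecord:
    operation: str                   # "delete", "update", or "insert"
    data: Optional[dict] = None      # data attributes
    row_id: Optional[int] = None     # row ID attributes (e.g. _row_id)
    metadata: Optional[dict] = None  # metadata attributes

# Delete records carry only row ID and metadata attributes.
delete_rec = DeltaRecord("delete", row_id=7, metadata={"file": "f1"})
# Update records carry data, row ID, and metadata attributes.
update_rec = DeltaRecord("update", data={"product_id": 1, "qty": 5},
                         row_id=7, metadata={"file": "f1"})
# Insert records carry only data attributes.
insert_rec = DeltaRecord("insert", data={"product_id": 1, "qty": 5})
```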

      For instance, suppose a data source relies on a metadata column _row_id (a synthetic, internally generated column) to identify rows and is partitioned by bucket(product_id). Splitting updates into inserts and deletes would allow the data source to cluster all update and insert records for the same partition into a single task. Otherwise, the clustering key for updates differs from the one for inserts (updates have _row_id set while inserts do not). This is critical for reducing the number of generated files.
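Why the split helps can be sketched as follows. The helper names here are hypothetical, and `bucket` is a stand-in for the partition transform, not Spark's implementation:

```python
# Illustrative sketch (hypothetical helpers, not Spark's API): splitting an
# update into a delete and an insert so the insert half clusters with plain
# inserts by bucket(product_id) rather than by _row_id.

def bucket(num_buckets, value):
    # Stand-in for the bucket(product_id) partition transform.
    return hash(value) % num_buckets

def clustering_key(record):
    # Records that carry a row ID (deletes, unsplit updates) must be routed
    # by that ID; pure data records route by the partition transform.
    if "_row_id" in record:
        return ("_row_id", record["_row_id"])
    return ("bucket", bucket(16, record["product_id"]))

def split_update(update):
    delete = {"op": "delete", "_row_id": update["_row_id"]}
    insert = {"op": "insert", "product_id": update["product_id"],
              "qty": update["qty"]}
    return delete, insert

update = {"op": "update", "_row_id": 7, "product_id": 1, "qty": 5}
insert = {"op": "insert", "product_id": 1, "qty": 9}

# Unsplit, the update is keyed by _row_id and cannot cluster with the insert.
assert clustering_key(update) != clustering_key(insert)

# After the split, the insert half shares the insert's clustering key, so
# both can land in the same task and produce fewer output files.
delete_half, insert_half = split_update(update)
assert clustering_key(insert_half) == clustering_key(insert)
```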


          People

            Assignee: aokolnychyi Anton Okolnychyi
            Reporter: aokolnychyi Anton Okolnychyi
            Votes: 0
            Watchers: 3
