Spark / SPARK-35801 SPIP: Row-level operations in Data Source V2 / SPARK-43775

DataSource V2: Allow representing updates as deletes and inserts


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5.0
    • Fix Version/s: 3.5.0
    • Component/s: SQL
    • Labels: None

    Description

      It may be beneficial for delta-based data source implementations to represent updates as deletes and inserts. Specifically, doing so can help properly distribute and order records on write. Remember that delete records have only row ID and metadata attributes set, update records have data, row ID, and metadata attributes set, and insert records have only data attributes set.
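The attribute shapes of the three record kinds can be modeled as follows. This is an illustrative Python sketch, not Spark's actual API; the field names are assumptions chosen for the example.

```python
# Illustrative model (not Spark's API) of which attribute groups are set
# on each record kind in a delta-based row-level operation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeltaRecord:
    operation: str                   # "delete", "update", or "insert"
    data: Optional[dict] = None      # data attributes
    row_id: Optional[int] = None     # row ID attributes (e.g. _row_id)
    metadata: Optional[dict] = None  # metadata attributes

# Delete records carry only row ID and metadata attributes.
delete_rec = DeltaRecord("delete", row_id=7, metadata={"file": "f1"})
# Update records carry data, row ID, and metadata attributes.
update_rec = DeltaRecord("update", data={"product_id": 1, "qty": 5},
                         row_id=7, metadata={"file": "f1"})
# Insert records carry only data attributes.
insert_rec = DeltaRecord("insert", data={"product_id": 1, "qty": 5})
```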

      For instance, suppose a data source relies on a metadata column _row_id (a synthetic, internally generated column) to identify rows and is partitioned by bucket(product_id). Splitting updates into inserts and deletes would allow the data source to cluster all update and insert records for the same partition into a single task. Otherwise, the clustering key for updates differs from the one for inserts (updates have _row_id set while inserts do not). This is critical for reducing the number of generated files.
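Why the split helps can be sketched as follows. The helper names here are hypothetical, and `bucket` is a stand-in for the partition transform, not Spark's implementation:

```python
# Illustrative sketch (hypothetical helpers, not Spark's API): splitting an
# update into a delete and an insert so the insert half clusters with plain
# inserts by bucket(product_id) rather than by _row_id.

def bucket(num_buckets, value):
    # Stand-in for the bucket(product_id) partition transform.
    return hash(value) % num_buckets

def clustering_key(record):
    # Records that carry a row ID (deletes, unsplit updates) must be routed
    # by that ID; pure data records route by the partition transform.
    if "_row_id" in record:
        return ("_row_id", record["_row_id"])
    return ("bucket", bucket(16, record["product_id"]))

def split_update(update):
    delete = {"op": "delete", "_row_id": update["_row_id"]}
    insert = {"op": "insert", "product_id": update["product_id"],
              "qty": update["qty"]}
    return delete, insert

update = {"op": "update", "_row_id": 7, "product_id": 1, "qty": 5}
insert = {"op": "insert", "product_id": 1, "qty": 9}

# Unsplit, the update is keyed by _row_id and cannot cluster with the insert.
assert clustering_key(update) != clustering_key(insert)

# After the split, the insert half shares the insert's clustering key, so
# both can land in the same task and produce fewer output files.
delete_half, insert_half = split_update(update)
assert clustering_key(insert_half) == clustering_key(insert)
```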


          People

            Assignee: aokolnychyi Anton Okolnychyi
            Reporter: aokolnychyi Anton Okolnychyi
            Votes: 0
            Watchers: 3
