Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-16258

Suggestion: simplify type 2 SCDs with this non-standard extension to MERGE

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 2.2.0
    • None
    • Transactions
    • None

    Description

      Some common data maintenance strategies, especially the Type 2 SCD update, would become substantially easier with a small extension to the SQL standard for MERGE, specifically the ability to say "when matched then insert". Per the standard, matched records can only be updated or deleted.

      In the Type 2 SCD, when a new record comes in you update the old version of the record and insert the new version of the same record. If this extension were supported, sample Type 2 SCD code would look as follows:

      merge into customer
      using new_customer_stage stage
      on stage.source_pk = customer.source_pk
      when not matched then insert values    /* Insert a net new record */
        (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, true, null)
      when matched then update set           /* Update an old record to mark it as out-of-date */
        is_current = false, end_date = current_date()
      when matched then insert values        /* Insert a new current record */
        (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, true, null);
      

      Without this support, the user needs to devise some sort of workaround. A common approach is to first left join the staging table against the table to be updated, then to join these results to a helper table that will spit out two records for each match and one record for each miss. One of the matching records needs to have a join key that can never occur in the source data so this requires precise knowledge of the source dataset.

      An example of this:

      merge into customer
      using (
        select
          *,
          coalesce(invalid_key, source_pk) as join_key
        from (
          select
            stage.source_pk, stage.name, stage.state,
            case when customer.source_pk is null then 1
            when stage.name <> customer.name or stage.state <> customer.state then 2
            else 0 end as scd_row_type
          from
            new_customer_stage stage
          left join
            customer
          on (stage.source_pk = customer.source_pk and customer.is_current = true)
        ) updates
        join scd_types on scd_types.type = scd_row_type
      ) sub
      on sub.join_key = customer.source_pk
      when matched then update set
        is_current = false,
        end_date = current_date()
      when not matched then insert values
        (sub.source_pk, upper(substr(sub.name, 0, 3)), sub.name, sub.state, true, null);
      select * from customer order by source_pk;
      

      This code is very complicated and will fail if the "invalid" key ever shows up in the source dataset. This simple extension provides a lot of value and likely very little maintenance overhead.

      /cc ekoifman

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              cartershanklin Carter Shanklin
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated: