Hive / HIVE-14841 Replication - Phase 2 / HIVE-17608

REPL LOAD should overwrite data files if they exist instead of duplicating them

Details

    Description

      This is to make insert events idempotent.

      Currently, MoveTask creates a new file if the destination folder already contains a file of the same name. This is wrong when the same file appears in both the bootstrap dump and an incremental dump: by design, the duplicate file in the incremental dump should be ignored for idempotency, but instead we eventually end up with duplicate files. It is also wrong to simply retain the filename used in the staging folder. Suppose we receive the same insert event twice: the first time we get the file from the source table folder, and the second time we get it from cm. The staged names can differ, so we still end up with a duplicate copy. The right solution is to keep the same file name as in the source table folder.
      To do that, we can put the original filename in MoveWork and, in MoveTask, if the original filename is set, skip generating a new name and simply overwrite the existing file. This needs to be done for both bootstrap and incremental load.
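
      The naming rule proposed above can be sketched as follows. This is a minimal illustration and not Hive code: the class `IdempotentMoveSketch`, the methods `targetName` and `move`, the `originalName` parameter, the `_copy_N` suffix, and the sample file names are all assumptions made for the example, and the destination folder is modelled as an in-memory map rather than a real file system.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: models a destination folder as a map of file name -> contents. */
public class IdempotentMoveSketch {

    /**
     * Picks the destination file name for a staged file.
     * originalName stands in for the field the description proposes to carry
     * in the move work; null means "not set".
     */
    public static String targetName(Map<String, String> destDir,
                                    String stagedName, String originalName) {
        if (originalName != null) {
            // Replication path: reuse the source table's file name, even if it exists.
            return originalName;
        }
        // Default path: never clobber; derive a fresh name on collision.
        if (!destDir.containsKey(stagedName)) {
            return stagedName;
        }
        int n = 1;
        while (destDir.containsKey(stagedName + "_copy_" + n)) {
            n++;
        }
        return stagedName + "_copy_" + n;
    }

    /** "Moves" a staged file into destDir; put() overwrites an entry of the same name. */
    public static void move(Map<String, String> destDir, String stagedName,
                            String contents, String originalName) {
        destDir.put(targetName(destDir, stagedName, originalName), contents);
    }

    public static void main(String[] args) {
        // Without the original name, replaying the same event duplicates the data.
        Map<String, String> withoutOriginal = new HashMap<>();
        move(withoutOriginal, "000000_0", "rows", null);
        move(withoutOriginal, "000000_0", "rows", null);
        System.out.println(withoutOriginal.size()); // 2

        // With the original name, the replay overwrites the first copy,
        // even though the second staged file arrived under a different name.
        Map<String, String> withOriginal = new HashMap<>();
        move(withOriginal, "000000_0", "rows", "000000_0");
        move(withOriginal, "a1b2c3_staged", "rows", "000000_0");
        System.out.println(withOriginal.size()); // 1
    }
}
```

      The second case mirrors the scenario in the description: the same insert event is loaded twice with differently named staged files, and keying the destination on the source table's file name is what makes the load idempotent.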

      Attachments

        1. HIVE-17608.01.patch
          45 kB
          Sankar Hariappan
        2. HIVE-17608.02.patch
          45 kB
          Sankar Hariappan

            People

              sankarh Sankar Hariappan
              sankarh Sankar Hariappan
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: