Hive / HIVE-14841

Replication - Phase 2

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: repl
    • Labels: None

    Description

      Per the email sent to the dev list, the current implementation of replication in Hive has certain drawbacks, for instance:

      • Replication follows a rubberbanding pattern: different tables/partitions can be in different (mixed) states on the destination, so unless all events have been caught up on, the destination is not an equivalent warehouse. This satisfies only DR (disaster recovery) cases, not load-balancing use cases, and the secondary warehouse is really only a backup rather than a live warehouse that trails the primary.
      • The base implementation is naive and has several performance problems, including a large amount of data duplication for subsequent events (as mentioned in HIVE-13348) and having to copy out entire partitions/tables when just a delta of files might be sufficient. Also, using EXPORT/IMPORT allows a simple implementation, but at the cost of large amounts of temporary space, much of which is never actually applied at the destination.

      Thus, to track this, we are creating a new branch (repl2) and an uber-JIRA (this one) to track experimental development towards improving this situation.
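
      The second drawback above (copying out entire partitions when a delta of files might be sufficient) can be illustrated with a small sketch. This is not Hive's implementation; the `FileListing` type, the helper names, and the sample paths and checksums are all hypothetical and exist only to show the idea of comparing per-file checksums between source and destination.

```python
from typing import Dict, List

# Hypothetical type: maps a file's relative path to its checksum.
FileListing = Dict[str, str]


def files_to_copy(source: FileListing, destination: FileListing) -> List[str]:
    """Return the paths that are new or changed on the source, sorted."""
    return sorted(
        path
        for path, checksum in source.items()
        if destination.get(path) != checksum
    )


def files_to_delete(source: FileListing, destination: FileListing) -> List[str]:
    """Return the paths on the destination that no longer exist on the source."""
    return sorted(path for path in destination if path not in source)


# Made-up listings for a single partition (illustrative values only).
source = {
    "ds=2016-09-01/000000_0": "a1",
    "ds=2016-09-01/000001_0": "b2",
    "ds=2016-09-01/000002_0": "c3",
}
destination = {
    "ds=2016-09-01/000000_0": "a1",
    "ds=2016-09-01/000001_0": "stale",
}

# Only the changed file and the new file need to be shipped,
# instead of all three files in the partition.
print(files_to_copy(source, destination))
```

      With these sample listings, only two of the three files are selected for copying; the unchanged file is skipped.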

      Sub-Tasks

          1. Bootstrap support for replv2 | Sub-task | Closed | Sushanth Sowmyan
          2. Extend JSONMessageFactory to store additional information about metadata objects on different table events | Sub-task | Closed | Vaibhav Gumashta
          3. Extend JSONMessageFactory to store additional information about Partition metadata objects on different partition events | Sub-task | Resolved | Vaibhav Gumashta
          4. Create ReplDumpTask/ReplDumpWork for dumping out metadata | Sub-task | Resolved | Vaibhav Gumashta
          5. Make changes to ReplicationSemanticAnalyzer to dump and load events stored in metastore | Sub-task | Resolved | Sushanth Sowmyan
          6. Add junit test to test replication scenarios | Sub-task | Closed | Sushanth Sowmyan
          7. Capture additional metadata to replicate a simple insert at destination | Sub-task | Closed | Vaibhav Gumashta
          8. REPL LOAD & DUMP support for incremental CREATE_TABLE/ADD_PTN | Sub-task | Closed | Sushanth Sowmyan
          9. Add a FetchTask to REPL DUMP plan for reading dump uri, last repl id as ResultSet | Sub-task | Closed | Vaibhav Gumashta
          10. Add more specific error codes to ReplicationSemanticAnalyzer's SemanticExceptions | Sub-task | Resolved | Vaibhav Gumashta
          11. Improve the pathname returned by ReplicationSemanticAnalyzer.getNextDumpDir | Sub-task | Open | Unassigned
          12. Investigate TestHCatClientNotification#createTable test failure | Sub-task | Resolved | Sushanth Sowmyan
          13. Enhance REPL dump bootstrap to write out notifications that occurred while bootstrap was generating initial dump (implementing ReplicationSemanticAnalyzer.consolidateEvent) | Sub-task | Open | Unassigned
          14. Add new methods to MessageFactory API (corresponding to the ones added in JSONMessageFactory) | Sub-task | Closed | Sushanth Sowmyan
          15. REPL LOAD & DUMP support for incremental INSERT events | Sub-task | Closed | Vaibhav Gumashta
          16. Fix order guarantee of event executions for REPL LOAD | Sub-task | Closed | Sushanth Sowmyan
          17. ChangeManager for replication | Sub-task | Closed | Daniel Dai
          18. Capture additional metadata to replicate multi-table and dynamic partition inserts at destination | Sub-task | Resolved | Vaibhav Gumashta
          19. REPL LOAD & DUMP support for incremental DROP_TABLE/DROP_PTN | Sub-task | Closed | Sushanth Sowmyan
          20. Fix REPL DUMP/LOAD DROP_PTN so it works on non-string-ptn-key tables | Sub-task | Closed | Vaibhav Gumashta
          21. Add file + checksum list for create table/partition during notification creation (whenever relevant) | Sub-task | Closed | Daniel Dai
          22. Move notification filtering to metastore server side | Sub-task | Open | Unassigned
          23. REPL LOAD & DUMP support for INSERT events with change management | Sub-task | Closed | Vaibhav Gumashta
          24. REPL LOAD & DUMP support for incremental ALTER_TABLE/ALTER_PTN including renames | Sub-task | Closed | Sushanth Sowmyan
          25. Hooking ChangeManager to "drop table", "drop partition" | Sub-task | Closed | Daniel Dai
          26. Refactor/cleanup TestReplicationScenario | Sub-task | Resolved | Sushanth Sowmyan
          27. Repl rename support adds unnecessary duplication for non-rename alters | Sub-task | Open | Sushanth Sowmyan
          28. Update db/table repl.last.id at the end of REPL LOAD of a batch of events | Sub-task | Closed | Sushanth Sowmyan
          29. Add versioning/format mechanism to NOTIFICATION_LOG entries, expand MESSAGE size | Sub-task | Closed | Sushanth Sowmyan
          30. Replicate functions | Sub-task | Resolved | Vaibhav Gumashta
          31. Replicate views | Sub-task | Resolved | Sankar Hariappan
          32. Using ChangeManager to copy files in ReplCopyTask | Sub-task | Resolved | Daniel Dai
          33. Replicate Insert Overwrites, Dynamic Partition Inserts and Loads | Sub-task | Closed | Sankar Hariappan
          34. Optimize(reduce) the number of alter calls made to fix repl.last.id | Sub-task | Patch Available | Sushanth Sowmyan
          35. change REPL DUMP syntax to use "LIMIT" instead of "BATCH" keyword | Sub-task | Closed | Sushanth Sowmyan
          36. Event replication for constraints | Sub-task | Closed | Daniel Dai
          37. Bootstrap replication for constraint | Sub-task | Open | Daniel Dai
          38. Hive REPL STATUS is not returning last event ID | Sub-task | Resolved | Sankar Hariappan
          39. Incremental REPL LOAD Inserts doesn't operate on the target database if name differs from source database. | Sub-task | Resolved | Sankar Hariappan
          40. Support replication of truncate table | Sub-task | Closed | Sankar Hariappan
          41. REPL DUMP shows last event ID of the database even if we use LIMIT option. | Sub-task | Closed | Sankar Hariappan
          42. Incremental insert into a partitioned table doesn't get replicated. | Sub-task | Closed | Sankar Hariappan
          43. Replicate views with proper query string when perform REPL LOAD on a renamed database. | Sub-task | Open | Aasha Medhi
          44. Table level REPL LOAD doesn't return a valid dump path. | Sub-task | Resolved | Sankar Hariappan
          45. Test and support replication of exchange partition | Sub-task | Closed | Sankar Hariappan
          46. Hook Change Manager to Truncate Table. | Sub-task | Resolved | Sankar Hariappan
          47. Support replicating into existing db if the db is empty | Sub-task | Closed | Sankar Hariappan
          48. Add HS2 operation logs and improve logs for REPL commands | Sub-task | Closed | Sankar Hariappan
          49. New Events created as part of replv2 potentially break replv1 | Sub-task | Closed | Sushanth Sowmyan
          50. Hook Change Manager to Insert Overwrite | Sub-task | Closed | Sankar Hariappan
          51. Enable concurrent RENAME during bootstrap REPL DUMP | Sub-task | Resolved | Sankar Hariappan
          52. Bootstrap REPL DUMP shouldn't fail when table is dropped after fetching the table names. | Sub-task | Closed | Sankar Hariappan
          53. repl invocations of distcp needs additional handling | Sub-task | Closed | Sushanth Sowmyan
          54. Bootstrap REPL DUMP shouldn't fail when a partition is dropped/renamed when dump in progress. | Sub-task | Closed | Sankar Hariappan
          55. make Task Dependency on Repl Load more intuitive | Sub-task | Closed | Anishek Agarwal
          56. REPL DUMP for insert event shouldn't fail if the table is already dropped. | Sub-task | Closed | Sankar Hariappan
          57. Support change management for rename table/partition. | Sub-task | Closed | Sankar Hariappan
          58. Ensure replication actions are idempotent if any series of events are applied again. | Sub-task | Closed | Sankar Hariappan
          59. Incremental REPL LOAD should load the events in the same sequence as it is dumped. | Sub-task | Closed | Sankar Hariappan
          60. Distcp optimization - One distcp per ReplCopyTask | Sub-task | Closed | Sankar Hariappan
          61. REPL LOAD should update last repl ID only after successful copy of data files. | Sub-task | Closed | Sankar Hariappan
          62. Ensure REPL DUMP and REPL LOAD are authorized properly | Sub-task | Closed | Sushanth Sowmyan
          63. Support replication of concatenate operation. | Sub-task | Closed | Sankar Hariappan
          64. Improve HS2 operation logs for REPL commands. | Sub-task | Closed | Sankar Hariappan
          65. Disable rename operations during bootstrap dump | Sub-task | Closed | Sankar Hariappan
          66. Long chain of tasks created by REPL LOAD shouldn't cause stack corruption. | Sub-task | Closed | Sankar Hariappan
          67. CM: ReplCopyTask should retain the original file names even if copied from CM path. | Sub-task | Closed | Daniel Dai
          68. Dynamic add partition by insert shouldn't generate INSERT event. | Sub-task | Closed | Sankar Hariappan
          69. EXPORT and IMPORT shouldn't perform distcp with doAs privileged user. | Sub-task | Closed | Sankar Hariappan
          70. REPL LOAD of ALTER_PARTITION event doesn't create import tasks if the partition doesn't exist during analyze phase. | Sub-task | Closed | Sankar Hariappan
          71. Bootstrap REPL DUMP throws exception if a partitioned table is dropped while reading partitions. | Sub-task | Closed | Sankar Hariappan
          72. Support replication for rename/move table across database | Sub-task | Closed | Sankar Hariappan
          73. REPL LOAD should overwrite the data files if exists instead of duplicating it | Sub-task | Closed | Sankar Hariappan
          74. Need to log bootstrap dump progress state property to HS2 logs. | Sub-task | Closed | Sankar Hariappan
          75. TestHCatClient should use hive.metastore.transactional.event.listeners as per recommendation. | Sub-task | Closed | Sankar Hariappan
          76. REPL LOAD need to use customised configurations to execute distcp/remote copy. | Sub-task | Closed | Sankar Hariappan
          77. Incremental REPL LOAD with Drop partition event on timestamp type partition column fails. | Sub-task | Closed | Sankar Hariappan
          78. "repl load" in bootstrap phase fails when partitions have whitespace | Sub-task | Closed | Thejas Nair
          79. Support replication for Alter Database operation. | Sub-task | Closed | Sankar Hariappan
          80. Data files deleted from temp table should not be recycled to CM path | Sub-task | Closed | mahesh kumar behera
          81. Replicate materialized views creation metadata with correct database name | Sub-task | Open | Unassigned
          82. Bootstrap REPL LOAD shall add tasks to create checkpoints for db/tables/partitions. | Sub-task | Closed | Sankar Hariappan
          83. Bootstrap REPL LOAD to use checkpoints to validate and skip the loaded data/metadata. | Sub-task | Closed | Sankar Hariappan
          84. Repl dump should not propagate the checkpoint and repl source properties | Sub-task | Closed | Sankar Hariappan
          85. Support replication of Materialized views | Sub-task | Open | Aasha Medhi

            People

              Assignee: Sushanth Sowmyan
              Reporter: Sushanth Sowmyan
              Votes: 2
              Watchers: 16

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m