Hive / HIVE-14841

Replication - Phase 2


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: repl
    • Labels: None

    Description

      Per the email sent out to the dev list, the current implementation of replication in Hive has certain drawbacks, for instance:

      • Replication follows a rubberbanding pattern: different tables/partitions can be in different or mixed states on the destination, so the destination is not an equivalent warehouse until all events have been caught up on. This satisfies only disaster recovery (DR) cases, not load-balancing use cases, and the secondary warehouse is seen as a backup rather than as a live warehouse that trails the primary.
      • The base implementation is naive and has several performance problems. These include a large amount of data duplication for subsequent events (as mentioned in HIVE-13348) and having to copy out entire partitions/tables when just a delta of files might be sufficient. Also, using EXPORT/IMPORT allows a simple implementation, but at the cost of a large amount of temporary space, much of which is not actually applied at the destination.

      Thus, we are now creating a new branch (repl2) and an uber-JIRA (this one) to track experimental development toward improving this situation.

      Sub-Tasks

        1. Bootstrap support for replv2 [Sub-task, Closed, Sushanth Sowmyan]
        2. Extend JSONMessageFactory to store additional information about metadata objects on different table events [Sub-task, Closed, Vaibhav Gumashta]
        3. Extend JSONMessageFactory to store additional information about Partition metadata objects on different partition events [Sub-task, Resolved, Vaibhav Gumashta]
        4. Create ReplDumpTask/ReplDumpWork for dumping out metadata [Sub-task, Resolved, Vaibhav Gumashta]
        5. Make changes to ReplicationSemanticAnalyzer to dump and load events stored in metastore [Sub-task, Resolved, Sushanth Sowmyan]
        6. Add junit test to test replication scenarios [Sub-task, Closed, Sushanth Sowmyan]
        7. Capture additional metadata to replicate a simple insert at destination [Sub-task, Closed, Vaibhav Gumashta]
        8. REPL LOAD & DUMP support for incremental CREATE_TABLE/ADD_PTN [Sub-task, Closed, Sushanth Sowmyan]
        9. Add a FetchTask to REPL DUMP plan for reading dump uri, last repl id as ResultSet [Sub-task, Closed, Vaibhav Gumashta]
        10. Add more specific error codes to ReplicationSemanticAnalyzer's SemanticExceptions [Sub-task, Resolved, Vaibhav Gumashta]
        11. Improve the pathname returned by ReplicationSemanticAnalyzer.getNextDumpDir [Sub-task, Open, Unassigned]
        12. Investigate TestHCatClientNotification#createTable test failure [Sub-task, Resolved, Sushanth Sowmyan]
        13. Enhance REPL dump bootstrap to write out notifications that occurred while bootstrap was generating initial dump (implementing ReplicationSemanticAnalyzer.consolidateEvent) [Sub-task, Open, Unassigned]
        14. Add new methods to MessageFactory API (corresponding to the ones added in JSONMessageFactory) [Sub-task, Closed, Sushanth Sowmyan]
        15. REPL LOAD & DUMP support for incremental INSERT events [Sub-task, Closed, Vaibhav Gumashta]
        16. Fix order guarantee of event executions for REPL LOAD [Sub-task, Closed, Sushanth Sowmyan]
        17. ChangeManager for replication [Sub-task, Closed, Daniel Dai]
        18. Capture additional metadata to replicate multi-table and dynamic partition inserts at destination [Sub-task, Resolved, Vaibhav Gumashta]
        19. REPL LOAD & DUMP support for incremental DROP_TABLE/DROP_PTN [Sub-task, Closed, Sushanth Sowmyan]
        20. Fix REPL DUMP/LOAD DROP_PTN so it works on non-string-ptn-key tables [Sub-task, Closed, Vaibhav Gumashta]
        21. Add file + checksum list for create table/partition during notification creation (whenever relevant) [Sub-task, Closed, Daniel Dai]
        22. Move notification filtering to metastore server side [Sub-task, Open, Unassigned]
        23. REPL LOAD & DUMP support for INSERT events with change management [Sub-task, Closed, Vaibhav Gumashta]
        24. REPL LOAD & DUMP support for incremental ALTER_TABLE/ALTER_PTN including renames [Sub-task, Closed, Sushanth Sowmyan]
        25. Hooking ChangeManager to "drop table", "drop partition" [Sub-task, Closed, Daniel Dai]
        26. Refactor/cleanup TestReplicationScenario [Sub-task, Resolved, Sushanth Sowmyan]
        27. Repl rename support adds unnecessary duplication for non-rename alters [Sub-task, Open, Sushanth Sowmyan]
        28. Update db/table repl.last.id at the end of REPL LOAD of a batch of events [Sub-task, Closed, Sushanth Sowmyan]
        29. Add versioning/format mechanism to NOTIFICATION_LOG entries, expand MESSAGE size [Sub-task, Closed, Sushanth Sowmyan]
        30. Replicate functions [Sub-task, Resolved, Vaibhav Gumashta]
        31. Replicate views [Sub-task, Resolved, Sankar Hariappan]
        32. Using ChangeManager to copy files in ReplCopyTask [Sub-task, Resolved, Daniel Dai]
        33. Replicate Insert Overwrites, Dynamic Partition Inserts and Loads [Sub-task, Closed, Sankar Hariappan]
        34. Optimize(reduce) the number of alter calls made to fix repl.last.id [Sub-task, Patch Available, Sushanth Sowmyan]
        35. change REPL DUMP syntax to use "LIMIT" instead of "BATCH" keyword [Sub-task, Closed, Sushanth Sowmyan]
        36. Event replication for constraints [Sub-task, Closed, Daniel Dai; Original Estimate: Not Specified; Time Spent: 20m]
        37. Bootstrap replication for constraint [Sub-task, Open, Daniel Dai]
        38. Hive REPL STATUS is not returning last event ID [Sub-task, Resolved, Sankar Hariappan]
        39. Incremental REPL LOAD Inserts doesn't operate on the target database if name differs from source database. [Sub-task, Resolved, Sankar Hariappan]
        40. Support replication of truncate table [Sub-task, Closed, Sankar Hariappan]
        41. REPL DUMP shows last event ID of the database even if we use LIMIT option. [Sub-task, Closed, Sankar Hariappan]
        42. Incremental insert into a partitioned table doesn't get replicated. [Sub-task, Closed, Sankar Hariappan]
        43. Replicate views with proper query string when perform REPL LOAD on a renamed database. [Sub-task, Open, Aasha Medhi]
        44. Table level REPL LOAD doesn't return a valid dump path. [Sub-task, Resolved, Sankar Hariappan]
        45. Test and support replication of exchange partition [Sub-task, Closed, Sankar Hariappan]
        46. Hook Change Manager to Truncate Table. [Sub-task, Resolved, Sankar Hariappan]
        47. Support replicating into existing db if the db is empty [Sub-task, Closed, Sankar Hariappan]
        48. Add HS2 operation logs and improve logs for REPL commands [Sub-task, Closed, Sankar Hariappan]
        49. New Events created as part of replv2 potentially break replv1 [Sub-task, Closed, Sushanth Sowmyan]
        50. Hook Change Manager to Insert Overwrite [Sub-task, Closed, Sankar Hariappan]
        51. Enable concurrent RENAME during bootstrap REPL DUMP [Sub-task, Resolved, Sankar Hariappan]
        52. Bootstrap REPL DUMP shouldn't fail when table is dropped after fetching the table names. [Sub-task, Closed, Sankar Hariappan]
        53. repl invocations of distcp needs additional handling [Sub-task, Closed, Sushanth Sowmyan]
        54. Bootstrap REPL DUMP shouldn't fail when a partition is dropped/renamed when dump in progress. [Sub-task, Closed, Sankar Hariappan]
        55. make Task Dependency on Repl Load more intuitive [Sub-task, Closed, Anishek Agarwal]
        56. REPL DUMP for insert event should't fail if the table is already dropped. [Sub-task, Closed, Sankar Hariappan]
        57. Support change management for rename table/partition. [Sub-task, Closed, Sankar Hariappan]
        58. Ensure replication actions are idempotent if any series of events are applied again. [Sub-task, Closed, Sankar Hariappan]
        59. Incremental REPL LOAD should load the events in the same sequence as it is dumped. [Sub-task, Closed, Sankar Hariappan]
        60. Distcp optimization - One distcp per ReplCopyTask [Sub-task, Closed, Sankar Hariappan]
        61. REPL LOAD should update last repl ID only after successful copy of data files. [Sub-task, Closed, Sankar Hariappan]
        62. Ensure REPL DUMP and REPL LOAD are authorized properly [Sub-task, Closed, Sushanth Sowmyan]
        63. Support replication of concatenate operation. [Sub-task, Closed, Sankar Hariappan]
        64. Improve HS2 operation logs for REPL commands. [Sub-task, Closed, Sankar Hariappan]
        65. Disable rename operations during bootstrap dump [Sub-task, Closed, Sankar Hariappan]
        66. Long chain of tasks created by REPL LOAD shouldn't cause stack corruption. [Sub-task, Closed, Sankar Hariappan]
        67. CM: ReplCopyTask should retain the original file names even if copied from CM path. [Sub-task, Closed, Daniel Dai]
        68. Dynamic add partition by insert shouldn't generate INSERT event. [Sub-task, Closed, Sankar Hariappan]
        69. EXPORT and IMPORT shouldn't perform distcp with doAs privileged user. [Sub-task, Closed, Sankar Hariappan]
        70. REPL LOAD of ALTER_PARTITION event doesn't create import tasks if the partition doesn't exist during analyze phase. [Sub-task, Closed, Sankar Hariappan]
        71. Bootstrap REPL DUMP throws exception if a partitioned table is dropped while reading partitions. [Sub-task, Closed, Sankar Hariappan]
        72. Support replication for rename/move table across database [Sub-task, Closed, Sankar Hariappan]
        73. REPL LOAD should overwrite the data files if exists instead of duplicating it [Sub-task, Closed, Sankar Hariappan]
        74. Need to log bootstrap dump progress state property to HS2 logs. [Sub-task, Closed, Sankar Hariappan]
        75. TestHCatClient should use hive.metastore.transactional.event.listeners as per recommendation. [Sub-task, Closed, Sankar Hariappan]
        76. REPL LOAD need to use customised configurations to execute distcp/remote copy. [Sub-task, Closed, Sankar Hariappan]
        77. Incremental REPL LOAD with Drop partition event on timestamp type partition column fails. [Sub-task, Closed, Sankar Hariappan]
        78. "repl load" in bootstrap phase fails when partitions have whitespace [Sub-task, Closed, Thejas Nair]
        79. Support replication for Alter Database operation. [Sub-task, Closed, Sankar Hariappan]
        80. Data files deleted from temp table should not be recycled to CM path [Sub-task, Closed, mahesh kumar behera]
        81. Replicate materialized views creation metadata with correct database name [Sub-task, Open, Unassigned]
        82. Bootstrap REPL LOAD shall add tasks to create checkpoints for db/tables/partitions. [Sub-task, Closed, Sankar Hariappan]
        83. Bootstrap REPL LOAD to use checkpoints to validate and skip the loaded data/metadata. [Sub-task, Closed, Sankar Hariappan]
        84. Repl dump should not propagate the checkpoint and repl source properties [Sub-task, Closed, Sankar Hariappan]
        85. Support replication of Materialized views [Sub-task, Open, Aasha Medhi]


          People

            Assignee: Sushanth Sowmyan
            Reporter: Sushanth Sowmyan

            Dates

              Created:
              Updated:

              Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 20m
