  1. Apache Hudi
  2. HUDI-242 [RFC-12] Support Efficient bootstrap of large parquet datasets to Hudi
  3. HUDI-419

Basic Implementation for verifying if bootstrapping works end to end


Details

    Description

      As part of prototyping, I have most of the core functionality implemented in:

      https://github.com/bvaradar/hudi/tree/vb_bootstrap

       

      This includes:

      1. Timeline and FileSystem View changes
      2. New Bootstrap Client to perform Bootstrap
      3. DeltaStreamer Integration
      4. Hive Parquet Read Optimized reader integration

       

      Needs to be done:

      1. Merge Handle changes to support upserts over a bootstrap file slice (the read part is functionally similar to (4) above, and the write part is the same as the current Hoodie MergeHandle).
      2. Unit Testing 
      3. Code cleanup, as the current implementation contains duplicated code.
      4. Automated integration test
      5. Hoodie CLI and Spark DataSource Write integration
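
      To make the pending Merge Handle item more concrete, here is a minimal conceptual sketch in Python. It is not Hudi code. It assumes, beyond what this ticket states, that a bootstrap file slice pairs a metadata-only file with the original parquet file and that the two are stitched row by row when read. All names (`read_bootstrap_slice`, `upsert_over_bootstrap`, `key_field`) are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]


def read_bootstrap_slice(metadata_rows: List[Record],
                         source_rows: List[Record]) -> Iterator[Record]:
    """Read side (assumption): stitch metadata and source rows by position."""
    if len(metadata_rows) != len(source_rows):
        raise ValueError("metadata and source files must have the same row count")
    for meta, data in zip(metadata_rows, source_rows):
        # Each full record is the union of its metadata and data columns.
        yield {**meta, **data}


def upsert_over_bootstrap(metadata_rows: List[Record],
                          source_rows: List[Record],
                          incoming: Iterable[Record],
                          key_field: str = "_key") -> List[Record]:
    """Write side: merge incoming records by key, as a regular merge would."""
    # Later incoming records with the same key win.
    updates = {rec[key_field]: rec for rec in incoming}
    merged: List[Record] = []
    for existing in read_bootstrap_slice(metadata_rows, source_rows):
        replacement = updates.pop(existing[key_field], None)
        merged.append(replacement if replacement is not None else existing)
    # Keys never seen in the existing slice are treated as inserts.
    merged.extend(updates.values())
    return merged
```

      The output of the merge is a list of full records, which reflects the idea that after an upsert the file slice no longer depends on the stitched read path. Whether the real implementation behaves this way is for the actual Merge Handle change to decide.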

            People

              vbalaji Balaji Varadarajan
