Apache Hudi / HUDI-242

[RFC-12] Support Efficient bootstrap of large parquet datasets to Hudi

Details

    Description

       Support efficient bootstrap of large Parquet tables.

      Sub-tasks

          1. Basic Implementation for verifying if bootstrapping works end to end (Sub-task, Closed, Balaji Varadarajan)
             Progress: 100%. Original Estimate: Not Specified. Time Spent: 10m.
          2. Refactor HoodieWriteClient so that commit logic can be shareable by both bootstrap and normal write operations (Sub-task, Closed, Balaji Varadarajan)
             Progress: 100%. Original Estimate: Not Specified. Time Spent: 20m.
          3. Bootstrap Index - Implementation (Sub-task, Closed, Balaji Varadarajan)
             Progress: 100%. Original Estimate: Not Specified. Time Spent: 10m.
          4. Cleanup bootstrap code and create PR for FileSystemView changes (Sub-task, Closed, Balaji Varadarajan)
             Progress: 100%. Original Estimate: Not Specified. Time Spent: 240h.
          5. Cleanup bootstrap code and create write APIs for supporting bootstrap (Sub-task, Closed, Balaji Varadarajan)
             Progress: 100%. Original Estimate: Not Specified. Time Spent: 96h.
          6. Implement Hive Query Side Integration for querying tables containing bootstrap file slices (Sub-task, Closed, Balaji Varadarajan)
             Progress: 100%. Original Estimate: Not Specified. Time Spent: 336h.
          7. Implement support for bootstrapping in HoodieDeltaStreamer (Sub-task, Closed, Balaji Varadarajan)
             Progress: 100%. Original Estimate: Not Specified. Time Spent: 168h.
          8. Implement Spark DataSource Support for querying bootstrapped tables (Sub-task, Resolved, Udit Mehrotra)
             Progress: 100%. Original Estimate: Not Specified. Time Spent: 10m.
          9. Automated end to end Integration Test (Sub-task, Closed, Balaji Varadarajan)
             Progress: 100%. Original Estimate: Not Specified. Time Spent: 72h.
          10. Implement upsert functionality for handling updates to these bootstrap file slices (Sub-task, Closed, Balaji Varadarajan)
              Progress: 100%. Original Estimate: Not Specified. Time Spent: 168h.
          11. Implement CLI support for performing bootstrap (Sub-task, Resolved, Wenning Ding)
              Progress: 100%. Original Estimate: Not Specified. Time Spent: 168h.
          12. Long Running Testing to certify Bootstrapping (Sub-task, Resolved, Balaji Varadarajan)
              Progress: 100%. Original Estimate: Not Specified. Time Spent: 168h.
          13. Hive Sync Integration of bootstrapped table (Sub-task, Closed, Udit Mehrotra)
              Progress: 100%. Original Estimate: Not Specified. Time Spent: 72h.
          14. Implement support for bootstrapping via Spark datasource API (Sub-task, Closed, Udit Mehrotra)
              Progress: 100%. Original Estimate: Not Specified. Time Spent: 336h.
          15. Spark DS Support for incremental queries for bootstrapped tables (Sub-task, Closed, Udit Mehrotra)
              Progress: 100%. Original Estimate: Not Specified. Time Spent: 120h.
          16. Add a knob to change partition-path style while performing metadata bootstrap (Sub-task, Closed, Balaji Varadarajan)
              Progress: 100%. Original Estimate: Not Specified. Time Spent: 24h.
          17. Metadata Bootstrap Key Generator needs to handle complex keys correctly (Sub-task, Closed, Balaji Varadarajan)
              Progress: 100%. Original Estimate: Not Specified. Time Spent: 24h.
          18. Metadata Bootstrap Query Testing Master Ticket (Sub-task, Closed, Balaji Varadarajan)
          19. Test COW : Hive Read Optimized Query with metadata bootstrap (Sub-task, Resolved, Balaji Varadarajan)
          20. Test MOR : Hive Read Optimized Query with metadata bootstrap (Sub-task, Resolved, Balaji Varadarajan)
          21. Test MOR : Spark SQL Read Optimized Query with metadata bootstrap (Sub-task, Resolved, Balaji Varadarajan)
          22. Test MOR : Spark SQL Realtime Query with metadata bootstrap (Sub-task, Resolved, Balaji Varadarajan)
          23. Test MOR : Hive Realtime Query with metadata bootstrap (Sub-task, Resolved, Wenning Ding)
              Progress: 100%. Original Estimate: Not Specified. Time Spent: 72h.
          24. Test COW : Spark SQL Read Optimized Query with metadata bootstrap (Sub-task, Resolved, Wenning Ding)
              Progress: 100%. Original Estimate: Not Specified. Time Spent: 72h.
          25. Test COW : Spark Data Source Read Optimized Queries (Sub-task, Resolved, Udit Mehrotra)
          26. Web documentation for explaining how to bootstrap (Sub-task, Closed, Balaji Varadarajan)
          27. Open Questions before merging Bootstrap (Sub-task, Resolved, Balaji Varadarajan)
          28. Support for cleaning source data (Sub-task, Resolved, Wenning Ding)
          29. Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name (Sub-task, Closed, Wenning Ding)
          30. Bootstrap Implementation Bugs (Sub-task, Resolved, Unassigned)
          31. Parallelize listing of Source dataset partitions (Sub-task, Resolved, Udit Mehrotra)
          32. Address performance issues with finalizing writes on S3 (Sub-task, Closed, Udit Mehrotra)
          33. Separate out Spark and Path detection utilities used in Bootstrap datasource work (Sub-task, Closed, Udit Mehrotra)
          34. Hudi changes for bootstrapped tables integration with Presto (Sub-task, Resolved, Udit Mehrotra)

            People

              vinoth Vinoth Chandar
              vbalaji Balaji Varadarajan
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2,136h 50m