Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: None

    Description

      During bulk import, inspecting files to determine where they go is expensive and slow.  To spread this cost, Accumulo has an internal mechanism that distributes the file inspection work to random tablet servers.  Because this internal process takes time and consumes resources on the cluster, users want control over it.  The best way to give this control may be to externalize it by allowing bulk imports to have a mapping file that specifies the ranges where files should be loaded.  If Accumulo provided an API to help produce this file, then that work could be done in MapReduce or Spark.  This would give users all the control they want over when and where this computation is done, and it would fit naturally into the process used to create the bulk files.

      To make bulk import fast, the mapping file should have the following properties (a small example follows the list).

      • The key in the file is a range.
      • The value in the file is a list of files.
      • Ranges are non-overlapping.
      • The file is sorted by range/key.
      • There is a mapping for every non-empty file in the bulk import directory.
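
      For illustration, a few entries of such a file might look like the following sketch; the file names and row values are hypothetical, and the (exclusive start, inclusive end] notation follows Accumulo's Range semantics.

        (-inf, row_0100] -> {f1.rf, f2.rf}
        (row_0100, row_0200] -> {f2.rf, f3.rf}
        (row_0200, +inf) -> {f3.rf}

      Note that the keys are sorted and non-overlapping, and a file whose data spans a boundary (f2.rf here) appears under multiple ranges.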

      If Accumulo provides APIs to do the following operations, then producing the mapping file could be written as a MapReduce job (a sketch of such an API follows the list).

      • For a given rfile, produce a list of row ranges where the file should be loaded.  These row ranges would be based on tablets.
      • Merge (row range, list of files) pairs.
      • Serialize (row range, list of files) pairs.
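
      A minimal sketch of what this API might look like, written in Java; the interface and method names are assumptions for illustration, not an existing Accumulo API (Range and Text are the existing Accumulo and Hadoop classes).

        import java.io.DataOutput;
        import java.io.IOException;
        import java.util.Set;
        import java.util.SortedMap;
        import java.util.SortedSet;

        import org.apache.accumulo.core.data.Range;
        import org.apache.hadoop.io.Text;

        public interface BulkMappingFunctions {

          // For a given rfile, compute the row ranges (derived from the
          // table's split points) where the file's data would land.
          SortedSet<Range> computeLoadRanges(String rfilePath,
              SortedSet<Text> tableSplits) throws IOException;

          // Merge two (row range -> list of files) maps, keeping the result
          // sorted and non-overlapping; usable as a reduce operation.
          SortedMap<Range,Set<String>> merge(SortedMap<Range,Set<String>> a,
              SortedMap<Range,Set<String>> b);

          // Write the (row range -> list of files) pairs in sorted order,
          // producing the final mapping file.
          void serialize(SortedMap<Range,Set<String>> mapping, DataOutput out)
              throws IOException;
        }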

      With a mapping file, the bulk import algorithm could be written as follows (sketched in code after the list).  This could all be executed in the master, with no need to run inspection tasks on random tablet servers.

      • Sanity check the mapping file:
        • Ensure it is in sorted order.
        • Ensure ranges are non-overlapping.
        • Ensure each file in the directory has at least one entry in the mapping file.
        • Ensure all splits in the mapping file exist in the table.
      • Since the file is sorted, do a merged read of the mapping file and the metadata table, looping over the following operations for each tablet until all files are loaded:
        • Read the loaded files for the tablet.
        • Read the files to load for the range.
        • For any files not yet loaded, send an async load message to the tablet server.
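
      A rough sketch of that master-side loop, assuming hypothetical helper methods for the metadata table read and the async load RPC (TabletId and Range are existing Accumulo classes).

        import java.util.HashSet;
        import java.util.Map;
        import java.util.Set;
        import java.util.SortedMap;

        import org.apache.accumulo.core.data.Range;
        import org.apache.accumulo.core.data.TabletId;

        public abstract class BulkLoadDriver {

          // Hypothetical helpers standing in for the metadata table scan and
          // the async load RPC; neither is an existing Accumulo API.
          protected abstract Iterable<TabletId> tabletsOverlapping(Range range);
          protected abstract Set<String> loadedFiles(TabletId tablet);
          protected abstract void sendAsyncLoad(TabletId tablet, Set<String> files);

          // Keep scanning and sending async load messages until every file in
          // the mapping has been loaded.  A real implementation would read the
          // sorted mapping file and the metadata table in lockstep (a merged
          // read); this sketch iterates the mapping directly for clarity.
          public void load(SortedMap<Range,Set<String>> mapping) {
            boolean allLoaded;
            do {
              allLoaded = true;
              for (Map.Entry<Range,Set<String>> entry : mapping.entrySet()) {
                for (TabletId tablet : tabletsOverlapping(entry.getKey())) {
                  Set<String> missing = new HashSet<>(entry.getValue());
                  missing.removeAll(loadedFiles(tablet));
                  if (!missing.isEmpty()) {
                    sendAsyncLoad(tablet, missing); // async: do not block here
                    allLoaded = false;
                  }
                }
              }
            } while (!allLoaded);
          }
        }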

      The above algorithm can just keep scanning the metadata table and sending async load messages until the bulk import is complete.  Since the load messages are async, the bulk load of a large number of files could potentially be very fast.

      The bulk load operation can easily handle tablets splitting during the operation by matching a single range in the file to multiple tablets.  However, attempting to handle merges would be much trickier.  It would probably be simplest to fail the operation if a merge is detected, and this can be done in a very clean way.  Once the bulk import operation has the table lock, merges cannot happen.  So after getting the table lock, the bulk import operation can ensure all splits in the mapping file exist in the table and abort before doing any work if that condition is not met.  A missing split indicates a merge happened between generating the mapping file and doing the bulk import.
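
      A sketch of that precondition check, assuming the mapping file's range boundaries and the table's current splits are both available as sorted sets of rows; the class and method names here are illustrative.

        import java.util.SortedSet;

        import org.apache.hadoop.io.Text;

        public final class MappingSanityCheck {

          private MappingSanityCheck() {}

          // mappingSplits: the range boundaries taken from the mapping file.
          // tableSplits: the table's current splits, read after taking the
          // table lock, so no merge can run while the check and load proceed.
          public static void ensureSplitsExist(SortedSet<Text> mappingSplits,
              SortedSet<Text> tableSplits) {
            if (!tableSplits.containsAll(mappingSplits)) {
              throw new IllegalStateException("mapping file contains splits not"
                  + " present in the table; a merge likely occurred, aborting"
                  + " bulk import");
            }
          }
        }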

      Hopefully the mapping file, plus the algorithm that sends async load messages, can dramatically speed up bulk import operations.  This may lessen the need for other things like prioritizing bulk imports.  To measure this, it would be very useful to create a bulk import performance test that creates many files with very little data and measures the time it takes to load them.

    People

      Assignee: Unassigned
      Reporter: Keith Turner (kturner)
      Votes: 0
      Watchers: 5


    Time Tracking

      Original Estimate: Not Specified
      Remaining Estimate: 0h
      Time Spent: 5h 50m