  1. Hadoop Common
  2. HADOOP-17566 Über-jira: S3A Hadoop 3.3.2 features
  3. HADOOP-16189

S3A copy/rename of large files to be parallelized as a multipart operation


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.3.2
    • Component/s: fs/s3
    • Labels: None

    Description

      The AWS docs on copying say:

      • file < 5GB: can be copied as a single operation
      • file > 5GB: you MUST use the multipart API.

      But even for files < 5GB, a single copy is a really slow operation. And if HADOOP-16188 is to be believed, there's not enough retrying.
      Even if the transfer manager does switch to multipart copies at some size, we can split file copies into blocks ourselves, just as we already do our writes in 32-64 MB blocks. Something like:

      l = len(src)
      if l < fs.s3a.block.size:
        single copy
      else:
        split file by blocks, initiate the upload, then execute each block copy as an operation in the S3A thread pool; once all are done, complete the operation.
      

      + retry individual block copies, so that the failure of one block doesn't force a retry of the whole upload.
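      The proposal above can be sketched roughly as follows. This is not S3A code: it is a minimal Python illustration (chosen because the pseudocode is Python-like), and single_copy, initiate, copy_part, complete and abort are hypothetical callables standing in for the corresponding S3 requests. The retry count and thread count are assumptions too.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_ranges(length, block_size):
    """Return inclusive (first_byte, last_byte) ranges covering the file."""
    return [(start, min(start + block_size, length) - 1)
            for start in range(0, length, block_size)]


def copy_with_retries(copy_part, part_number, byte_range, attempts):
    """Retry one block copy on its own; other blocks are unaffected."""
    for attempt in range(1, attempts + 1):
        try:
            return copy_part(part_number, byte_range)
        except IOError:
            if attempt == attempts:
                raise  # out of attempts: let the caller abort the upload


def parallel_copy(length, block_size, single_copy, initiate, copy_part,
                  complete, abort, attempts=3, threads=4):
    """Copy in one request if small, else as parallel block copies.

    All callables are hypothetical stand-ins for the real S3 operations.
    """
    if length < block_size:
        return single_copy()
    upload_id = initiate()
    ranges = split_into_ranges(length, block_size)
    try:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            futures = [
                pool.submit(copy_with_retries, copy_part, n, r, attempts)
                for n, r in enumerate(ranges, start=1)  # parts start at 1
            ]
            # Collect results in part order; re-raises any block failure.
            parts = [f.result() for f in futures]
    except Exception:
        abort(upload_id)  # do not leave a pending multipart upload behind
        raise
    return complete(upload_id, parts)
```

      In a real implementation, copy_part would presumably issue a ranged part-copy request for its byte range and return whatever the completion call needs, such as the part's ETag.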

      This is potentially more complex than it sounds, as we would need to:

      • track the state of the ongoing copy operation
      • handle failures (abort the upload, etc.)
      • use the if-modified/version headers to fail fast if the source file changes partway through the copy
      • use a bigger block size if len(file)/fs.s3a.block.size > max-block-count
      • maybe fall back to the classic single-request copy
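      The block-size adjustment in the list above could look something like this. It is a sketch under assumptions: the function names are hypothetical, and the default of 10,000 is the S3 limit on parts per multipart upload, used here as a stand-in for max-block-count.

```python
MAX_PARTS = 10_000  # S3 limit on parts per multipart upload


def part_count(length, block_size):
    """Number of blocks needed to cover the file (ceiling division)."""
    return -(-length // block_size)


def effective_block_size(length, block_size, max_parts=MAX_PARTS):
    """Keep the configured block size unless it would need too many parts.

    If it would, grow the block size just enough to fit within max_parts.
    """
    if part_count(length, block_size) <= max_parts:
        return block_size
    return -(-length // max_parts)
```

      For example, a 1 TiB source with a 32 MiB block size would need 32,768 parts, so the block size has to grow before the copy can be planned.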

      Overall, what sounds simple could get complex fast, or at least turn into a bigger piece of code. It needs a PoC demonstrating the speedup before we attempt it.

            People

              Assignee: Unassigned
              Reporter: Steve Loughran (stevel@apache.org)