  1. Hadoop Common
  2. HADOOP-16829 Über-jira: S3A Hadoop 3.3.1 features
  3. HADOOP-16756

distcp -update to S3A, abfs, etc. always overwrites due to block size mismatch

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.0.4, 3.2.2, 3.3.1
    • Component/s: fs/s3, tools/distcp
    • Labels: None

    Description

      DistCp to S3A always copies every source file, whether or not the file has changed. This contradicts the statement in the documentation below.

      http://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html

      "And to use -update to only copy changed files."
      

      Before copying, CopyMapper compares the block size as well as the file length. The file lengths match as expected, but the block sizes do not, apparently because S3A always reports a block size of 32 MB.

      https://github.com/apache/hadoop/blob/release-3.2.0-RC1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java#L348
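
The comparison described above can be modelled roughly as follows. This is a simplified sketch for illustration, not the actual CopyMapper source: the class UpdateSkipSketch, the FileMeta holder, and the canSkip method are hypothetical names, and the 128 MB source block size is an assumed value. Only the 32 MB figure for S3A comes from this report.

```java
// Simplified model of the skip decision; NOT the real CopyMapper code.
final class UpdateSkipSketch {

    /** Hypothetical stand-in for the metadata a filesystem reports for a file. */
    static final class FileMeta {
        final long length;
        final long blockSize;

        FileMeta(long length, long blockSize) {
            this.length = length;
            this.blockSize = blockSize;
        }
    }

    /** Returns true when an -update style copy would treat the target as up to date. */
    static boolean canSkip(FileMeta source, FileMeta target) {
        boolean sameLength = source.length == target.length;
        boolean sameBlockSize = source.blockSize == target.blockSize;
        // Requiring both to match is what defeats -update when the target
        // store reports a fixed block size unrelated to the source's.
        return sameLength && sameBlockSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        FileMeta source = new FileMeta(10 * mb, 128 * mb);      // assumed source block size
        FileMeta s3aTarget = new FileMeta(10 * mb, 32 * mb);    // 32 MB as reported for S3A
        FileMeta matchingTarget = new FileMeta(10 * mb, 128 * mb);

        System.out.println(canSkip(source, s3aTarget));      // false: file is copied again
        System.out.println(canSkip(source, matchingTarget)); // true: file is skipped
    }
}
```

Under these assumptions, an unchanged file is recopied whenever the two stores report different block sizes, even though the lengths are identical.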

      I suppose we should either update the documentation or change the code.

            People

              Assignee: Steve Loughran (stevel@apache.org)
              Reporter: Daisuke Kobayashi (daisuke.kobayashi)