Description
If you run distcp -update against an adl or wasb store repeatedly, all the source files are copied up every time. In contrast, if you use hdfs:// or s3a:// as the destination, only the new files are uploaded. hdfs uses checksums for the diff, whereas s3a just returns the file length and relies on the distcp logic of "if either src or dest doesn't do checksums, only compare file len".
Somehow that's not kicking in for adl and wasb. Tested with file: and hdfs sources, and with wasb and adl destinations.
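To make the expected behaviour concrete, here is a minimal sketch of the skip decision described above. It is a simplified model and not the actual DistCp code: the class `UpdateSkipSketch`, the `FileMeta` type, and the `compareBlockSize` flag are all hypothetical names introduced for illustration. The flag models an additional block-size comparison, which the linked HADOOP-16756 title names as the reason files are always overwritten.

```java
public class UpdateSkipSketch {

    /** Hypothetical, simplified view of the metadata a store reports for one file. */
    static final class FileMeta {
        final long length;
        final long blockSize;
        final String checksum; // null when the store exposes no checksum

        FileMeta(long length, long blockSize, String checksum) {
            this.length = length;
            this.blockSize = blockSize;
            this.checksum = checksum;
        }
    }

    /**
     * Returns true when an -update style copy could skip the file.
     * compareBlockSize is an assumption of this sketch: it models an
     * extra block-size check performed before checksums are considered.
     */
    static boolean canSkip(FileMeta src, FileMeta dest, boolean compareBlockSize) {
        if (src.length != dest.length) {
            return false; // different lengths always force a copy
        }
        if (compareBlockSize && src.blockSize != dest.blockSize) {
            return false; // stores reporting different block sizes never match
        }
        if (src.checksum == null || dest.checksum == null) {
            return true; // either side has no checksum: length alone decides
        }
        return src.checksum.equals(dest.checksum);
    }

    public static void main(String[] args) {
        // Illustrative values only: same length, different reported block sizes,
        // and a destination that exposes no checksum.
        FileMeta source = new FileMeta(1024, 128L << 20, "abc");
        FileMeta dest = new FileMeta(1024, 512L << 20, null);

        System.out.println(canSkip(source, dest, false)); // true: lengths match, no dest checksum
        System.out.println(canSkip(source, dest, true));  // false: block sizes differ
    }
}
```

In this model, a destination whose reported block size differs from the source's is re-uploaded on every run even though the length-only fallback would otherwise let it be skipped.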
Issue Links
- duplicates: HADOOP-16756 distcp -update to S3A; abfs, etc always overwrites due to block size mismatch (Resolved)
- is depended upon by: HADOOP-15788 Improve Distcp for long-haul/cloud deployments (Open)