Details
- Type: Sub-task
- Status: Resolved
- Priority: Minor
- Resolution: Won't Fix
Description
Some performance improvements that spring to mind after looking at the s3guard import command.
Key point: it could handle the import of a tree with existing data better.
- If the bucket is already under S3Guard, the listing will return every file already in the store, and each of these will be put() again.
- import calls putParentsIfNotPresent(), but DDBMetaStore.put() creates the parent entries anyway.
- For each entry in the store (i.e. a file), the full parent listing is built, then a batch write is issued to put all the parents plus the actual file.
As a result, it risks issuing many more put calls than needed, especially for wide or deep directory trees.
It would be much more efficient to put all the files in a single directory as part of one or more batch requests, with a single parent tree. Better yet: a get() of that parent could be used to skip the put of the parent entries entirely, as sketched below.
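To make the proposal concrete, here is a minimal sketch of the grouping and parent-skipping idea. All names here (SimpleStore, exists(), putBatch(), importFiles()) are hypothetical stand-ins for illustration, not the real MetadataStore/DynamoDBMetadataStore API; a real implementation would work with PathMetadata entries and would also have to split batches to stay under the store's batch-write size limit.
{code:java}
import java.util.*;

// Sketch only: group imported files by parent directory, write each group in
// one batch, and only create ancestor entries when a get()/exists() probe
// says they are missing. Types and signatures are hypothetical.
public class ImportBatchingSketch {

  /** Hypothetical stand-in for the metadata store. */
  interface SimpleStore {
    boolean exists(String path);          // e.g. a get() probe
    void putBatch(List<String> paths);    // one batched write
  }

  static String parentOf(String path) {
    int i = path.lastIndexOf('/');
    return i <= 0 ? "/" : path.substring(0, i);
  }

  static void importFiles(SimpleStore store, List<String> files) {
    // Group files by their immediate parent directory.
    Map<String, List<String>> byParent = new HashMap<>();
    for (String f : files) {
      byParent.computeIfAbsent(parentOf(f), k -> new ArrayList<>()).add(f);
    }
    for (Map.Entry<String, List<String>> e : byParent.entrySet()) {
      String dir = e.getKey();
      List<String> batch = new ArrayList<>();
      // Build the parent tree at most once per directory, and skip it
      // entirely if the directory entry is already in the store.
      if (!store.exists(dir)) {
        for (String p = dir; !p.equals("/"); p = parentOf(p)) {
          if (store.exists(p)) {
            break;                        // everything above is already there
          }
          batch.add(p);
        }
      }
      batch.addAll(e.getValue());         // the files themselves
      store.putBatch(batch);              // 1+ batch request per directory
    }
  }
}
{code}
Compared with the current per-file behaviour, this issues one parent-tree write per directory rather than per file, and the exists() probe avoids re-putting parent entries that are already present (e.g. when the bucket is already under S3Guard).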
Issue Links
- depends upon: HADOOP-15183 S3Guard store becomes inconsistent after partial failure of rename (Resolved)