Hadoop Common / HADOOP-18028

High performance S3A input stream with prefetching & caching


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • 3.3.9
    • Component/s: fs/s3

    Description

      I work at Pinterest, where I developed a technique that substantially improves read throughput from the S3 file system. It helps the sequential read case (such as reading a SequenceFile) and also improves read throughput in the random access case (such as reading Parquet). The technique has significantly improved the efficiency of data processing jobs at Pinterest.
       
      I would like to contribute this feature to Apache Hadoop. More details on the technique are available in a blog post I wrote recently:
      https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0
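
For readers unfamiliar with the general idea behind "prefetching & caching", here is a minimal sketch. It is not the S3A implementation and not the code from the blog post: it assumes fixed-size blocks, an in-memory cache that never evicts, and a single reading thread. The names `PrefetchingReader`, `fetch_range`, `block_size`, and `prefetch_count` are hypothetical, chosen only for this illustration.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class PrefetchingReader:
    """Illustrative block reader: fetches upcoming blocks in the background.

    Hypothetical sketch only. `fetch_range(start, end)` stands in for a
    ranged GET against a remote object and must return bytes[start:end].
    """

    def __init__(self, fetch_range: Callable[[int, int], bytes], size: int,
                 block_size: int = 8, prefetch_count: int = 2,
                 workers: int = 4) -> None:
        self._fetch_range = fetch_range
        self._size = size
        self._block_size = block_size
        self._prefetch_count = prefetch_count
        self._pool = ThreadPoolExecutor(max_workers=workers)
        # Cache of block index -> future holding that block's bytes.
        self._blocks: Dict[int, Future] = {}

    def _request(self, index: int) -> Future:
        # Submit a fetch only if this block was never requested before.
        if index not in self._blocks:
            start = index * self._block_size
            end = min(start + self._block_size, self._size)
            self._blocks[index] = self._pool.submit(self._fetch_range, start, end)
        return self._blocks[index]

    def read(self, position: int, length: int) -> bytes:
        """Return up to `length` bytes starting at non-negative `position`."""
        end = min(position + length, self._size)
        if position >= end:
            return b""
        first = position // self._block_size
        last = (end - 1) // self._block_size
        last_block = (self._size - 1) // self._block_size
        # Schedule the needed blocks plus a few blocks ahead of them.
        for i in range(first, min(last + self._prefetch_count, last_block) + 1):
            self._request(i)
        out = bytearray()
        for i in range(first, last + 1):
            data = self._blocks[i].result()  # waits only if still in flight
            block_start = i * self._block_size
            lo = max(position, block_start) - block_start
            hi = min(end, block_start + self._block_size) - block_start
            out += data[lo:hi]
        return bytes(out)

    def close(self) -> None:
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    payload = bytes(range(100))
    reader = PrefetchingReader(lambda s, e: payload[s:e], len(payload),
                               block_size=10, prefetch_count=2)
    print(reader.read(5, 10) == payload[5:15])
    reader.close()
```

In this sketch a sequential reader finds the next blocks already in flight or complete, and a random-access reader that returns to an earlier offset is served from the cache instead of issuing another remote request. A real implementation would also need bounded memory, eviction, and retry handling.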
       

        Issue Links

        No. | Summary | Type | Status | Assignee | Time Spent
        1. | test failures with prefetching s3a input stream | Sub-task | Resolved | Monthon Klongklaew | 2h 50m
        2. | s3a prefetching stream to move off twitter FuturePool | Sub-task | Resolved | Unassigned |
        3. | document use and architecture design of prefetching s3a input stream | Sub-task | Resolved | Ahmar Suhail | 2h 40m
        4. | Remove use of scala jar twitter util-core with java futures in S3A prefetching stream | Sub-task | Resolved | PJ Fanning | 9h
        5. | move org.apache.hadoop.fs.common package into hadoop-common module | Sub-task | Resolved | Steve Loughran | 1h 20m
        6. | S3File to store reference to active S3Object in a field. | Sub-task | Resolved | Bhalchandra Pandit |
        7. | s3a prefetching stream to support unbuffer() | Sub-task | In Progress | Steve Loughran |
        8. | tune logging of prefetch problems | Sub-task | Open | Unassigned |
        9. | s3a prefetching to use SemaphoredDelegatingExecutor for submitting work | Sub-task | Resolved | Viraj Jasani |
        10. | Convert s3a prefetching to use JavaDoc for fields and enums | Sub-task | Resolved | Steve Loughran |
        11. | S3PrefetchingInputStream to support status probes when closed | Sub-task | Resolved | Viraj Jasani |
        12. | Collect IOStatistics during S3A prefetching | Sub-task | Resolved | Ahmar Suhail | 4h 10m
        13. | Ensure S3A prefetching stream memory consumption scales | Sub-task | Open | Unassigned |
        14. | stream warns Not all bytes were read from the S3ObjectInputStream when closed | Sub-task | Resolved | Ahmar Suhail | 1h 40m
        15. | Use async drain threshold to decide b/w async and sync draining | Sub-task | Resolved | Ahmar Suhail |
        16. | tests in ITestS3AInputStreamPerformance are failing | Sub-task | Resolved | Ahmar Suhail | 6h
        17. | Rebase s3a prefetching feature branch on top of trunk | Sub-task | Resolved | Ahmar Suhail |
        18. | Remove lower limit on s3a prefetching/caching block size | Sub-task | Resolved | Ankit Saurabh |
        19. | Tests in ITestS3AOpenCost are failing | Sub-task | Resolved | Ahmar Suhail |
        20. | Add in configuration option to enable prefetching | Sub-task | Resolved | Ahmar Suhail | 1h 10m
        21. | Review s3a prefetching input stream retry code; synchronization | Sub-task | Open | Unassigned |
        22. | S3A prefetch - Implement LRU cache for SingleFilePerBlockCache | Sub-task | Resolved | Viraj Jasani |
        23. | Update class names to be clear they belong to S3A prefetching | Sub-task | Resolved | Unassigned |
        24. | S3A prefetching: Error logging during reads | Sub-task | Resolved | Ankit Saurabh |
        25. | hadoop-aws maven build to add a prefetch profile to run all tests with prefetching | Sub-task | Resolved | Viraj Jasani |
        26. | Implement readFully(long position, byte[] buffer, int offset, int length) | Sub-task | Resolved | Alessandro Passaro |
        27. | rebase feature/HADOOP-18028-s3a-prefetch to trunk | Sub-task | Resolved | Steve Loughran | 1h
        28. | fs.s3a.prefetch.block.size to be read through longBytesOption | Sub-task | Resolved | Viraj Jasani |
        29. | ITestS3AFileSystemStatistic failure in prefetch feature branch | Sub-task | Open | Samrat Deb |
        30. | ITestS3ACannedACLs failure; not in a span | Sub-task | Resolved | Ashutosh Gupta |
        31. | S3A Prefetch - SingleFilePerBlockCache to use LocalDirAllocator | Sub-task | Resolved | Viraj Jasani |
        32. | s3a prefetching Executor should be closed | Sub-task | Resolved | Viraj Jasani |
        33. | assertion failure in ITestS3APrefetchingInputStream | Sub-task | Resolved | Ashutosh Gupta |
        34. | Fix transient failure of ITestS3APrefetchingInputStream#testRandomReadLargeFile | Sub-task | Resolved | Viraj Jasani |
        35. | Backport S3A prefetching stream to branch-3.3 | Sub-task | Resolved | Steve Loughran |
        36. | s3a prefetch cache blocks should be accessed by RW locks | Sub-task | Resolved | Viraj Jasani |
        37. | CachingBlockManager to use AtomicBoolean for closed flag | Sub-task | Resolved | Viraj Jasani |
        38. | S3A prefetching: switch to prefetching for chosen read policies | Sub-task | Open | Unassigned |
        39. | s3a prefetching to use split start/end options to limit prefetch range | Sub-task | In Progress | Steve Loughran |
        40. | s3a large file prefetch tests are too slow, don't validate data | Sub-task | Resolved | Viraj Jasani |
        41. | s3a prefetch read/write file operations should guard channel close | Sub-task | Resolved | Viraj Jasani |
        42. | s3a prefetch LRU cache eviction metric | Sub-task | Resolved | Viraj Jasani |
        43. | S3ACachingInputStream.ensureCurrentBuffer(): lazy seek means all reads look like random IO | Sub-task | Open | Unassigned |
        44. | ITestS3APrefetchingCacheFiles teardown failure if setup() fails | Sub-task | Open | Unassigned |
        45. | Use builder for prefetch CachingBlockManager | Sub-task | Resolved | Viraj Jasani |
        46. | S3A prefetching to support Vector IO | Sub-task | Open | Unassigned |

          People

            Assignee: bhalchandrap Bhalchandra Pandit
            Reporter: bhalchandrap Bhalchandra Pandit

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 44h 20m