  1. Apache NiFi
  2. NIFI-7886

FetchAzureBlobStorage, FetchS3Object, and FetchGCSObject processors should be able to fetch ranges

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.12.0, 1.13.0
    • Fix Version/s: 1.14.0
    • Component/s: Extensions

    Description

      Azure Blob Storage, AWS S3, and Google Cloud Storage all support retrieving byte ranges of stored objects.  Current versions of NiFi processors for these services do not support fetching by byte range.
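At the HTTP level, a byte-range request is typically expressed with a `Range` header whose end offset is inclusive. The following is a minimal sketch of building that header value from a start offset and a length; the function name `range_header` is an assumption made for illustration and is not part of NiFi or any cloud SDK.

```python
def range_header(start: int, length: int) -> str:
    """Return an HTTP Range header value for `length` bytes beginning at `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    return f"bytes={start}-{start + length - 1}"


print(range_header(0, 1024))     # bytes=0-1023
print(range_header(1024, 10))    # bytes=1024-1033
```

Expressing a range as start plus length, rather than start plus end, avoids off-by-one mistakes when adjacent segments are computed.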

      Fetching by range would enable several enhancements:

      • Parallelized downloads
        • Faster transfers when a single connection's throughput, limited by its window size and round-trip time, is lower than the available bandwidth
        • Load distribution over a cluster
      • Cost savings
        • If only part of a large file is needed, just that part can be downloaded, saving the bandwidth cost of retrieving unnecessary bytes
        • A failed download would only need to retry the failed segment, rather than the full file
      • Downloading extremely large files
        • Files larger than the available content repository could be downloaded one segment at a time, moving each segment to a system with more capacity before downloading the next

       

      Some of these enhancements would require an upstream processor to generate multiple flow files, each covering a different part of the overall range, for example:
      ListS3 -> ExecuteGroovyScript (to split into multiple flow files with different range attributes) -> FetchS3Object.
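The splitting step in the middle of that flow amounts to simple range arithmetic. The sketch below is written in plain Python rather than as a NiFi script, and it shows one way to compute the per-flow-file attributes. The attribute names `range.start`, `range.length`, and `segment.index` are hypothetical and chosen only for illustration; the object size is taken as a plain integer.

```python
from typing import Dict, List


def split_into_ranges(object_size: int, segment_size: int) -> List[Dict[str, str]]:
    """Return one attribute map per segment covering [0, object_size)."""
    if object_size < 0 or segment_size <= 0:
        raise ValueError("object_size must be >= 0 and segment_size must be > 0")
    segments = []
    for index, start in enumerate(range(0, object_size, segment_size)):
        # The final segment may be shorter than segment_size.
        length = min(segment_size, object_size - start)
        segments.append({
            "range.start": str(start),      # hypothetical attribute name
            "range.length": str(length),    # hypothetical attribute name
            "segment.index": str(index),    # hypothetical attribute name
        })
    return segments


for attrs in split_into_ranges(10, 4):
    print(attrs["segment.index"], attrs["range.start"], attrs["range.length"])
# 0 0 4
# 1 4 4
# 2 8 2
```

Values are emitted as strings because flow file attributes are string-valued; a downstream fetch processor would then read the start and length from each flow file.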

            People

              Assignee: Paul Kelly (pkelly.nifi)
              Reporter: Paul Kelly (pkelly.nifi)
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5h