HIVE-15546 (sub-task of HIVE-14269: Performance optimizations for data on S3)

Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: Hive
    • Labels: None

    Description

      When running on blobstores (like S3) where metadata operations (like listStatus) are costly, Utilities.getInputPaths() can add significant overhead when setting up the input paths for an MR / Spark / Tez job.

      The method calls listStatus on every input path to check whether the path is empty. If it is, a dummy file is created for that partition. All of this happens sequentially, so it can be very slow when there are many empty partitions, and it can still take a long time even when every partition has input data.

      We should either:

      (1) Just remove the logic to check if each input path is empty, and handle any edge cases accordingly.

      (2) Multi-thread the listStatus calls.
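
      Option (2) could look roughly like the sketch below. It is not the code from the attached patches: the class name ParallelEmptyCheck, the method findEmptyPaths, and the `lister` function are hypothetical, and `lister` only stands in for a costly metadata call such as listStatus so that the example runs with the JDK alone. The idea is to submit one task per input path to a fixed-size thread pool and collect the empty paths once all listings have returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelEmptyCheck {

    /**
     * Returns the input paths whose listing is empty, in input order.
     * `lister` is a hypothetical stand-in for a costly call such as listStatus.
     */
    static List<String> findEmptyPaths(List<String> paths,
                                       Function<String, List<String>> lister,
                                       int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // One task per path; each task performs a single listing.
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (String path : paths) {
                tasks.add(() -> lister.apply(path).isEmpty());
            }
            // invokeAll blocks until every task is done and returns
            // the futures in the same order as the submitted tasks.
            List<Future<Boolean>> results = pool.invokeAll(tasks);
            List<String> empty = new ArrayList<>();
            for (int i = 0; i < paths.size(); i++) {
                if (results.get(i).get()) {
                    empty.add(paths.get(i));
                }
            }
            return empty;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-in for a blobstore: path -> files under that path.
        Map<String, List<String>> fakeStore = new HashMap<>();
        fakeStore.put("/t/p=1", Arrays.asList("part-00000"));
        fakeStore.put("/t/p=2", Collections.emptyList());
        fakeStore.put("/t/p=3", Collections.emptyList());

        List<String> inputs = Arrays.asList("/t/p=1", "/t/p=2", "/t/p=3");
        List<String> empty = findEmptyPaths(
                inputs,
                p -> fakeStore.getOrDefault(p, Collections.emptyList()),
                4);
        System.out.println(empty); // prints [/t/p=2, /t/p=3]
    }
}
```

      In a real job the dummy file for each empty partition would still have to be created after the listings finish, and the pool size would need to be configurable, since too many concurrent requests against a blobstore can be throttled.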

      Attachments

        1. HIVE-15546.6.patch
          9 kB
          Sahil Takiar
        2. HIVE-15546.5.patch
          9 kB
          Sahil Takiar
        3. HIVE-15546.4.patch
          9 kB
          Sahil Takiar
        4. HIVE-15546.3.patch
          6 kB
          Sahil Takiar
        5. HIVE-15546.2.patch
          6 kB
          Sahil Takiar
        6. HIVE-15546.1.patch
          0.7 kB
          Sahil Takiar


          People

            Assignee: Sahil Takiar (stakiar)
            Reporter: Sahil Takiar (stakiar)
            Votes: 0
            Watchers: 9
