Spark / SPARK-45373

Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      In the rule PruneFileSourcePartitions, where a CatalogFileIndex is converted to an InMemoryFileIndex, calls to the Hive Metastore (HMS) can become very expensive in two cases:
      1) The filter string translated for pushdown to the HMS layer is empty, so all partitions are fetched, and the same table is referenced multiple times in the query.
      2) The same table is referenced multiple times in the query with different partition filters.
      In both cases the current code makes multiple calls to the HMS layer for the same table.
      This can be avoided by grouping the table references by CatalogFileIndex and passing a common minimum filter, the disjunction of the individual filters (filter1 || filter2), to obtain a single base PrunedInMemoryFileIndex. That base index then serves as the starting point for the pruned index of each individual table reference.
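
      The grouping idea can be illustrated with a small, self-contained Python simulation. This is only a sketch under stated assumptions, not the code from the PR and not Spark's API: FakeMetastore, prune_per_reference and prune_grouped are hypothetical names, partitions are modelled as plain dicts, and filters as Python predicates. It compares one metastore call per table reference against one call per distinct table using the OR of that table's filters, followed by in-memory narrowing for each reference.

```python
from collections import defaultdict


class FakeMetastore:
    """Hypothetical stand-in for HMS that counts partition-listing calls."""

    def __init__(self, partitions_by_table):
        self.partitions_by_table = partitions_by_table
        self.calls = 0

    def list_partitions(self, table, predicate):
        self.calls += 1
        return [p for p in self.partitions_by_table[table] if predicate(p)]


def prune_per_reference(metastore, references):
    """Baseline: one metastore call for every (table, filter) reference."""
    return [metastore.list_partitions(table, pred) for table, pred in references]


def prune_grouped(metastore, references):
    """Sketch of the proposal: one call per distinct table with the OR of
    its filters, then each reference narrows the shared base in memory."""
    by_table = defaultdict(list)
    for idx, (table, pred) in enumerate(references):
        by_table[table].append((idx, pred))

    results = [None] * len(references)
    for table, entries in by_table.items():
        preds = [pred for _, pred in entries]
        # Default argument binds this table's predicate list to the lambda.
        combined = lambda p, preds=preds: any(f(p) for f in preds)
        base = metastore.list_partitions(table, combined)  # filter1 || filter2
        for idx, pred in entries:
            results[idx] = [p for p in base if pred(p)]
    return results


# Illustrative data: one table referenced twice with different filters.
partitions = {"sales": [{"year": y} for y in range(2018, 2024)]}
refs = [
    ("sales", lambda p: p["year"] == 2020),
    ("sales", lambda p: p["year"] >= 2022),
]

baseline_ms = FakeMetastore(partitions)
baseline = prune_per_reference(baseline_ms, refs)

grouped_ms = FakeMetastore(partitions)
grouped = prune_grouped(grouped_ms, refs)

print(baseline_ms.calls, grouped_ms.calls)
```

      Both strategies return the same partitions for each reference, while the grouped variant contacts the fake metastore once per distinct table. If one of the filters is empty (a predicate that accepts everything), the combined filter also accepts everything, so all partitions are still fetched, but only once.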

      The following PR has been opened for this ticket:
      SPARK-45373-PR

          People

            Assignee: Unassigned
            Reporter: ashahid7 Asif
