SPARK-22247: Hive partition filter very slow


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.0.2, 2.1.1
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels: None

    Description

      I found an issue where filtering partitions using a DataFrame results in very poor performance.

      To reproduce:
      Create a Hive table with a large number of partitions and write a Spark query on that table that filters on the partition column.

      In my use case I've got a table with about 30k partitions.
      I filter the partitions with some Scala in spark-shell:
      table.filter("partition=x or partition=y")
      This results in the Hive Thrift API call {{get_partitions('db', 'table', -1)}}, which is very slow (minutes) and loads all metastore partitions into memory.
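
      For example, a minimal reproduction along these lines (the table and partition column names here are hypothetical placeholders, not the actual table from my use case):

      {code:scala}
      // Placeholder table partitioned by "dt"; in the real case the table has ~30k partitions.
      spark.sql("CREATE TABLE IF NOT EXISTS logs (value STRING) PARTITIONED BY (dt STRING)")

      val table = spark.table("logs")

      // Slow path described above: the OR over the partition column is not pushed down,
      // so the metastore call is get_partitions('db', 'table', -1) and every partition
      // object is loaded into memory before the filter is applied.
      table.filter("dt = 'x' or dt = 'y'").count()
      {code}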

      Doing a simpler filter:
      table.filter("partition=x")
      results in the Hive Thrift API call {{get_partitions_by_filter('db', 'table', 'partition = "x"', -1)}}, which is very fast (seconds) and only fetches partition x into memory.
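
      The fast case, with the same placeholder names as the sketch above:

      {code:scala}
      // Fast path: a single equality predicate on the partition column is pushed to the
      // metastore via get_partitions_by_filter, so only the matching partition is fetched.
      table.filter("dt = 'x'").count()
      {code}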

      If possible, Spark should translate the filter into the more performant Thrift call, or fall back to a more scalable solution that filters the partitions without having to load them all into memory first (for instance, fetching the partitions in batches).
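
      Until then, a possible client-side workaround (a sketch only, assuming the behaviour described above and using the same placeholder names) is to split the OR into separate equality filters, which each take the fast get_partitions_by_filter path, and union the results:

      {code:scala}
      // Sketch of a workaround: one equality filter per wanted partition value,
      // so each scan can be pruned via get_partitions_by_filter, then union the results.
      val wanted = Seq("x", "y")
      val result = wanted
        .map(v => table.filter(s"dt = '$v'"))
        .reduce(_ union _)
      result.count()
      {code}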

      I've posted my original question on Stack Overflow.


            People

              Assignee: Unassigned
              Reporter: Patrick Duin (patduin)
              Votes: 0
              Watchers: 3
