Spark / SPARK-17861

Store data source partitions in metastore and push partition pruning into metastore

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      Spark SQL originally did not store any partition information in the catalog for data source tables, because it was designed to work with arbitrary files. This, however, causes a few issues for catalog tables:

      1. Listing partitions for a large table (with millions of partitions) can be very slow during cold start.
      2. Heterogeneous partition naming schemes are not supported.
      3. Partition pruning cannot be pushed into the metastore.

      This ticket tracks the work required to move partition tracking into the metastore. This change should be behind a feature flag.
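
      The difference between path-based partition discovery and metastore-tracked partitions can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyMetastore`, `discover_from_paths`, `prune`, and the `use_metastore` flag are hypothetical names invented for illustration and are not Spark or Hive APIs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMetastore:
    """Hypothetical in-memory stand-in for a metastore (not a Spark/Hive API)."""
    partitions: dict = field(default_factory=dict)  # partition spec -> location

    def add_partition(self, spec: dict, location: str) -> None:
        # Store the spec as a sorted tuple so it can be used as a dict key.
        self.partitions[tuple(sorted(spec.items()))] = location

    def get_partitions_by_filter(self, predicate) -> list:
        # Pruning happens here, on partition metadata only; no files are listed.
        return [loc for key, loc in self.partitions.items() if predicate(dict(key))]


def discover_from_paths(paths):
    """Path-based discovery: infer partition values from `col=value` directory names."""
    found = {}
    for path in paths:
        spec = {}
        for segment in path.strip("/").split("/"):
            if "=" in segment:
                col, value = segment.split("=", 1)
                spec[col] = value
        if spec:  # paths without any `col=value` segment cannot be recognized
            found[path] = spec
    return found


def prune(paths, metastore, predicate, use_metastore):
    """Return matching partition locations; `use_metastore` plays the feature flag."""
    if use_metastore:
        return sorted(metastore.get_partitions_by_filter(predicate))
    # Fallback: every path must be listed and parsed before filtering.
    discovered = discover_from_paths(paths)
    return sorted(p for p, spec in discovered.items() if predicate(spec))


paths = ["/t/day=2016-10-01", "/t/day=2016-10-02", "/archive/oct03"]

metastore = ToyMetastore()
metastore.add_partition({"day": "2016-10-01"}, "/t/day=2016-10-01")
metastore.add_partition({"day": "2016-10-02"}, "/t/day=2016-10-02")
# A partition whose location does not follow the `col=value` naming scheme.
metastore.add_partition({"day": "2016-10-03"}, "/archive/oct03")


def recent(spec):
    return spec["day"] >= "2016-10-02"


with_flag = prune(paths, metastore, recent, use_metastore=True)
without_flag = prune(paths, metastore, recent, use_metastore=False)
print(with_flag)
print(without_flag)
```

      In this toy model the flagged path answers the query from partition metadata alone and can return the partition stored under a custom location, while the path-based fallback has to parse every directory name and never sees `/archive/oct03`.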

      Sub-Tasks

          1. Load only catalog table partition metadata required to answer a query (Sub-task, Resolved, Michael MacFadden)
          2. Feature flag SPARK-16980 (Sub-task, Resolved, Eric Liang)
          3. Refactor FileCatalog classes to simplify the inheritance tree (Sub-task, Resolved, Eric Liang)
          4. Fix refreshByPath for converted Hive tables (Sub-task, Resolved, Eric Liang)
          5. Enable metastore partition pruning for unconverted hive tables by default (Sub-task, Resolved, Eric Liang)
          6. Add back a file status cache for catalog tables (Sub-task, Resolved, Eric Liang)
          7. should not always lowercase partition columns of partition spec in parser (Sub-task, Resolved, Wenchen Fan)
          8. Use metastore for managing filesource table partitions as well (Sub-task, Resolved, Wenchen Fan)
          9. put hive serde table schema to table properties like data source table (Sub-task, Resolved, Wenchen Fan)
          10. Can't filter over mixed case parquet columns of converted Hive tables (Sub-task, Resolved, Wenchen Fan)
          11. Optimize insert to not require REPAIR TABLE (Sub-task, Resolved, Eric Liang)
          12. ExternalCatalogSuite should test with mixed case fields (Sub-task, Resolved, Wenchen Fan)
          13. Avoid using Union to chain together create table and repair partition commands (Sub-task, Resolved, Eric Liang)
          14. INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition (Sub-task, Resolved, Eric Liang)
          15. INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations (Sub-task, Resolved, Eric Liang)
          16. data source tables should support truncating partition (Sub-task, Resolved, Wenchen Fan)
          17. HiveClient.getPartitionsByFilter throws an exception for some unsupported filters when hive.metastore.try.direct.sql=false (Sub-task, Resolved, Michael MacFadden)
          18. Rename partitionProviderIsHive -> tracksPartitionsInCatalog (Sub-task, Resolved, Reynold Xin)
          19. ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names (Sub-task, Resolved, Wenchen Fan)
          20. correct several partition related behaviours of ExternalCatalog (Sub-task, Resolved, Wenchen Fan)
          21. Revert hacks in parquet and orc reader to support case insensitive resolution (Sub-task, Resolved, Eric Liang)
          22. Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions (Sub-task, Resolved, Eric Liang)
          23. Update documentation for hive partition management in 2.1 (Sub-task, Resolved, Eric Liang)
          24. Append with df.saveAsTable writes data to wrong location (Sub-task, Resolved, Eric Liang)
          25. Major performance regression in SHOW PARTITIONS on partitioned Hive tables (Sub-task, Resolved, Wenchen Fan)
          26. Verify number of hive client RPCs in PartitionedTablePerfStatsSuite (Sub-task, Resolved, Eric Liang)
          27. Partition name/values not escaped correctly in some cases (Sub-task, Resolved, Eric Liang)
          28. Regression in file listing performance (Sub-task, Resolved, Eric Liang)
          29. Incorrect behaviors in overwrite table for datasource tables (Sub-task, Resolved, Eric Liang)
          30. Creating a partitioned datasource table should not scan all files for table (Sub-task, Resolved, Eric Liang)
          31. Return Nothing when Querying a Partitioned Data Source Table without Repairing it (Sub-task, Closed, Unassigned)

            People

              Assignee: Eric Liang (ekhliang)
              Reporter: Reynold Xin (rxin)
              Votes: 0
              Watchers: 14
