This is a request for a feature, that enables SparkSQL to obtain the files for a Hive partition, by calling inputFormat.getSplits(), as opposed to listing files directly, while still using Spark's optimized Parquet readers for actual IO. (Note that the difference between this and falling back entirely to Hive via spark.sql.hive.convertMetastoreParquet=false is that we get to realize benefits such as new parquet reader, schema merging etc in SparkSQL)
Some background the context, using our use-case at Uber. We have Hive tables, where each partition contains versioned files (whenever records in a file change, we produce a new version, to speed up database ingestion) and such tables are registerd with a custom InputFormat that just filters out old versions and just returns the latest version of each file to the query.
We have this working for 5 months now across Hive/Spark/Presto as follows
- Hive : Works out of box, by calling the inputFormat.getSplits, so we are good there
- Presto: We made the fix in Presto, similar to whats proposed here.
- Spark : We set convertMetastoreParquet=false. Perf is actually comparable for our use-cases, but we run into schema merging issues now and then.
we have explored a few approaches here and would like to get more feedback from you all, before we go further..