Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.2.1
Fix Version/s: None
Description
SharedInMemoryCache holds the FileStatus of every file, regardless of whether partition columns are specified. This causes long load times for queries that use only a couple of partitions, because Spark loads the paths of files from all partitions.
I partitioned the files by report_date and type, and I have a directory structure like
/custom_path/report_date=2018-07-24/type=A/file_1.parquet
I am trying to execute
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count
In my query I need to load only the files of type A, which is just a couple of files. But Spark loads all 19K files from all partitions into SharedInMemoryCache, which takes about 60 seconds, and only after that discards the unused partitions.
This could be related to https://jira.apache.org/jira/browse/SPARK-17994
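A possible workaround, sketched here and not verified against this issue, is to point the reader at the leaf partition directory so that file listing starts there, and to set the `basePath` option so that report_date and type are still exposed as columns. The object name and app name below are illustrative only, and the job is assumed to be launched with spark-submit (which supplies the master).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver object, for illustration only.
object PartitionPathRead {
  def main(args: Array[String]): Unit = {
    // Assumes the master is supplied externally, e.g. by spark-submit.
    val spark = SparkSession.builder()
      .appName("partition-path-read")
      .getOrCreate()

    val count = spark.read
      // basePath tells partition discovery where the table root is,
      // so report_date and type remain available as columns.
      .option("basePath", "/custom_path")
      // Read the leaf partition directory directly instead of filtering
      // on type after listing the whole report_date directory.
      .parquet("/custom_path/report_date=2018-07-24/type=A")
      .count()

    println(count)
    spark.stop()
  }
}
```

Whether this avoids the 60-second listing depends on how the file index is built in the affected version, so the timing should be measured before relying on it.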