[SPARK-24974] Spark puts all file paths into SharedInMemoryCache even for unused partitions - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.2.1
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

SharedInMemoryCache holds the FileStatus of every file, regardless of whether partition columns are specified. This causes long load times for queries that use only a couple of partitions, because Spark loads the paths of files from all partitions.

I partitioned the files by report_date and type, and I have a directory structure like

/custom_path/report_date=2018-07-24/type=A/file_1.parquet

I am trying to execute

val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter("type == 'A'").count

In my query I need to load only the files of type A, which is just a couple of files. But Spark loads all 19K files from all partitions into SharedInMemoryCache, which takes about 60 seconds, and only after that discards the unused partitions.
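A possible workaround, sketched here under assumptions and not confirmed anywhere in this issue, is to pass the leaf partition directory to the reader and set Spark's `basePath` option so that `report_date` and `type` are still discovered as columns. Since the reader is then given only that directory as its root path, file listing should be limited to it. The application name and the local variable names are illustrative, and whether this removes the 60-second load in the reporter's environment is untested.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: in spark-shell a session named `spark` already exists;
// getOrCreate() reuses it, otherwise it builds one from the submitted config.
val spark = SparkSession.builder().appName("partition-listing-sketch").getOrCreate()

// Root of the partitioned dataset, taken from the directory layout in this report.
val basePath = "/custom_path"

// Leaf directory for the single partition the query needs.
val typeAPath = s"$basePath/report_date=2018-07-24/type=A"

val count = spark.read
  .option("basePath", basePath) // keep report_date and type as partition columns
  .parquet(typeAPath)           // only this directory is passed as a root path
  .count()
```

The trade-off is that the partition values move from the filter expression into the path, so the caller has to know the directory layout.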

This could be related to https://jira.apache.org/jira/browse/SPARK-17994

People

Assignee: Unassigned

Reporter: andrzej.stankevich@gmail.com

Votes: 1

Watchers: 3

Dates

Created: 30/Jul/18 22:35

Updated: 08/Oct/19 05:41

Resolved: 08/Oct/19 05:41