Spark / SPARK-24974

Spark puts all file paths into SharedInMemoryCache, even for unused partitions

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: SQL

    Description

      SharedInMemoryCache holds the FileStatus of every file, whether or not the query filters on partition columns. This causes long load times for queries that use only a couple of partitions, because Spark loads the file paths from all partitions.

      I partitioned the files by report_date and type, and I have a directory structure like

      /custom_path/report_date=2018-07-24/type=A/file_1.parquet
      

       

      I am trying to execute 

      val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter("type == 'A'").count
      

       

      In my query I need to load only the files of type A, which is just a couple of files. But Spark loads all 19K files from all partitions into SharedInMemoryCache, which takes about 60 seconds, and only after that discards the unused partitions.
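
      A possible workaround, sketched here under assumptions and not confirmed in this ticket: point the reader at the leaf partition directory so that only that directory is listed, and pass the documented basePath option so that partition discovery should still expose report_date and type as columns. The session setup, the countTypeA helper name, and the basePath and partitionDir values are illustrative assumptions; the paths are taken from the example above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: the partitioned data set is rooted at /custom_path.
val basePath = "/custom_path"

// Leaf directory that holds only the type=A files for the requested date.
val partitionDir = s"$basePath/report_date=2018-07-24/type=A"

// Hypothetical helper: counts rows by reading only the leaf partition directory.
// Because the read path is the leaf directory, file listing is limited to it
// instead of covering every type=... directory under the date.
def countTypeA(spark: SparkSession): Long =
  spark.read
    .option("basePath", basePath) // tells partition discovery where the table starts
    .parquet(partitionDir)
    .count()

// Usage (assumption: in spark-shell, where `spark` already exists):
//   val count = countTypeA(spark)
```

      This only helps when the needed partitions are known before the read; it does not change how Spark fills the cache when a filter is applied after reading a parent directory.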

       

      This could be related to https://jira.apache.org/jira/browse/SPARK-17994 

          People

            Assignee: Unassigned
            Reporter: stanand99 andrzej.stankevich@gmail.com
            Votes: 1
            Watchers: 3
