Spark / SPARK-34204

When input_file_name() is used, all columns from the file appear in the physical plan of the query, not only the projected ones.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.7
    • Fix Version/s: 3.1.1
    • Component/s: SQL
    • Labels: None

    Description

      The input_file_name() function prevents the projection from being applied in the physical plan of the query.
      When this function is used to add a new column, column-oriented formats such as Parquet and ORC put all columns of the file into the physical plan.
      Without it, only the selected columns are read.
      In my case, the performance impact is a factor of 30.

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions._
      
      object TestSize {
      
        def main(args: Array[String]): Unit = {
          implicit val spark: SparkSession = SparkSession.builder()
            .master("local")
            .config("spark.sql.shuffle.partitions", "5")
            .getOrCreate()
      
          import spark.implicits._
      
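          // Selects two columns plus the source file name:
          // the physical plan contains every column of the file.
          val query1 = spark.read.parquet(
            "s3a://part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet"
          )
            .select($"app_id", $"idfa", input_file_name().as("fileName"))
            .distinct()
            .count()
      
          // Same query without input_file_name():
          // the physical plan contains only the two selected columns.
          val query2 = spark.read.parquet(
            "s3a://part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet"
          )
            .select($"app_id", $"idfa")
            .distinct()
            .count()
      
          // Keep the application running (presumably so the plans can be inspected in the Spark UI).
          Thread.sleep(10000000000L)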
      
        }
      
      }
      

      `query1` has all columns of the file in the physical plan, while `query2` has only the two selected columns.
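
      One way to compare the two plans without S3 access or the Spark UI is to print them with explain() and look at the ReadSchema entry of the file scan node. The following is only a minimal sketch, not the original reproduction: the object name `ExplainReadSchema`, the temporary local path, and the extra `payload` column are assumptions made for illustration, and the output will depend on the Spark version in use.

      ```scala
      import java.nio.file.Files

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.input_file_name

      object ExplainReadSchema {

        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .master("local[1]")
            .appName("ExplainReadSchema")
            .getOrCreate()

          import spark.implicits._

          // Hypothetical sample data: `payload` stands in for the columns
          // of the real file that the query does not select.
          val path = Files.createTempDirectory("spark-34204-").resolve("data").toString
          Seq(("app-1", "idfa-1", "unused"))
            .toDF("app_id", "idfa", "payload")
            .write.parquet(path)

          val df = spark.read.parquet(path)

          // Plan with input_file_name(): check which columns ReadSchema lists.
          df.select($"app_id", $"idfa", input_file_name().as("fileName")).explain()

          // Plan without input_file_name(): check ReadSchema again and compare.
          df.select($"app_id", $"idfa").explain()

          spark.stop()
        }
      }
      ```

      If the issue is present, the ReadSchema of the first plan should list `payload` in addition to `app_id` and `idfa`, while the second plan should list only the two selected columns.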

          People

            Assignee: Unassigned
            Reporter: Nick Hryhoriev (hryhoriev.nick)