SPARK-39993: Spark on Kubernetes doesn't filter data by date


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.2
    • Fix Version/s: None
    • Component/s: Kubernetes
    • Environment:
      Kubernetes v1.23.6
      Spark 3.2.2
      Java 1.8.0_312
      Python 3.9.13
      AWS dependencies: aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar

    Description

      I'm creating a Dataset with a date-typed column and saving it to S3. When I read it back and filter it with a where() clause, no data is returned even though the matching rows are there.

      Below is the code snippet I'm running:

      from pyspark.sql.functions import col, lit

      # Build a Dataset with a date-typed column
      ds = spark.range(10).withColumn("date", lit("2022-01-01").cast("date"))

      # Filtering before the round trip through S3 returns rows
      ds.where("date = '2022-01-01'").show()

      # Write to S3 and read it back
      ds.write.mode("overwrite").parquet("s3a://bucket/test")
      df = spark.read.format("parquet").load("s3a://bucket/test")

      # Filtering after the round trip returns nothing on the Kubernetes master
      df.where("date = '2022-01-01'").show()
      

      The first show() returns data, while the second one returns nothing.
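
      One way to narrow this down may be to check whether the date filter is pushed into the Parquet scan, and whether disabling pushdown changes the result. A diagnostic sketch only; that pushdown is involved is an assumption, not a confirmed cause:

      # Show the physical plan; check PushedFilters on the Parquet scan
      df.where("date = '2022-01-01'").explain(True)

      # Diagnostic only: re-read with Parquet filter pushdown disabled
      spark.conf.set("spark.sql.parquet.filterPushdown", "false")
      spark.read.parquet("s3a://bucket/test").where("date = '2022-01-01'").show()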

      I've noticed that it's related to the Kubernetes master, as the same code snippet works fine with master "local".
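
      For reference, a minimal sketch of the two setups. The API server URL and container image are placeholders, and each snippet assumes a fresh session, since getOrCreate() reuses an existing SparkSession:

      from pyspark.sql import SparkSession

      # Setup that works: local master
      spark = SparkSession.builder.master("local[*]").getOrCreate()

      # Setup that doesn't filter: Kubernetes master (placeholder values below)
      spark = (
          SparkSession.builder
          .master("k8s://https://<api-server>:6443")
          .config("spark.kubernetes.container.image", "<spark-image>")
          .getOrCreate()
      )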

      UPD: if the column is used as a partition column and has the type "date", there is no filtering problem.
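
      A sketch of that workaround (the output path here is a placeholder):

      # Writing with "date" as a partition column avoids the problem
      ds.write.partitionBy("date").mode("overwrite").parquet("s3a://bucket/test_by_date")

      # The partition column is inferred back as a date, and filtering works
      df2 = spark.read.parquet("s3a://bucket/test_by_date")
      df2.where("date = '2022-01-01'").show()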



      People

        Assignee: Unassigned
        Reporter: Hanna Liashchuk (h.liashchuk)
        Votes: 0
        Watchers: 2
