  1. Spark
  2. SPARK-28266

data duplication when `path` serde property is present


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0, 2.2.1, 2.2.2
    • Fix Version/s: 3.2.0, 3.1.3, 3.0.4
    • Component/s: Spark Core

    Description

      Spark returns duplicated rows when the `path` serde property is present on a Parquet table.

      Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.

      Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 at least).

      Reproducer:

      >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
      DataFrame[]
      
      >>> spark.table("ruslan_test.test55").explain()
      
      == Physical Plan ==
      HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
      
      >>> spark.table("ruslan_test.test55").count()
      1
      
      

      (All is good at this point. Now exit the session and run the following, for example in Hive.)

      ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
      

      The table LOCATION and the serde `path` property now point to the same directory.
      Back in Spark, count() returns two records instead of one:

      >>> spark.table("ruslan_test.test55").count()
      2
      
      >>> spark.table("ruslan_test.test55").explain()
      == Physical Plan ==
      *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
      >>>
      
      

      Also notice that the presence of the `path` serde property makes the table location
      show up twice in the file index:

      InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, hdfs://epsdatalake/hive...,
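To illustrate why a repeated root path doubles the count, here is a toy model in plain Python. It is only a sketch and not Spark's InMemoryFileIndex: `count_rows`, `unique_roots`, and `rows_by_dir` are hypothetical names, and the single-row directory stands in for the table from the reproducer.

```python
# Toy model only: this is NOT Spark's InMemoryFileIndex implementation.

def count_rows(root_paths, rows_by_dir):
    """Sum row counts over every listed root path, as a naive scan would."""
    return sum(rows_by_dir.get(p.rstrip("/"), 0) for p in root_paths)


def unique_roots(root_paths):
    """Drop repeated root paths (ignoring trailing slashes), keeping order."""
    seen, out = set(), []
    for p in root_paths:
        key = p.rstrip("/")
        if key not in seen:
            seen.add(key)
            out.append(p)
    return out


table_dir = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"
rows_by_dir = {table_dir: 1}      # the table holds a single row
roots = [table_dir, table_dir]    # LOCATION plus the serde `path` property

naive = count_rows(roots, rows_by_dir)                   # scans the directory twice
deduped = count_rows(unique_roots(roots), rows_by_dir)   # scans it once
print(naive, deduped)
```

In this model the naive scan reports two rows and the de-duplicated scan reports one, which matches the symptom described in this report.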

      We have some applications that create Parquet tables in Hive with the `path` serde property,
      which duplicates data in query results.

      Hive, Impala, and other engines, as well as Spark 2.1 and earlier, read such tables correctly; Spark 2.2 and later releases do not.
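For anyone auditing existing tables, a small helper along these lines could flag the ones that carry a `path` serde property. This is a sketch under assumptions: `classify_path_property` is a hypothetical name, and the location and serde properties are assumed to have been collected already (for example from `SHOW CREATE TABLE` or `DESCRIBE FORMATTED` output) and passed in as a plain string and dict.

```python
def classify_path_property(location, serde_properties):
    """Classify a table by its `path` serde property.

    Returns "absent" if no `path` property is set, "same" if it matches the
    table location (the case in this report), and "different" otherwise.
    Trailing slashes are ignored when comparing.
    """
    path = serde_properties.get("path")
    if path is None:
        return "absent"
    if path.rstrip("/") == location.rstrip("/"):
        return "same"
    return "different"


location = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"

before_alter = classify_path_property(location, {"serialization.format": "1"})
after_alter = classify_path_property(location, {"path": location})
print(before_alter, after_alter)
```

Tables classified as "same" correspond to the configuration in the reproducer above; whether a "different" value causes other problems is not covered by this report.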

            People

              Assignee: Shardul Mahadik (shardulm)
              Reporter: Ruslan Dautkhanov (Tagar)