  SPARK-37027

Fix inconsistent behavior in Hive table when 'path' is provided in SERDEPROPERTIES


Details

    • Type: Improvement
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: 2.4.5, 3.1.2
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      If a Hive table is created with both WITH SERDEPROPERTIES ('path'='<tableLocation>') and LOCATION <tableLocation>, Spark can return doubled rows when reading the table. This issue seems to be an extension of SPARK-30507.

      Steps to reproduce:

      1. Create the table and insert records via Hive (Spark does not allow inserting into a table like this)
        CREATE TABLE `test_table`(
          `c1` BIGINT,
          `c2` STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
        WITH SERDEPROPERTIES ('path'='<tableLocationPath>'" )
        STORED AS
          INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
        LOCATION '<tableLocationPath>';
        
        INSERT INTO TABLE `test_table`
        VALUES (0, '0');
        
        SELECT * FROM `test_table`;
        -- will return
        -- 0 0
        
      2. Read the above table from Spark
        SELECT * FROM `test_table`;
        -- will return
        -- 0 0
        -- 0 0
        

      However, if we set spark.sql.hive.convertMetastoreParquet=false, Spark returns the same result as Hive (i.e. a single row); see the snippet below.
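
      A minimal sketch of that check, assuming the same `test_table` created in step 1:

        SET spark.sql.hive.convertMetastoreParquet=false;

        SELECT * FROM `test_table`;
        -- now returns a single row, matching Hive
        -- 0 0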

      A similar case: if a Hive table is created with both WITH SERDEPROPERTIES ('path'='<anotherPath>') and LOCATION '<tableLocation>', Spark reads rows from both anotherPath and tableLocation, regardless of the value of spark.sql.hive.convertMetastoreParquet. Hive, however, seems to return only the rows under tableLocation (see the sketch below).
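
      A hedged sketch of this variant; the table name `test_table_mismatch` is illustrative, not from the original report:

        CREATE TABLE `test_table_mismatch`(
          `c1` BIGINT,
          `c2` STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
        WITH SERDEPROPERTIES ('path'='<anotherPath>')
        STORED AS
          INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
        LOCATION '<tableLocationPath>';
        -- Spark: SELECT * returns rows under both <anotherPath> and <tableLocationPath>,
        --        regardless of spark.sql.hive.convertMetastoreParquet
        -- Hive:  SELECT * appears to return only the rows under <tableLocationPath>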

      Another similar case: if 'path' is provided in TBLPROPERTIES instead, Spark does not duplicate the rows when 'path'='<tableLocation>'. If 'path'='<anotherPath>', Spark again reads rows from both anotherPath and tableLocation, while Hive seems to ignore the 'path' entry in TBLPROPERTIES entirely (see the sketch below).
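
      A hedged sketch of the TBLPROPERTIES variant; the table name `test_table_tblprops` is illustrative, and STORED AS PARQUET is used as shorthand for the same Parquet SerDe and input/output formats:

        CREATE TABLE `test_table_tblprops`(
          `c1` BIGINT,
          `c2` STRING)
        STORED AS PARQUET
        LOCATION '<tableLocationPath>'
        TBLPROPERTIES ('path'='<anotherPath>');
        -- Spark: SELECT * returns rows under both <anotherPath> and <tableLocationPath>
        -- Hive:  appears to ignore the 'path' table property and reads only <tableLocationPath>
        -- With 'path'='<tableLocationPath>' instead, Spark does not duplicate the rows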

      Code examples for the above cases (a diff patch against HiveParquetMetastoreSuite.scala) can be found in the Attachments.

      Attachments


            People

              Assignee: Unassigned
              Reporter: Yuzhou Sun
              Votes: 0
              Watchers: 2
