SPARK-30559

spark.sql.hive.caseSensitiveInferenceMode does not work with Hive


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.4
    • Fix Version/s: None
    • Component/s: SQL
    • Environment: EMR 28.1 with Spark 2.4.4, Hadoop 2.8.5 and Hive 2.3.6

    Description

      In Spark SQL, the spark.sql.hive.caseSensitiveInferenceMode modes INFER_ONLY and INFER_AND_SAVE do not work as intended. Both are supposed to infer a case-sensitive schema from the underlying files, but neither does.

      1. INFER_ONLY never works: it always uses the lowercase column names from the Hive metastore schema
      2. INFER_AND_SAVE only works from the second spark.sql("SELECT …") call onward: the first call writes the inferred schema to TBLPROPERTIES in the metastore, and subsequent calls read that saved schema, so they do work
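      The mechanism behind both symptoms is the same: the Hive metastore stores column names lowercased, the Parquet footers keep the original casing, and a query resolves against whichever schema Spark picks. A minimal plain-Scala sketch of the two outcomes (the object and method names here are illustrative, not Spark internals):

```scala
// Illustrative sketch only: the values mirror the repro below; the
// object and method names are not Spark internals.
object CaseSensitivitySketch {
  val metastoreSchema = Seq("thestring", "thenumber") // Hive lowercases identifiers
  val fileSchema      = Seq("theString", "theNumber") // Parquet footers keep the casing

  // What INFER_ONLY is documented to return (the file schema)...
  def expected: Seq[String] = fileSchema
  // ...and what is actually observed (the metastore schema).
  def observed: Seq[String] = metastoreSchema

  def main(args: Array[String]): Unit =
    observed.foreach(println) // thestring, thenumber
}
```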

      Expected behavior (according to SPARK-19611)

      INFER_ONLY - infer the schema from the underlying files

      INFER_AND_SAVE - infer the schema from the underlying files, save it to the metastore, and read it from the metastore on any subsequent calls
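      For reference, the mode under test can also be set programmatically when building the session; this sketch is equivalent to the --conf flag used in the spark-shell commands below:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: equivalent to passing --conf on the spark-shell command line.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY") // or INFER_AND_SAVE
  .enableHiveSupport() // needed so Hive metastore tables are visible
  .getOrCreate()
```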

      Reproduce

      Prepare the data

      1) Create a Parquet file

      scala> List(("a", 1), ("b", 2)).toDF("theString", "theNumber").write.parquet("hdfs:///t")


      2) Inspect the Parquet files

      $ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-00000-….snappy.parquet
      {"theString":"a","theNumber":1}
      $ hadoop jar parquet-tools-1.11.0.jar cat -j hdfs:///t/part-00001-….snappy.parquet
      {"theString":"b","theNumber":2}

      We see that they are saved with camelCase column names.

      3) Create a Hive table 

      hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
       > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
       > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
       > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
       > LOCATION 'hdfs:///t';


      Reproduce INFER_ONLY bug

      1) Read the table in Spark using INFER_ONLY

      $ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY
      scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
      thestring
      thenumber
      

      Conclusion

      When INFER_ONLY is set, column names are always lowercase: Spark falls back to the Hive metastore schema, which stores identifiers lowercased regardless of the casing used in the DDL.

      Reproduce INFER_AND_SAVE bug

      1) Run for the first time

      $ spark-shell --master local[*] --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE
      scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
      thestring
      thenumber

      We see that the column names are lowercase.

      2) Run for the second time

      scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
      theString
      theNumber

      We see that the column names are camelCase.

      Conclusion

      When INFER_AND_SAVE is set, column names are lowercase on the first call and camelCase on subsequent calls.
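      As a workaround, reading the Parquet files directly (bypassing the metastore table) always uses the case-sensitive schema from the file footers; a sketch, assuming an active spark session:

```scala
// Workaround sketch: spark.read.parquet takes the schema from the Parquet
// footers, so the camelCase column names are preserved on every call.
val df = spark.read.parquet("hdfs:///t")
df.columns.foreach(println) // theString, theNumber
```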


      People

        Assignee: Unassigned
        Reporter: Ori Popowski (oripwk)