SPARK-32558

ORC target files that Spark 3.0 produces do not work with Hive 2.1.1 (the workaround of using spark.sql.orc.impl=hive is also not working)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None
    • Environment: Spark 3.0 and a Hadoop cluster with Hive 2.1.1 (Linux Red Hat)

    Description

      Steps to reproduce the issue:

      ------------------------------- 

      Download Spark 3.0 from https://spark.apache.org/downloads.html

       

      Step 1) Create an ORC file using Spark 3.0's default native ORC writer from spark-shell.

      [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
      
      Welcome to Spark version 3.0.0
      
      Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
      
      Type in expressions to have them evaluated. Type :help for more information.
       scala> spark.sql("set spark.sql.orc.impl").show()
      
      +-------------------------+
      |               key| value| 
      +-------------------------+
      |spark.sql.orc.impl|native|
      +-------------------------+
       
      scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: org.apache.spark.sql.DataFrame = []
      scala> spark.sql("insert into df_table values('col1val1','col2val1')")
      org.apache.spark.sql.DataFrame = []
      
      scala> val dFrame = spark.sql("select * from df_table") dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
      scala> dFrame.show()
      
      +--------+--------+
      |    col1|    col2|
      +--------+--------+
      |col1val1|col2val1|
      +--------+--------+
      
      scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/ORC_File_Tgt1")
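
      Not in the original report, but as a sanity check: the file written above should read back cleanly with Spark's own (native) ORC reader, which would confirm the file itself is well-formed and the failure in Step 2 is specific to the older ORC reader shipped with Hive 2.1.1. A minimal sketch, run from the same spark-shell (the path is the one used above):

      // Read the Step 1 output back with Spark's bundled ORC reader.
      // If this shows the row written above, the file is readable by Spark itself.
      val checked = spark.read.orc("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/ORC_File_Tgt1")
      checked.show()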
      

       

      Step 2) Copy the ORC files created in Step 1 to HDFS /tmp on a Hadoop cluster that has Hive 2.1.1 (for example, CDH 6.x) and run the following command to read the metadata from the ORC file. As shown below, it fails to fetch the metadata.

      [adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/ORC_File_Tgt1/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
      Processing data file /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414]
      Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
      at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
      at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
      at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
      at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
      at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
      at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
      at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
      at org.apache.orc.tools.FileDump.main(FileDump.java:154)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
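
      The exception comes from OrcFile$WriterVersion.from, i.e. the ORC reader in Hive 2.1.1 does not recognize the writer version id recorded in the file tail (this is the forward-compatibility gap tracked in HIVE-16683, referenced below). A hedged way to see what Spark actually recorded, from the same spark-shell used in Step 1 (this assumes the orc-core jar bundled with Spark 3.0 is on the classpath; adjust the path to the part file actually produced):

      import org.apache.hadoop.fs.Path
      import org.apache.orc.OrcFile

      // Open the ORC file with the orc-core library bundled with Spark 3.0 and
      // print the versions recorded in its tail. The writer version is the value
      // that OrcFile$WriterVersion.from fails to resolve on Hive 2.1.1.
      val hadoopConf = spark.sparkContext.hadoopConfiguration
      val orcReader = OrcFile.createReader(
        new Path("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/ORC_File_Tgt1/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc"),
        OrcFile.readerOptions(hadoopConf))
      println(orcReader.getFileVersion)    // ORC file format version (e.g. 0.12)
      println(orcReader.getWriterVersion)  // writer version id unknown to Hive 2.1.1's lookup table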
      

      Step 3) Now create an ORC file using the Hive ORC implementation (as suggested by Spark in https://spark.apache.org/docs/latest/sql-migration-guide.html, by setting spark.sql.orc.impl to hive).

      scala> spark.sql("set spark.sql.orc.impl=hive")
      res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
      scala> spark.sql("set spark.sql.orc.impl").show()
      
      +------------------+-----+
      |               key|value|
      +------------------+-----+
      |spark.sql.orc.impl| hive|
      +------------------+-----+
      
      scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
      scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: org.apache.spark.sql.DataFrame = []
      scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
      scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/ORC_File_Tgt2")
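
      Not part of the original report: besides the SQL SET command above, the same setting can be applied through the session conf or at spark-shell launch, which may be convenient when scripting this reproduction. A minimal sketch:

      // Equivalent ways to select the Hive ORC implementation for the session.
      spark.conf.set("spark.sql.orc.impl", "hive")
      assert(spark.conf.get("spark.sql.orc.impl") == "hive")
      // Or pass it at launch: ./bin/spark-shell --conf spark.sql.orc.impl=hive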
      

       

      Step 4) Copy the ORC files created in Step 3 to HDFS /tmp on the same Hadoop cluster (Hive 2.1.1, for example CDH 6.x) and run the following command to read the metadata from the ORC file. As shown below, it fails with the same exception even after following the workaround suggested by Spark of setting spark.sql.orc.impl to hive.

      [adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/ORC_File_Tgt2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
      Processing data file /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414]
      Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
      at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
      at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
      at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
      at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
      at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
      at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
      at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
      at org.apache.orc.tools.FileDump.main(FileDump.java:154)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
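
      For context, this is exactly the failure mode described in HIVE-16683 (referenced below): older ORC readers resolve the numeric writer version id from the file postscript with a lookup that only covers the versions known when they were built, so an id written by a newer library lands out of range. An illustrative sketch of that pattern follows; it is not the actual Hive/ORC source, and the version names are abridged.

      // Illustrative only: mimics how a pre-HIVE-16683 reader maps the numeric
      // writer version id found in the ORC postscript to a known version.
      val knownWriterVersions = Array("ORIGINAL", "HIVE_8732", "HIVE_4243", "HIVE_12055", "HIVE_13083") // abridged

      def writerVersionFrom(id: Int): String =
        knownWriterVersions(id) // an id from a newer writer (e.g. 7) is out of bounds
                                // -> java.lang.ArrayIndexOutOfBoundsException, as in the dump above

      // A forward-compatible reader instead falls back to a placeholder for unknown ids:
      def writerVersionFromSafe(id: Int): String =
        if (id >= 0 && id < knownWriterVersions.length) knownWriterVersions(id) else "FUTURE"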
      

      Note: The same metadata fetch works fine with Hive 2.3 and later.

      So the main concern here is that setting spark.sql.orc.impl to hive does not produce ORC files that work with Hive 2.1.1 or older.
      Can someone help here? Is there another workaround available? Can this be looked at as a priority? Thank you.
       
      References:
      https://spark.apache.org/docs/latest/sql-migration-guide.html (the workaround of setting spark.sql.orc.impl=hive is mentioned here, but it does not work):

      "Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, spark.sql.orc.impl and spark.sql.orc.filterPushdown change their default values to native and true respectively. ORC files created by native ORC writer cannot be read by some old Apache Hive releases. Use spark.sql.orc.impl=hive to create the files shared with Hive 2.1.1 and older."

      https://issues.apache.org/jira/browse/SPARK-26932
      https://issues.apache.org/jira/browse/HIVE-16683
