SPARK-32558

ORC target files that Spark 3.0 produces do not work with Hive 2.1.1 (the workaround of using spark.sql.orc.impl=hive is also not working)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None
    • Environment: Spark 3.0 and a Hadoop cluster with Hive 2.1.1 (Linux Red Hat)

    Description

      Steps to reproduce the issue:

      ------------------------------- 

      Download Spark 3.0 from https://spark.apache.org/downloads.html

       

      Step 1) Create an ORC file using Spark 3.0's default native ORC writer from spark-shell.

      [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
      
      Welcome to Spark version 3.0.0
      
      Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
      
      Type in expressions to have them evaluated. Type :help for more information.
       scala> spark.sql("set spark.sql.orc.impl").show()
      
      +-------------------------+
      |               key| value| 
      +-------------------------+
      |spark.sql.orc.impl|native|
      +-------------------------+
       
      scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: org.apache.spark.sql.DataFrame = []
      scala> spark.sql("insert into df_table values('col1val1','col2val1')")
      org.apache.spark.sql.DataFrame = []
      
      scala> val dFrame = spark.sql("select * from df_table") dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
      scala> dFrame.show()
      
      +--------+--------+
      |    col1|    col2|
      +--------+--------+
      |col1val1|col2val1|
      +--------+--------+
      
      scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/ORC_File_Tgt1")
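
      Not in the original report, but as a sanity check: the file written above should read back cleanly with Spark's own (native) ORC reader, which would confirm the file itself is well-formed and the failure in Step 2 is specific to the older ORC reader shipped with Hive 2.1.1. A minimal sketch, run from the same spark-shell (the path is the one used above):

      // Read the Step 1 output back with Spark's bundled ORC reader.
      // If this shows the row written above, the file is readable by Spark itself.
      val checked = spark.read.orc("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/ORC_File_Tgt1")
      checked.show()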
      

       

      Step 2) Copy the ORC files created in Step 1 to HDFS /tmp on a Hadoop cluster that has Hive 2.1.1 (for example, CDH 6.x) and run the following command to read the metadata from the ORC file. As shown below, it fails to fetch the metadata.

      [adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/ORC_File_Tgt1/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
      Processing data file /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414]
      Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
      at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
      at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
      at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
      at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
      at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
      at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
      at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
      at org.apache.orc.tools.FileDump.main(FileDump.java:154)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
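
      The exception comes from OrcFile$WriterVersion.from, i.e. the ORC reader in Hive 2.1.1 does not recognize the writer version id recorded in the file tail (this is the forward-compatibility gap tracked in HIVE-16683, referenced below). A hedged way to see what Spark actually recorded, from the same spark-shell used in Step 1 (this assumes the orc-core jar bundled with Spark 3.0 is on the classpath; adjust the path to the part file actually produced):

      import org.apache.hadoop.fs.Path
      import org.apache.orc.OrcFile

      // Open the ORC file with the orc-core library bundled with Spark 3.0 and
      // print the versions recorded in its tail. The writer version is the value
      // that OrcFile$WriterVersion.from fails to resolve on Hive 2.1.1.
      val hadoopConf = spark.sparkContext.hadoopConfiguration
      val orcReader = OrcFile.createReader(
        new Path("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/ORC_File_Tgt1/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc"),
        OrcFile.readerOptions(hadoopConf))
      println(orcReader.getFileVersion)    // ORC file format version (e.g. 0.12)
      println(orcReader.getWriterVersion)  // writer version id unknown to Hive 2.1.1's lookup table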
      

      Step 3) Now create an ORC file using the Hive ORC implementation (as suggested by Spark in https://spark.apache.org/docs/latest/sql-migration-guide.html, by setting spark.sql.orc.impl to hive).

      scala> spark.sql("set spark.sql.orc.impl=hive")
      res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
      scala> spark.sql("set spark.sql.orc.impl").show()
      
      +------------------+-----+
      |               key|value|
      +------------------+-----+
      |spark.sql.orc.impl| hive|
      +------------------+-----+
      
      scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
      scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: org.apache.spark.sql.DataFrame = []
      scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
      scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/ORC_File_Tgt2")
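
      Not part of the original report: besides the SQL SET command above, the same setting can be applied through the session conf or at spark-shell launch, which may be convenient when scripting this reproduction. A minimal sketch:

      // Equivalent ways to select the Hive ORC implementation for the session.
      spark.conf.set("spark.sql.orc.impl", "hive")
      assert(spark.conf.get("spark.sql.orc.impl") == "hive")
      // Or pass it at launch: ./bin/spark-shell --conf spark.sql.orc.impl=hive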
      

       

      Step 4) Copy the ORC files created in Step 3 to HDFS /tmp on the same Hadoop cluster (Hive 2.1.1, for example CDH 6.x) and run the following command to read the metadata from the ORC file. As shown below, it fails with the same exception even after following the workaround suggested by Spark of setting spark.sql.orc.impl to hive.

      [adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/ORC_File_Tgt2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
      Processing data file /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414]
      Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
      at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
      at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
      at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
      at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
      at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
      at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
      at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
      at org.apache.orc.tools.FileDump.main(FileDump.java:154)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
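
      For context, this is exactly the failure mode described in HIVE-16683 (referenced below): older ORC readers resolve the numeric writer version id from the file postscript with a lookup that only covers the versions known when they were built, so an id written by a newer library lands out of range. An illustrative sketch of that pattern follows; it is not the actual Hive/ORC source, and the version names are abridged.

      // Illustrative only: mimics how a pre-HIVE-16683 reader maps the numeric
      // writer version id found in the ORC postscript to a known version.
      val knownWriterVersions = Array("ORIGINAL", "HIVE_8732", "HIVE_4243", "HIVE_12055", "HIVE_13083") // abridged

      def writerVersionFrom(id: Int): String =
        knownWriterVersions(id) // an id from a newer writer (e.g. 7) is out of bounds
                                // -> java.lang.ArrayIndexOutOfBoundsException, as in the dump above

      // A forward-compatible reader instead falls back to a placeholder for unknown ids:
      def writerVersionFromSafe(id: Int): String =
        if (id >= 0 && id < knownWriterVersions.length) knownWriterVersions(id) else "FUTURE"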
      

      Note: The same metadata fetch works fine with Hive 2.3 and later.

      So the main concern here is that setting spark.sql.orc.impl to hive does not produce ORC files that work with Hive 2.1.1 or older.
      Can someone help here? Is there another workaround available? Can this be looked at as a priority? Thank you.
       
      References:
      https://spark.apache.org/docs/latest/sql-migration-guide.html (the workaround of setting spark.sql.orc.impl=hive is mentioned here, but it does not work):

      "Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, spark.sql.orc.impl and spark.sql.orc.filterPushdown change their default values to native and true respectively. ORC files created by native ORC writer cannot be read by some old Apache Hive releases. Use spark.sql.orc.impl=hive to create the files shared with Hive 2.1.1 and older."

      https://issues.apache.org/jira/browse/SPARK-26932
      https://issues.apache.org/jira/browse/HIVE-16683
