Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
2.2.1
-
None
-
None
Description
There is date alteration when loading date from one table to another in hive through spark. This happens when Hive is on a remote machine with timezone different than the one on which Spark is running. This happens only when the Source table format is 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
Below are the steps to produce the issue:
1. Create two tables as below in hive which has a timezone, say in, EST
CREATE TABLE t_src( name varchar(10), dob timestamp ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0');
CREATE TABLE t_tgt( name varchar(10), dob timestamp );
2. Copy hive-site.xml to spark-2.2.1-bin-hadoop2.7/conf folder, so that when you create sqlContext for hive it connects to your remote hive server.
3. Start your spark-shell on some other machine whose timezone is different than that of Hive, say, PDT
4. Execute below code:
import org.apache.spark.sql.hive.HiveContext val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val q0 = "TRUNCATE table t_tgt" val q1 = "SELECT CAST(alias.name AS String) as a0, alias.dob as a1 FROM t_src alias" val q2 = "INSERT OVERWRITE TABLE t_tgt SELECT tbl0.a0 as c0, tbl0.a1 as c1 FROM tbl0" sqlContext.sql(q0) sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0") sqlContext.sql(q2)
5. Now navigate to hive and check the contents of the TARGET table (t_tgt). The dob field will have incorrect values.
Is this a known issue? Is there any work around on this? Can it be fixed?
Thanks & regards,
Pawan Lawale
Attachments
Issue Links
- duplicates
-
SPARK-25919 Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
- Closed