Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
2.1.0, 2.2.1
-
None
-
None
Description
Hi
I found a really strange issue. Below are the steps to reproduce it. This issue occurs only when the table row format is ParquetHiveSerDe and the target table is Partitioned
Hive:
Login in to hive terminal on cluster and create below tables.
create table t_src( name varchar(10), dob timestamp ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' create table t_tgt( name varchar(10), dob timestamp ) PARTITIONED BY (city varchar(10)) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
Insert data into the source table (t_src)
INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0');
Spark-shell:
Get on to spark-shell.
Execute below commands on spark shell:
import org.apache.spark.sql.hive.HiveContext val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val q0 = "TRUNCATE table t_tgt" val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM DEFAULT.t_src alias" val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as c0, tbl0.a1 as c1, NULL as c2 FROM tbl0" sqlContext.sql(q0) sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0") sqlContext.sql(q2)
After this check the contents of target table t_tgt. You will see the date "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". Below snippets shows the contents of both the tables:
select * from t_src; +-------------+------------------------+--+ | t_src.name | t_src.dob | +-------------+------------------------+--+ | p1 | 0001-01-01 00:00:00.0 | | p2 | 0002-01-01 00:00:00.0 | | p3 | 0003-01-01 00:00:00.0 | | p4 | 0004-01-01 00:00:00.0 | +-------------+------------------------+–+ select * from t_tgt; +-------------+------------------------+--+ | t_src.name | t_src.dob | t_tgt.city | +-------------+------------------------+--+ | p1 | 0002-01-01 00:00:00.0 |__HIVE_DEF | | p2 | 0002-01-01 00:00:00.0 |__HIVE_DEF | | p3 | 0003-01-01 00:00:00.0 |__HIVE_DEF | | p4 | 0004-01-01 00:00:00.0 |__HIVE_DEF | +-------------+------------------------+--+
Is this a known issue? Is it fixed in any subsequent releases?
Thanks & regards,
Pawan Lawale
Attachments
Issue Links
- is duplicated by
-
SPARK-25873 Date corruption when Spark and Hive both are on different timezones
- Resolved