[HIVE-26270] Wrong timestamps when reading Hive 3.1.x Parquet files with vectorized reader - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.0.0-alpha-2
Component/s: HiveServer2, Parquet
Labels:

Target Version/s: 4.0.0-alpha-2

Description

The Parquet files are written with Hive 3.1.x onwards, with the time zone set to US/Pacific.

CREATE TABLE employee (eid INT, birth timestamp) STORED AS PARQUET;

INSERT INTO employee VALUES 
(1, '1880-01-01 00:00:00'),
(2, '1884-01-01 00:00:00'),
(3, '1990-01-01 00:00:00');

The Parquet files are then read with Hive 4.0.0-alpha-1 onwards.

Without vectorization, the results are correct.

SELECT * FROM employee;

1	1880-01-01 00:00:00
2	1884-01-01 00:00:00
3	1990-01-01 00:00:00

With vectorization, some timestamps are shifted.

-- Disable fetch task conversion to force vectorization to kick in
set hive.fetch.task.conversion=none;
SELECT * FROM employee;

1	1879-12-31 23:52:58
2	1884-01-01 00:00:00
3	1990-01-01 00:00:00

The problem is the same as the one reported under HIVE-24074. The data were written using the new Date/Time APIs (java.time) in Hive 3.1.3, and here they are read using the old APIs (java.sql).
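
The 7 minute 2 second shift in the first row can be reproduced with plain java.time arithmetic. The sketch below is only an illustration, not Hive's reader code: the class and method names are hypothetical, and the fixed -08:00 offset is an assumption standing in for what the legacy (java.sql-based) read path effectively applies to US/Pacific timestamps before 1883.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical demo class; not part of Hive.
class TimestampShiftDemo {
    static final ZoneId WRITER_ZONE = ZoneId.of("US/Pacific");

    // Assumption: the legacy read path behaves as if the zone's standard
    // -08:00 offset applied to every instant, including those before 1883.
    static final ZoneOffset ASSUMED_LEGACY_OFFSET = ZoneOffset.ofHours(-8);

    /** Offset that java.time applies to a local date-time in the writer zone. */
    static ZoneOffset writerOffset(LocalDateTime local) {
        return WRITER_ZONE.getRules().getOffset(local);
    }

    /** Converts to an instant with the java.time offset, then reads it back with the assumed legacy offset. */
    static LocalDateTime roundTrip(LocalDateTime local) {
        Instant stored = local.toInstant(writerOffset(local));
        return LocalDateTime.ofInstant(stored, ASSUMED_LEGACY_OFFSET);
    }

    public static void main(String[] args) {
        for (int year : new int[] {1880, 1884, 1990}) {
            LocalDateTime birth = LocalDateTime.of(year, 1, 1, 0, 0);
            System.out.println(birth + " writerOffset=" + writerOffset(birth)
                    + " readBack=" + roundTrip(birth));
        }
    }
}
```

Under that assumption only the 1880 value moves, because java.time uses the local mean time offset of -07:52:58 for US/Pacific before the zone adopted standard time in late 1883, while the 1884 and 1990 values already use -08:00 on both sides.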

The difference from HIVE-24074 is that here the problem appears only with vectorized execution, while the non-vectorized reader works fine, so the two readers behave inconsistently.

The non-vectorized reader works fine because it automatically determines that it should use the new JDK APIs to read back the timestamp value. This is possible in this case because the file contains metadata (namely the writer.time.zone entry) from which the reader can infer that the timestamps were written using the new Date/Time APIs.

The inconsistent behavior between the vectorized and non-vectorized readers is a regression caused by HIVE-25104. This JIRA is an attempt to realign the behavior of the two readers.

Note that if the file metadata are empty, neither the vectorized nor the non-vectorized reader can determine which APIs to use for the conversion. In this case the user must set hive.parquet.timestamp.legacy.conversion.enabled explicitly to get correct results.
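
The decision described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual Hive implementation: the class and method names are hypothetical, the file metadata is modelled as a plain string map, and the only signal considered is the presence of the writer.time.zone key mentioned in this issue.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper; the real reader logic in Hive may consider more signals.
class LegacyConversionChoice {
    // Metadata key named in this issue.
    static final String WRITER_TIME_ZONE_KEY = "writer.time.zone";

    /**
     * Returns true when the old (java.sql-style) conversion should be used.
     *
     * @param fileMetadata     key/value metadata from the Parquet footer (may be empty or null)
     * @param configuredLegacy value of hive.parquet.timestamp.legacy.conversion.enabled
     */
    static boolean useLegacyConversion(Map<String, String> fileMetadata, boolean configuredLegacy) {
        if (fileMetadata != null && fileMetadata.containsKey(WRITER_TIME_ZONE_KEY)) {
            // The writer recorded its time zone, so the new Date/Time APIs were used.
            return false;
        }
        // Nothing to infer from the file: fall back to the user's setting.
        return configuredLegacy;
    }

    public static void main(String[] args) {
        Map<String, String> withZone = Collections.singletonMap(WRITER_TIME_ZONE_KEY, "US/Pacific");
        Map<String, String> empty = Collections.emptyMap();
        System.out.println(useLegacyConversion(withZone, true));
        System.out.println(useLegacyConversion(empty, true));
        System.out.println(useLegacyConversion(empty, false));
    }
}
```

The point of the sketch is that both readers should go through the same decision, so that a file carrying writer.time.zone is read identically with or without vectorization.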

Issue Links

is caused by: HIVE-25104 Backward incompatible timestamp serialization in Parquet for certain timezones (Closed)

relates to: HIVE-24074 Incorrect handling of timestamp in Parquet/Avro when written in certain time zones in versions before Hive 3.x (Closed)

links to: GitHub Pull Request #3338

People

Assignee: Stamatis Zampetakis

Reporter: Stamatis Zampetakis

Votes: 1

Watchers: 4

Dates

Created: 27/May/22 13:57

Updated: 16/Nov/22 13:50

Resolved: 03/Jun/22 08:30

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 20m