Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Not A Problem
- Affects Version: 0.15.1
- Fix Version: None
- Component: None
Description
Hi,
I've encountered what seems to me to be a bug using:
pyarrow==0.15.1 pandas==0.25.3 numpy==1.18.1
I'm trying to write a table containing nanosecond timestamps to a millisecond schema. Here is a minimal example:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

pyarrow_schema = pa.schema([pa.field("datetime_ms", pa.timestamp("ms"))])
timestamp = np.datetime64("2019-06-21T22:13:02.901123")
d = {"datetime_ms": timestamp}
df = pd.DataFrame(d, index=range(1))
table = pa.Table.from_pandas(df, schema=pyarrow_schema)
pq.write_table(
    table,
    "test.parquet",
    coerce_timestamps="ms",
    allow_truncated_timestamps=True,
)

This raises:

pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would lose data: 1561155182901123000', 'Conversion failed for column datetime_ms with type datetime64[ns]')
From my understanding, the expected behaviour should be that Arrow allows the conversion anyway, even if some data is lost, since allow_truncated_timestamps=True is set.
Related discussions:
This test https://github.com/apache/arrow/blob/f70dbd1dbdb51a47e6a8a8aac8efd40ccf4d44f2/python/pyarrow/tests/test_parquet.py#L846 does not explicitly check for nanosecond timestamps.
To be honest, I haven't looked at the code yet, so let me know if I missed something. I'd be happy to fix it if it really is a bug.