Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Not A Problem
- Affects Version: 0.15.1
- Fix Version: None
- Component: None
Description
Hi,
I've encountered what seems to me to be a bug using:
pyarrow==0.15.1 pandas==0.25.3 numpy==1.18.1
I'm trying to write a table containing nanosecond timestamps to a millisecond schema. Here is a minimal example:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

pyarrow_schema = pa.schema([pa.field("datetime_ms", pa.timestamp("ms"))])
timestamp = np.datetime64("2019-06-21T22:13:02.901123")
d = {"datetime_ms": timestamp}
df = pd.DataFrame(d, index=range(1))
table = pa.Table.from_pandas(df, schema=pyarrow_schema)
pq.write_table(
    table,
    "test.parquet",
    coerce_timestamps="ms",
    allow_truncated_timestamps=True,
)

This raises:

pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would lose data: 1561155182901123000', 'Conversion failed for column datetime_ms with type datetime64[ns]')
From my understanding, the expected behaviour should be that Arrow allows the conversion anyway, even if some data is lost, since allow_truncated_timestamps=True is set.
Related discussions:
This test https://github.com/apache/arrow/blob/f70dbd1dbdb51a47e6a8a8aac8efd40ccf4d44f2/python/pyarrow/tests/test_parquet.py#L846 does not explicitly check for nanosecond timestamps.
To be honest, I haven't looked at the code yet, so let me know if I missed something. I'd be happy to fix it if it really is a bug.