Apache Arrow / ARROW-10343

[C++] Unable to parse strings into timestamps

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 1.0.1
    • Fix Version/s: None
    • Component/s: C++, Python
    • Environment: macOS 10.15.7, Python 3.8.2

    Description

      Hi,

      I'm working with Parquet files generated by an AWS RDS Postgres snapshot export.

      I'm trying to parse a date column stored as a string into a timestamp, but it fails.

      I've managed to parse the same date format (as in the first example below) when reading from a CSV file, so I investigated as far as I could on my own. Here are my results:

      import pyarrow as pa
      import pytz
      
      #################################################################################
      ## the format I get from the database
      us_tz_arr = pa.array([
        "2014-12-07 07:48:59.285332+00",
        "2014-12-07 08:01:49.758975+00",
        "2014-12-07 10:11:35.884304+00"])
      
      us_tz_arr.cast(pa.timestamp('us', tz=pytz.UTC))
      # -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304+00
      
      #################################################################################
      ## tried removing the timezone
      us_arr = pa.array([
        "2014-12-07 07:48:59.285332",
        "2014-12-07 08:01:49.758975",
        "2014-12-07 10:11:35.884304"])
      
      us_arr.cast(pa.timestamp('us'))
      # -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304
      
      #################################################################################
      ## tried removing the microseconds but keeping the timezone
      second_tz_arr = pa.array([
        "2014-12-07 07:48:59+00",
        "2014-12-07 08:01:49+00",
        "2014-12-07 10:11:35+00"])
      
      second_tz_arr.cast(pa.timestamp('s', tz=pytz.UTC))
      # -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35+00
      
      #################################################################################
      ## removing microseconds and timezone, makes it work!
      s_arr = pa.array([
        "2014-12-07 07:48:59",
        "2014-12-07 08:01:49",
        "2014-12-07 10:11:35"])
      
      s_arr.cast(pa.timestamp('s'))
      # -> <pyarrow.lib.TimestampArray object at 0x7fbdf81ae460>
      # [
      #   2014-12-07 07:48:59,
      #   2014-12-07 08:01:49,
      #   2014-12-07 10:11:35
      # ]
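
A possible workaround while the cast rejects these strings is to parse them in Python first and hand pyarrow ready-made datetime objects. The following is only a minimal sketch, not a fix for the underlying parser. The helper names `parse_pg_timestamp` and `to_utc_timestamp_array` are hypothetical, and the sketch assumes that a string without an offset is already in UTC and that offsets appear either as hours only ("+00") or as four digits ("+0000").

```python
import re
from datetime import datetime, timezone

# Trailing hours-only UTC offset such as "+00" or "-05" (the Postgres text form).
_HOURS_ONLY_OFFSET = re.compile(r"[+-]\d{2}$")
# Trailing four-digit offset such as "+0000", the shortest form strptime's %z accepts.
_FULL_OFFSET = re.compile(r"[+-]\d{4}$")


def parse_pg_timestamp(text):
    """Parse 'YYYY-MM-DD HH:MM:SS[.ffffff][+HH]' into a timezone-aware UTC datetime."""
    if _HOURS_ONLY_OFFSET.search(text):
        text += "00"  # "+00" -> "+0000"

    fmt = "%Y-%m-%d %H:%M:%S"
    if "." in text:
        fmt += ".%f"  # optional fractional seconds (1 to 6 digits)

    if _FULL_OFFSET.search(text):
        # Offset present: parse it, then normalise the value to UTC.
        return datetime.strptime(text, fmt + "%z").astimezone(timezone.utc)

    # Assumption: strings without an offset are already expressed in UTC.
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def to_utc_timestamp_array(strings):
    """Build a microsecond-resolution UTC timestamp array from timestamp strings."""
    import pyarrow as pa  # imported lazily so the parser itself has no pyarrow dependency

    parsed = [parse_pg_timestamp(s) for s in strings]
    return pa.array(parsed, type=pa.timestamp("us", tz="UTC"))
```

Calling `to_utc_timestamp_array(us_tz_arr.to_pylist())` on the first array above should then yield a timestamp array instead of raising. Normalising every value to UTC before building the array keeps the stored instants correct even if a given pyarrow version ignores the `tzinfo` of the input datetimes.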

       PS. This is my first bug report, so apologies if important things are missing.

            People

              Assignee: Unassigned
              Reporter: Niclas Roos (niclas.roos)
              Votes: 0
              Watchers: 4