  1. Beam
  2. BEAM-1251 Python 3 Support
  3. BEAM-6154

Gcsio batch delete broken in Python 3

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.11.0
    • Component/s: sdk-py-core
    • Labels: None

    Description

      I'm running the Python SDK against GCP in Python 3.5 and got the following gcsio error while deleting files:

        File "/usr/local/lib/python3.5/site-packages/apache_beam/io/iobase.py", line 1077, in <genexpr>
          window.TimestampedValue(v, timestamp.MAX_TIMESTAMP) for v in outputs)
        File "/usr/local/lib/python3.5/site-packages/apache_beam/io/filebasedsink.py", line 315, in finalize_write
          num_threads)
        File "/usr/local/lib/python3.5/site-packages/apache_beam/internal/util.py", line 145, in run_using_threadpool
          return pool.map(fn_to_execute, inputs)
        File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 266, in map
          return self._map_async(func, iterable, mapstar, chunksize).get()
        File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 644, in get
          raise self._value
        File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 119, in worker
          result = (True, func(*args, **kwds))
        File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
          return list(map(*args))
        File "/usr/local/lib/python3.5/site-packages/apache_beam/io/filebasedsink.py", line 299, in _rename_batch
          FileSystems.rename(source_files, destination_files)
        File "/usr/local/lib/python3.5/site-packages/apache_beam/io/filesystems.py", line 252, in rename
          return filesystem.rename(source_file_names, destination_file_names)
        File "/usr/local/lib/python3.5/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 229, in rename
          copy_statuses = gcsio.GcsIO().copy_batch(batch)
        File "/usr/local/lib/python3.5/site-packages/apache_beam/io/gcp/gcsio.py", line 322, in copy_batch
          api_calls = batch_request.Execute(self.client._http)  # pylint: disable=protected-access
        File "/usr/local/lib/python3.5/site-packages/apitools/base/py/batch.py", line 222, in Execute
          batch_http_request.Execute(http)
        File "/usr/local/lib/python3.5/site-packages/apitools/base/py/batch.py", line 480, in Execute
          self._Execute(http)
        File "/usr/local/lib/python3.5/site-packages/apitools/base/py/batch.py", line 450, in _Execute
          mime_response = parser.parsestr(header + response.content)
      TypeError: Can't convert 'bytes' object to str implicitly
      

      After looking into the related code in the apitools library, I found that the response.content returned by the HTTP request to GCS is bytes, and apitools does not handle this case: it concatenates the str header with the bytes content, which raises a TypeError in Python 3. This can be a blocker for any pipeline that depends on gcsio, and it apparently blocks all Dataflow jobs in Python 3.
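
      The failure can be reproduced outside of Beam with the standard library alone. The sketch below is illustrative only and is not the actual apitools fix: the helper name parse_batch_response, the sample multipart body, and the UTF-8 decoding are all assumptions made for the example. It shows that concatenating a str header with a bytes body raises a TypeError in Python 3, and that decoding the body first lets the parser run.

```python
from email.parser import Parser


def parse_batch_response(header, content, encoding="utf-8"):
    """Hypothetical helper: parse a multipart batch body that may be bytes.

    The encoding is an assumption; a real fix would need to honor whatever
    charset the server actually declares.
    """
    if isinstance(content, bytes):
        # In Python 3 the HTTP body arrives as bytes, so decode before
        # concatenating it with the str header.
        content = content.decode(encoding)
    return Parser().parsestr(header + content)


# Made-up stand-ins for the header and response.content seen in batch.py.
header = 'content-type: multipart/mixed; boundary="b"\r\n\r\n'
body = (b'--b\r\n'
        b'Content-Type: application/http\r\n'
        b'\r\n'
        b'HTTP/1.1 200 OK\r\n'
        b'\r\n'
        b'--b--\r\n')

try:
    header + body  # what the failing line effectively does
except TypeError as exc:
    print("reproduced:", exc)

message = parse_batch_response(header, body)
print(message.is_multipart(), len(message.get_payload()))
```

      The exact wording of the TypeError depends on the Python 3 minor version, so it may not match the traceback above character for character.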

      This could be another reason to move off the apitools dependency, as proposed in BEAM-4850.

            People

              Assignee: Mark Liu (markflyhigh)
              Reporter: Mark Liu (markflyhigh)
              Votes: 0
              Watchers: 2

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h