Beam / BEAM-12101

Dataflow Jobs keep failing with FileNotFoundError: [Errno 2] Not found: gs://tmp.../beamapp..../tmp-27400e24c0c31bc1-00000-of-00001.avro

Details

    • Type: Bug
    • Status: Triage Needed
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: 2.28.0
    • Fix Version/s: Not applicable
    • Component/s: io-py-avro
    • Labels: None
    • Environment: Google Cloud Platform.
      Kicking off the job locally from WSL Ubuntu 20.0
      Python version 3.8.5
    • Flags: Important
    • downloader = GcsDownloader(
        File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 576, in __init__
          raise IOError(errno.ENOENT, 'Not found: %s' % self._path)
      FileNotFoundError: [Errno 2] Not found: gs://dataflow-tmp/tmp/beamapp-xct-0406085439-530796.1617699279.530912/tmp-fce2e27fbb2e8e1f-00000-of-00001.avro

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File
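
For triage, the escaped newline sequences in a log excerpt like the one above can be expanded and the missing object path pulled out with standard-library Python. This is only a minimal sketch: `unescape_log_text` and `missing_gcs_paths` are hypothetical helper names, and the regular expression assumes the "Not found: gs://..." message format seen in this traceback.

```python
import re

# Assumes the "Not found: gs://..." message format seen in the traceback.
_NOT_FOUND = re.compile(r"Not found: (gs://\S+)")


def unescape_log_text(text):
    """Turn literal backslash-n and backslash-quote sequences into real characters."""
    return text.replace("\\n", "\n").replace('\\"', '"')


def missing_gcs_paths(text):
    """Return every gs:// path reported as 'Not found', in order of appearance."""
    return _NOT_FOUND.findall(unescape_log_text(text))


# Sample text copied from the excerpt in this issue (escaped newlines kept literal).
sample = (
    "raise IOError(errno.ENOENT, 'Not found: %s' % self._path)\\n"
    "FileNotFoundError: [Errno 2] Not found: "
    "gs://dataflow-tmp/tmp/beamapp-xct-0406085439-530796.1617699279.530912/"
    "tmp-fce2e27fbb2e8e1f-00000-of-00001.avro\\n\\nDuring handling"
)
print(missing_gcs_paths(sample))
```

Unescaping first matters here: without it, the literal `\n` characters would be matched as part of the path.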

    Description

      I am processing up to 1,000 files (.......xml.gz).
      When I run a sample of 128, 256, or 512 files, it works, but not always.
      I have used between 8 and 512 workers. It seems that any time the job runs for longer than 30 minutes, it fails with a FileNotFoundError related to fastavro.

              import apache_beam as beam
              from apache_beam.io.gcp.bigquery_tools import RetryStrategy

              # Each job i (1-based) takes its own contiguous slice of the file names.
              start = (no_of_files * (i - 1)) // no_of_jobs
              end = (no_of_files * i) // no_of_jobs

              lines = (
                  p1
                  | "Get name" >> beam.Create(names[start:end])
                  | "Read from cloud" >> beam.ParDo(ReadGCS())
                  | "Parse into JSON" >> beam.ParDo(ParseXML())
                  | "Get Medline" >> beam.ParDo(GetMedline())
                  | "Build Json" >> beam.ParDo(JsonBuilder())
                  | "Write elements" >> beam.io.WriteToBigQuery(
                      table=table_ref,
                      create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                      write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                      schema="SCHEMA_AUTODETECT",
                      insert_retry_strategy=RetryStrategy.RETRY_ALWAYS,
                      ignore_insert_ids=True, validate=False,
                  )
              )
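
The index arithmetic that selects each job's share of `names` can be checked in isolation. The sketch below assumes `no_of_files` equals `len(names)` and that `i` runs from 1 to `no_of_jobs`; `job_slice` is a hypothetical helper and the file names are placeholders, not the real inputs.

```python
def job_slice(names, i, no_of_jobs):
    """Return the names handled by job i (1-based) out of no_of_jobs."""
    no_of_files = len(names)  # assumption: no_of_files is the length of names
    start = (no_of_files * (i - 1)) // no_of_jobs
    end = (no_of_files * i) // no_of_jobs
    return names[start:end]


# Placeholder file names for illustration only.
names = [f"file-{n:04d}.xml.gz" for n in range(10)]
slices = [job_slice(names, i, 3) for i in range(1, 4)]
print([len(s) for s in slices])  # → [3, 3, 4]
```

Because each slice ends where the next one starts, every file is assigned to exactly one job, and slice sizes differ by at most one.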
      

      Attachments

        1. downloaded-logs-20210406-112843.json
          36 kB
          Patrick Linnane

        Activity

          People

            Assignee: Unassigned
            Reporter: Patrick Linnane (xct)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated: