Details
Type: Bug
Status: Triage Needed
Priority: P3
Resolution: Unresolved
Affects Version: 2.28.0
Fix Version: None
Environment: Google Cloud Platform; job launched locally from WSL Ubuntu 20.04; Python 3.8.5
Importance: Important
Description
I am processing up to 1,000 files (…xml.gz). When I run a sample of 128, 256, or 512 files it works, but not always. I have used between 8 and 512 workers. It seems that whenever the job runs for longer than 30 minutes, it fails with a FileNotFoundError related to fastavro. The pipeline:
lines = (
    p1
    | "Get name" >> beam.Create(
        names[(no_of_files * (i - 1)) // no_of_jobs
              : (no_of_files * i) // no_of_jobs])
    | "Read from cloud" >> beam.ParDo(ReadGCS())
    | "Parse into JSON" >> beam.ParDo(ParseXML())
    | "Get Medline" >> beam.ParDo(GetMedline())
    | "Build Json" >> beam.ParDo(JsonBuilder())
    | "Write elements" >> beam.io.WriteToBigQuery(
        table=table_ref,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        schema="SCHEMA_AUTODETECT",
        insert_retry_strategy=RetryStrategy.RETRY_ALWAYS,
        ignore_insert_ids=True,
        validate=False,
    )
)
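For context, below is a minimal, self-contained sketch of the setup the snippet above assumes. The DoFn bodies, the shard variables (i, no_of_jobs, no_of_files), names, and table_ref are hypothetical stand-ins for illustration only; the real implementations are not included in this report.

import apache_beam as beam
from apache_beam.io.gcp.bigquery_tools import RetryStrategy
from apache_beam.options.pipeline_options import PipelineOptions


class ReadGCS(beam.DoFn):
    # Hypothetical placeholder: the real DoFn downloads and
    # decompresses one ...xml.gz object from GCS.
    def process(self, file_name):
        yield file_name


class ParseXML(beam.DoFn):
    # Hypothetical placeholder: the real DoFn parses the XML payload.
    def process(self, raw_xml):
        yield {"raw": raw_xml}


class GetMedline(beam.DoFn):
    # Hypothetical placeholder: the real DoFn extracts the Medline records.
    def process(self, doc):
        yield doc


class JsonBuilder(beam.DoFn):
    # Hypothetical placeholder: the real DoFn builds the BigQuery row dicts.
    def process(self, record):
        yield record


# Hypothetical values; the report splits up to ~1,000 file names
# across several jobs, each taking one slice of the list.
names = ["gs://my-bucket/sample0001.xml.gz"]
no_of_files = len(names)
no_of_jobs = 1   # number of shards the file list is split into
i = 1            # 1-based index of this shard
table_ref = "my-project:my_dataset.my_table"

with beam.Pipeline(options=PipelineOptions()) as p1:
    lines = (
        p1
        | "Get name" >> beam.Create(
            names[(no_of_files * (i - 1)) // no_of_jobs
                  : (no_of_files * i) // no_of_jobs])
        | "Read from cloud" >> beam.ParDo(ReadGCS())
        | "Parse into JSON" >> beam.ParDo(ParseXML())
        | "Get Medline" >> beam.ParDo(GetMedline())
        | "Build Json" >> beam.ParDo(JsonBuilder())
        | "Write elements" >> beam.io.WriteToBigQuery(
            table=table_ref,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            schema="SCHEMA_AUTODETECT",
            insert_retry_strategy=RetryStrategy.RETRY_ALWAYS,
            ignore_insert_ids=True,
            validate=False,
        )
    )

Running the sketch as-is requires GCP credentials and a writable BigQuery table; it is only meant to show the shape of the setup that reproduces the failure.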