Beam / BEAM-14383

Improve "FailedRows" errors returned by beam.io.WriteToBigQuery

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 2.39.0
    • Component/s: io-py-gcp
    • Labels: None

    Description

      A `WriteToBigQuery` pipeline returns `errors` when it tries to insert rows that do not match the BigQuery table schema. `errors` is a dictionary that contains one `FailedRows` key. `FailedRows` is a list of tuples, where each tuple has two elements: the BigQuery table name and the row that didn't match the schema.

      This can be verified by running the `BigQueryIO deadletter pattern` described at https://beam.apache.org/documentation/patterns/bigqueryio/

      Using this approach, I can print the failed rows in a pipeline. When the job runs, the logger simultaneously prints the reason why the rows were invalid. That reason should also be included in each tuple, alongside the BigQuery table name and the raw row, so that a downstream pipeline step can process both the invalid row and the reason it was rejected.
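      The proposed change to the tuple shape can be illustrated with plain tuples. Below is a minimal sketch using stand-in data in place of a real `FailedRows` output; the table name, row contents, reason string, and the `to_deadletter_record` helper are all hypothetical, not part of the Beam API:

```python
# Shape of one element of errors['FailedRows'] today: (table, row).
# The proposal is to extend it to (table, row, reason) so a downstream
# step can route or repair rows based on why the insert failed.

# Hypothetical failed-row tuples, standing in for a real FailedRows output.
failed_row_today = (
    "my-project:dataset.quotes",
    {"symbol": "GOOG", "price": "not-a-number"},
)

failed_row_proposed = (
    "my-project:dataset.quotes",
    {"symbol": "GOOG", "price": "not-a-number"},
    "Could not convert value to numeric field 'price'.",  # hypothetical reason text
)

def to_deadletter_record(failed):
    """Unpack a failed-row tuple into a record a follow-up pipeline could process."""
    if len(failed) == 3:   # proposed shape: reason is included
        table, row, reason = failed
    else:                  # current shape: reason is only visible in the worker logs
        table, row = failed
        reason = None
    return {"table": table, "row": row, "reason": reason}

print(to_deadletter_record(failed_row_today)["reason"])     # → None
print(to_deadletter_record(failed_row_proposed)["reason"])
```

      With the current two-element shape, the handler has no reason to act on; with the proposed three-element shape, the same handler can forward the reason to wherever the deadletter rows are stored.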

      During my research I found a couple of alternative solutions, but I think they are more complex than they need to be. That is why I explored the Beam source code and found this to be an easy and simple change.


              People

                Assignee: Unassigned
                Reporter: Oskar Firlej
                Votes: 0
                Watchers: 1


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5h 10m