  Beam / BEAM-7812

Work around Stackdriver error reporting double counting worker errors

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P3
    • Resolution: Fixed
    • None
    • Not applicable
    • Component/s: runner-dataflow
    • None

    Description

      Objective

      Work around Stackdriver Error Reporting so that worker errors are counted only once, even though they are logged twice.

      This applies only to the Dataflow runner workers in the SDK.

      Background

      Stackdriver Error Reporting double counts worker errors logged to Stackdriver, because:

      1. workers log errors to Stackdriver;
      2. workers report the same errors to dfe, and dfe logs them to Stackdriver again.

      The double counting blocks us from sending job message logs from dfe to Stackdriver, because we don't want to change the behavior of any existing log or feature.

      There happens to be an inconsistency between how the Java batch worker (DataflowWorkerLoggingHandler) and the Java streaming worker (StreamingDataflowWorker) report errors to dfe. As a result, errors reported by the streaming Java worker end up being ignored by Stackdriver Error Reporting.

      Details

      Inspired by this inconsistency, we decided to apply the streaming Java worker's error reporting logic to batch as well. This both fixes the inconsistency and works around the double counting issue in Stackdriver Error Reporting.

      The change affects how workers report errors to dfe:

      • For Java, construct the stack trace text from the StackTrace object instead of using printStackTrace;
      • For Python, report the complete error message, with exactly the same details as the worker logging, instead of reporting only the traceback obtained through the traceback module.
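
      The Java bullet above can be illustrated with a minimal sketch. It is not the actual worker code: the class name, method names, indentation, and the maxDepth parameter are hypothetical and chosen only for illustration. It contrasts the text produced by printStackTrace with text assembled by hand from the throwable's StackTraceElement objects. The hand-built text has no "\tat " frame prefix, which is presumably why Stackdriver Error Reporting does not treat it as a new error occurrence; that mechanism is an assumption and is not stated in this issue.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

final class StackTraceFormatSketch {

  // Standard JDK rendering: every frame is printed as "\tat <frame>".
  static String viaPrintStackTrace(Throwable t) {
    StringWriter out = new StringWriter();
    t.printStackTrace(new PrintWriter(out, true));
    return out.toString();
  }

  // Hypothetical helper: assembles the text from StackTraceElement objects,
  // following the cause chain for at most maxDepth throwables.
  static String viaStackTraceElements(Throwable t, int maxDepth) {
    StringBuilder sb = new StringBuilder();
    Throwable cur = t;
    int depth = 0;
    while (cur != null && depth < maxDepth) {
      if (depth > 0) {
        sb.append("\nCaused by: ");
      }
      sb.append(cur); // class name plus message
      for (StackTraceElement frame : cur.getStackTrace()) {
        // Illustrative indentation; note there is no "at " prefix.
        sb.append("\n        ").append(frame);
      }
      cur = cur.getCause();
      depth++;
    }
    if (cur != null) {
      sb.append("\nStack trace truncated...");
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Throwable error =
        new IllegalStateException("boom", new RuntimeException("root cause"));
    System.out.println(viaPrintStackTrace(error));
    System.out.println(viaStackTraceElements(error, 10));
  }
}
```

      Running main prints both renderings of the same exception, so the difference in frame layout can be compared directly against what appears in the logging and error reporting consoles.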

      Users will not see any change, since job message logging to Stackdriver hasn’t been launched yet.

      Test Plan

      We'll add unit tests for the public methods changed in the process.

      Google has internal integration tests for which we can push worker harness images and set the worker harness container image, so the change can be tested in a sandbox.

      During a release, integration tests also run at the different release stages.

      The workaround needs to be released completely before we can enable job message logging.

      We can verify the format of stack traces in the sandbox and release stages by executing example pipelines in our projects and directly browsing the prod Stackdriver Logging and Error Reporting consoles. This should be done both before and after enabling job message logging.

      Run any other existing and required tests before sending the PR.

          People

            ningk Ning
            ningk Ning

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 0.5h