Objective

Work around Stackdriver Error Reporting so that worker errors are counted only once when they are logged twice.

This applies only to Dataflow runner workers in the SDK.

Background

Stackdriver Error Reporting will double count worker errors logged to Stackdriver, because:

workers log errors to Stackdriver;
workers report the same errors to dfe, and dfe logs them again to Stackdriver.

The double counting is blocking us from sending job message logs from dfe to Stackdriver, because we don't want to change the behavior of any existing log or feature.

There happens to be an inconsistency between how the Java batch worker (DataflowWorkerLoggingHandler) and the Java streaming worker (StreamingDataflowWorker) report errors to dfe. As a result, errors reported by the streaming Java worker are eventually ignored by Stackdriver Error Reporting.

Details

Inspired by this inconsistency, we decided to apply the streaming Java worker's error reporting logic to batch as well, which both fixes the inconsistency and works around the double counting issue in Stackdriver Error Reporting.

The change affects how workers report errors to dfe:

For Java, construct the stack trace from the StackTrace object instead of using printStackTrace;
For Python, report the complete error message details, exactly as they are sent to worker logging, instead of reporting only the traceback from the traceback module.
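
As a rough illustration of the Python side of this change, the sketch below formats an exception once and sends the identical text both to the worker log and to the error report. The names format_error_details, log_and_report, and FakeReporter are hypothetical and not part of the SDK; the real worker code and the channel to dfe may look quite different.

```python
import logging
import traceback


def format_error_details(exc: BaseException) -> str:
    """Return the complete error text: traceback plus exception type and message."""
    return "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )


class FakeReporter:
    """Hypothetical stand-in for whatever channel reports errors to dfe."""

    def __init__(self) -> None:
        self.reported = []

    def report(self, message: str) -> None:
        self.reported.append(message)


def log_and_report(exc: BaseException, logger: logging.Logger, reporter: FakeReporter) -> str:
    """Log and report the same formatted error text (assumed behavior, not SDK code)."""
    details = format_error_details(exc)
    # Pass the text as an argument so any '%' characters in it are not interpreted.
    logger.error("%s", details)
    reporter.report(details)
    return details
```

Formatting the error in one place and reusing the resulting string is only one way to keep the logged and reported text identical; it is shown here because it makes the equality easy to check in a unit test.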

Users will not see any change, since job message logging to Stackdriver hasn’t been launched yet.

Test Plan

We'll add unit tests for the public methods changed in the process.

Google has internal integration tests in which we can push worker harness images and set the worker harness container image to test in a sandbox.

Integration tests also run at the different stages of the release.

The workaround needs to be released completely before we can enable job message logging.

We can verify the format of stack traces in the sandbox and release stages by executing example pipelines in our projects and directly browsing the prod Stackdriver Logging and Error Reporting consoles. This should be done both before and after enabling job message logging.

Run any other existing and required tests before sending the PR.

People

Assignee: Ning

Reporter: Ning

Votes: 0

Watchers: 1

Dates

Created: 24/Jul/19 22:32

Updated: 24/Jul/20 19:57

Resolved: 05/Aug/19 17:41

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 0.5h