  1. Beam
  2. BEAM-14284

Server-side Dataflow job idempotence

Details

    • Improvement
    • Status: Open
    • P2
    • Resolution: Unresolved
    • runner-dataflow

    Description

      Issue: a retried job submission may create duplicate Dataflow jobs. The Dataflow job name is only guaranteed to be unique among active jobs – that is, if a job with the same name exists but has already completed, the same name can be used again. What we would like is job uniqueness regardless of job status.

      The Dataflow API provides a way to ensure unique jobs through the use of clientRequestId:

      "The client's unique identifier of the job, re-used across retried
      attempts. If this field is set, the service will ensure its uniqueness.
      The request to create a job will fail if the service has knowledge of a
      previously submitted job with the same client's ID and job name. The
      caller may use this field to ensure idempotence of job creation across
      retried attempts to create a job. By default, the field is empty and,
      in that case, the service ignores it."

      https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs

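      In DataflowRunner.java, clientRequestId is currently set to a randomized value on every submission. A retried submission therefore carries a different ID, so the service cannot recognize it as a duplicate of the earlier attempt.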

      Proposed solution: provide the ability to pass in a clientRequestId through DataflowPipelineOptions and set it on the Job when available, otherwise default to the randomized value.
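
      A minimal sketch of the fallback logic described above, for illustration only. The class ClientRequestIds and its resolve method are hypothetical names, not part of Beam; the sketch assumes DataflowPipelineOptions would gain a nullable string option for the ID, and that DataflowRunner would pass the resolved value to the Job's setClientRequestId instead of always generating a random one.

```java
import java.util.UUID;

// Hypothetical helper, not part of the Beam codebase.
final class ClientRequestIds {
  private ClientRequestIds() {}

  /**
   * Returns the caller-supplied ID when it is non-null and non-blank,
   * otherwise falls back to a random UUID string (the behavior the
   * issue describes as the current default).
   */
  static String resolve(String userSupplied) {
    if (userSupplied != null && !userSupplied.trim().isEmpty()) {
      return userSupplied;
    }
    return UUID.randomUUID().toString();
  }
}
```

      With this shape, a caller that retries a submission would supply the same ID on each attempt, so the service could reject the duplicate; callers that supply nothing would keep today's behavior.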

          People

            Unassigned
            toltol tol