Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-10762

Beam Python on Flink fails when no artifacts staged

Details

    Description

      When a Beam job with no artifacts staged is deployed on a Beam-on-Flink cluster (without the JobServer), it crashes the Python beam worker pool and does not recover.  This causes that job (and subsequent jobs that would have used that task manager slot) to hang and fail.  Strangely, if a Beam job with artifacts staged is run on that Beam worker pool container instance (i.e. using that task manager slot), subsequent jobs without artifacts staged that end up using that task manager slot run properly and succeed.

      When this happens, the error from the worker pool container looks like this:

      2020-08-19 01:58:12.287 PDT
      2020/08/19 08:58:12 Initializing python harness: /opt/apache/beam/boot --id=1-1 --logging_endpoint=localhost:38839 --artifact_endpoint=localhost:43187 --provision_endpoint=localhost:37259 --control_endpoint=localhost:34931
      2020-08-19 01:58:12.305 PDT
      2020/08/19 08:58:12 Failed to retrieve staged files: failed to get manifest
      2020-08-19 01:58:12.305 PDT
      caused by:
      2020-08-19 01:58:12.305 PDT
      rpc error: code = Unimplemented desc = Method not found: org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
      

      To reproduce this, it is sufficient to run the wordcount example modified to set the PipelineOptions flag `save_main_session=False` with Beam 2.23.

      python -m apache_beam.examples.wordcount --runner=FlinkRunner --flink_master=$FLINK_MASTER_HOST:8081 --flink_submit_uber_jar --environment_type=EXTERNAL --environment_config=localhost:50000 --input gs://dataflow-samples/shakespeare/kinglear.txt --output gs://my/bucket/location
      

      This was tested in a Beam-on-Flink cluster deployed on Kubernetes (https://github.com/GoogleCloudPlatform/flink-on-k8s-operator) with a modified version of this patch to use Beam 2.23 (https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/pull/301, i.e. with instances of "2.22" replaced with "2.23").

      Attachments

        Issue Links

          Activity

            People

              ibzib Kyle Weaver
              ccy Charles Chen
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 50m
                  50m