Beam / BEAM-12875

File systems are not registered when ArtifactRetrievalService is created by Spark runner

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: 2.32.0
    • Fix Version/s: 2.35.0
    • Component/s: runner-spark
    • Labels: None

    Description

      I am new to this codebase, so apologies if I have misunderstood anything. From what I can tell, when SparkExecutableStageFunction is called and the job bundle factory's environment cache is cold, an ArtifactRetrievalService is created for the worker harness to call.

      The issue is that FileSystems.setDefaultPipelineOptions is not called before this, so no file systems are registered. If artifacts are staged in cloud storage such as S3, the ArtifactRetrievalService cannot retrieve them and throws an exception:
       java.lang.IllegalArgumentException: No filesystem found for scheme s3

      This doesn't affect other runners such as the Flink runner, because the Flink runner calls FileSystems.setDefaultPipelineOptions in its executable stage function.
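To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not Beam code: SchemeRegistrySketch, register, and lookup are hypothetical names that stand in for a scheme-keyed file system registry, and the file system names are placeholder strings. The assumption is that a cloud scheme only becomes resolvable after a registration step has run, which is the role the description attributes to FileSystems.setDefaultPipelineOptions.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical model of a scheme-keyed file system registry.
 * This is an illustration only, not Beam's implementation.
 */
class SchemeRegistrySketch {
    // Maps a URI scheme to a placeholder file system name.
    private final Map<String, String> fileSystemsByScheme = new HashMap<>();

    SchemeRegistrySketch() {
        // Assumption: only the local scheme is known before registration.
        fileSystemsByScheme.put("file", "LocalFileSystem");
    }

    /** Stands in for the registration step that the report says is skipped. */
    void register(String scheme, String fileSystemName) {
        fileSystemsByScheme.put(scheme, fileSystemName);
    }

    /** Resolves the file system for a URI, failing if its scheme is unknown. */
    String lookup(String uri) {
        int separator = uri.indexOf("://");
        String scheme = separator < 0 ? "file" : uri.substring(0, separator);
        String fileSystemName = fileSystemsByScheme.get(scheme);
        if (fileSystemName == null) {
            throw new IllegalArgumentException("No filesystem found for scheme " + scheme);
        }
        return fileSystemName;
    }

    public static void main(String[] args) {
        String artifact = "s3://example-bucket/staged/artifact.jar";

        // Cold path: retrieval is attempted before any registration.
        SchemeRegistrySketch cold = new SchemeRegistrySketch();
        try {
            cold.lookup(artifact);
        } catch (IllegalArgumentException e) {
            System.out.println("Without registration: " + e.getMessage());
        }

        // Fixed path: registration happens first, then retrieval.
        SchemeRegistrySketch fixed = new SchemeRegistrySketch();
        fixed.register("s3", "S3FileSystem");
        System.out.println("With registration: " + fixed.lookup(artifact));
    }
}
```

In this model the fix is only a matter of ordering: the registration step has to run before the first lookup, which is what the description suggests the Spark executable stage function should do before the ArtifactRetrievalService is created.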

            People

              Assignee: Unassigned
              Reporter: Rogan Morrow (roganmorrow)
              Votes: 0
              Watchers: 1

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 20m