Beam / BEAM-11329

HDFS not deduplicating identical configuration paths.

Details

    • Type: Bug
    • Status: Triage Needed
    • Priority: P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.28.0
    • Component/s: None

    Description

      Originally reported by Yuhong on the dev list: https://lists.apache.org/thread.html/r6a61c94e6d14aa9e8b56ff4919c0bea17fceada446d1193d19fd9ed2%40%3Cdev.beam.apache.org%3E

      Caused by: java.lang.IllegalArgumentException: The HadoopFileSystemRegistrar currently only supports at most a single Hadoop configuration.
      at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:141) ~[beam-vendor-guava-26_0-jre-0.1.jar:?]
      at org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar.fromOptions(HadoopFileSystemRegistrar.java:60) ~[beam-sdks-java-io-hadoop-file-system-3.2250.5.jar:?]
      at org.apache.beam.sdk.io.FileSystems.verifySchemesAreUnique(FileSystems.java:496) ~[beam-sdks-java-core-3.2250.5.jar:?]
      at org.apache.beam.sdk.io.FileSystems.setDefaultPipelineOptions(FileSystems.java:486) ~[beam-sdks-java-core-3.2250.5.jar:?]
      at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:47) ~[beam-sdks-java-core-3.2250.5.jar:?]
      at org.apache.beam.sdk.Pipeline.create(Pipeline.java:149) ~[beam-sdks-java-core-3.2250.5.jar:?]
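The exception originates from a Preconditions.checkArgument call in HadoopFileSystemRegistrar.fromOptions. The sketch below illustrates that kind of guard; the class and method names are hypothetical and this is not Beam's actual code:

```java
import java.util.Arrays;
import java.util.List;

public class RegistrarCheck {
    // Illustrative version of the precondition that produced the exception above:
    // the registrar accepts at most one Hadoop configuration, and two identical
    // entries in the list still count as two.
    static void checkSingleConfiguration(List<String> configurations) {
        if (configurations.size() > 1) {
            throw new IllegalArgumentException(
                "The HadoopFileSystemRegistrar currently only supports at most a single Hadoop configuration.");
        }
    }

    public static void main(String[] args) {
        // Duplicate entries, as in the bug report, trigger the rejection.
        List<String> configs = Arrays.asList("core-site.xml", "core-site.xml");
        try {
            checkSingleConfiguration(configs);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```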

      I tried to debug by printing the configuration list:

      List<Configuration> configurations =
          pipelineOpts.as(HadoopFileSystemOptions.class).getHdfsConfiguration();
      LOG.info("print hdfsConfiguration for testing: " + configurations);

      2020-11-19 18:02:26.289 [main] HelloBeam [INFO] print hdfsConfiguration for testing:
      [Configuration: /export/content/lid/apps/samza-yarn-nodemanager/1d5c39c31bb33e3dd8e8149168167870328a014b/genConfig/core-site.xml,
      Configuration: /export/content/lid/apps/samza-yarn-nodemanager/1d5c39c31bb33e3dd8e8149168167870328a014b/genConfig/core-site.xml]

      As the log shows, the hdfsConfiguration list contains two identical elements, which caused the error.
      The configurations are generated from HADOOP_CONF_DIR and YARN_CONF_DIR. The class uses a Set to
      deduplicate the directories, but in my test environment the two variables are:

      HADOOP_CONF_DIR=/export/content/lid/apps/samza-yarn-nodemanager/1d5c39c31bb33e3dd8e8149168167870328a014b/bin/../genConfig/
      YARN_CONF_DIR=/export/content/lid/apps/samza-yarn-nodemanager/1d5c39c31bb33e3dd8e8149168167870328a014b/bin/../genConfig

      HADOOP_CONF_DIR has a trailing '/', so the two directories compare as different strings and the same configuration is added twice.
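One way to make the deduplication robust is to normalize each directory before adding it to the set, so a trailing '/' or a 'bin/..' segment no longer defeats the string comparison. A minimal sketch, assuming normalization via java.nio.file.Path (the class and method names are hypothetical, not Beam's implementation):

```java
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

public class ConfDirDedup {
    // Normalize a directory path: resolve "bin/.." segments and drop any
    // trailing slash, so equivalent paths compare equal as strings.
    static String normalize(String dir) {
        return Paths.get(dir).normalize().toString();
    }

    // Collect unique configuration directories, e.g. from HADOOP_CONF_DIR
    // and YARN_CONF_DIR, deduplicating after normalization.
    static Set<String> uniqueConfDirs(String... dirs) {
        Set<String> unique = new LinkedHashSet<>();
        for (String dir : dirs) {
            if (dir != null && !dir.isEmpty()) {
                unique.add(normalize(dir));
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // The two values from the bug report differ only by a trailing '/'
        // (shortened illustrative paths); both normalize to the same entry.
        Set<String> dirs = uniqueConfDirs(
            "/apps/nm/bin/../genConfig/",
            "/apps/nm/bin/../genConfig");
        System.out.println(dirs);
    }
}
```

With this normalization, the registrar would see a single configuration directory and the precondition would pass.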

People

  Yuhong Cheng (YC)
  Kyle Weaver (ibzib)
  Votes: 0
  Watchers: 1


Time Tracking

  Estimated: Not Specified
  Remaining: 0h
  Logged: 0.5h