Spark / SPARK-46018

Driver fails to start properly in standalone cluster deploy mode


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.3
    • Fix Version/s: None
    • Component/s: Deploy, Spark Submit
    • Labels: None

    Description

      Without the `SPARK_LOCAL_HOSTNAME` property set in spark-env.sh, the driver is elected and started normally.

      The submitted command is:

      bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://XXX:7077 --deploy-mode cluster /opt/module/spark3.1/examples/jars/spark-examples_2.12-3.1.3.jar 1000

      In that case, the command to start the driver uses the real IP of the worker:

      "spark://Worker@169.XXX.XXX.211:7078" 

      The driver can run on any worker.

       

      After adding the `SPARK_LOCAL_HOSTNAME` property to each worker's spark-env.sh, the driver can no longer run on any worker other than the one from which the command was submitted.
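      For reference, the change amounts to adding a line like the following to conf/spark-env.sh on each worker (the hostname value here is illustrative, not from the issue):

      ```shell
      # conf/spark-env.sh on each worker
      # Forces Spark to advertise/bind using this hostname instead of the
      # resolved local IP (hostname value is a placeholder).
      SPARK_LOCAL_HOSTNAME=nodeB
      ```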

      The command to start the driver uses the hostname of the worker:

      "spark://Worker@hostname:7078" 

      The error message is:

      Launch Command: "/opt/module/jdk1.8.0_371/bin/java" "-cp" "/opt/module/spark3.1/conf/:/opt/module/spark3.1/jars/*:/opt/module/hadoop-3.3.0/etc/hadoop/" "-Xmx6144M" "-Dspark.eventLog.enabled=true" "-Dspark.driver.cores=4" "-Dspark.jars=file:/opt/module/spark3.1/examples/jars/spark-examples_2.12-3.1.3.jar" "-Dspark.submit.deployMode=cluster" "-Dspark.sql.shuffle.partitions=60" "-Dspark.master=spark://nodeA:7077" "-Dspark.executor.cores=4" "-Dspark.driver.supervise=false" "-Dspark.app.name=org.apache.spark.examples.SparkPi" "-Dspark.driver.memory=6g" "-Dspark.eventLog.compress=true" "-Dspark.executor.memory=8g" "-Dspark.submit.pyFiles=" "-Dspark.eventLog.dir=hdfs://nodeA:8020/sparklog/" "-Dspark.rpc.askTimeout=10s" "-Dspark.default.parallelism=60" "-Dspark.history.fs.cleaner.enabled=false" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@nodeB:7078" "/opt/module/spark3.1/work/driver-20231121105901-0000/spark-examples_2.12-3.1.3.jar" "org.apache.spark.examples.SparkPi" "1000"
      ========================================
      
      23/11/21 10:59:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      23/11/21 10:59:02 INFO SecurityManager: Changing view acls to: jaken
      23/11/21 10:59:02 INFO SecurityManager: Changing modify acls to: jaken
      23/11/21 10:59:02 INFO SecurityManager: Changing view acls groups to: 
      23/11/21 10:59:02 INFO SecurityManager: Changing modify acls groups to: 
      23/11/21 10:59:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jaken); groups with view permissions: Set(); users  with modify permissions: Set(jaken); groups with modify permissions: Set()
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      23/11/21 10:59:03 WARN Utils: Service 'Driver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
      Exception in thread "main" java.net.BindException: Cannot assign requested address: Service 'Driver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'Driver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
      	at sun.nio.ch.Net.bind0(Native Method)
      	at sun.nio.ch.Net.bind(Net.java:438)
      	at sun.nio.ch.Net.bind(Net.java:430)
      	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:225)
      	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
      	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
      	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
      	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
      	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
      	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
      	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
      	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
      	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
      	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
      	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
      	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
      	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
      	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
      	at java.lang.Thread.run(Thread.java:750) 

      I ruled out the driver port as the cause, because the same error was reported no matter how the port was set. I used the `SPARK_LOCAL_HOSTNAME` property to avoid the "Locality Level is ANY" problem; see [ SPARK-10149 | https://issues.apache.org/jira/browse/SPARK-10149 ]
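      The BindException above matches the generic failure mode in which a process tries to bind a listening socket to an address that is not assigned to any local interface, which is what happens if the advertised hostname resolves to a different machine's IP on the worker that launches the driver. A minimal sketch of that failure, independent of Spark (the TEST-NET address below is illustrative, chosen because it is guaranteed not to be local):

      ```python
      import socket

      # Binding to an address that belongs to no local interface raises
      # OSError with EADDRNOTAVAIL ("Cannot assign requested address") --
      # the same low-level error the 'Driver' service hits on each retry.
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      try:
          # 192.0.2.1 is in the TEST-NET-1 range (RFC 5737), never local.
          s.bind(("192.0.2.1", 0))
      except OSError as e:
          print("bind failed:", e.strerror)
      finally:
          s.close()
      ```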

          People

            Assignee: Unassigned
            Reporter: xiejiankun
