ZEPPELIN-3986

Cannot access any JAR in yarn cluster mode


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.1, 0.8.2, 0.9.0
    • Fix Version/s: 0.8.2, 0.9.0
    • Component/s: Interpreters
    • Labels: None
    • Environment:

      Cloudera/CDH 6.1

      Spark 2.4

      Hadoop 3.0

      Zeppelin 0.8.2 (built from the latest merged pull request)

    Description

      Hello,

      YARN cluster mode was introduced in `0.8.0`, and the problem of not finding ZeppelinContext was fixed in `0.8.1`. However, I am having difficulty accessing any JAR in order to `import` it inside my notebook.

      I have a CDH cluster where everything works in deployMode `client`, but the moment I switch to `cluster`, so that the driver is no longer on the same machine as the Zeppelin server, it can't find the packages.

      Working configs

      Inside interpreter:

      master: yarn

      spark.submit.deployMode: client

      Inside `zeppelin-env.sh`:

      export ZEPPELIN_IMPERSONATE_SPARK_PROXY_USER=false
      export ZEPPELIN_IMPERSONATE_CMD='sudo -H -u ${ZEPPELIN_IMPERSONATE_USER} bash -c '
      
      export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
      export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
      export SPARK_CONF_DIR=$SPARK_HOME/conf
      export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf
      
      export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda/envs/py36/bin/python3
      export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/envs/py36/bin/python3
      export PYTHONPATH=/opt/cloudera/parcels/Anaconda/envs/py36/bin/python3
      
      export SPARK_SUBMIT_OPTIONS="--jars hdfs:///user/maziyar/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar"
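
      A minimal check, assuming the Scala `%spark` interpreter, to see which JARs the driver actually reports (the property names match the ones in the Spark UI dumps further down):

      %spark
      // Sketch: print the JAR-related settings as the driver sees them.
      // sc is the SparkContext that Zeppelin injects into the paragraph.
      println("sc.jars               = " + sc.jars.mkString(", "))
      println("spark.yarn.dist.jars  = " + sc.getConf.get("spark.yarn.dist.jars", "<unset>"))
      println("spark.repl.local.jars = " + sc.getConf.get("spark.repl.local.jars", "<unset>"))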
      

       

      Since the JAR is already on HDFS, switching to `cluster` should be as simple as changing `spark.submit.deployMode` to `cluster`. However, doing that results in:

      import org.graphframes._
      
      <console>:23: error: object graphframes is not a member of package org
             import org.graphframes._
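
      As a sanity check (my own diagnostic, assuming the GraphFrames assembly exposes `org.graphframes.GraphFrame`), loading the class by name from the same paragraph shows whether this is a driver classpath problem at all:

      %spark
      // Sketch: probe the driver classpath directly for the GraphFrames entry class.
      // Assumption: the thread context classloader is the one that should see jars added via --jars.
      try {
        Thread.currentThread().getContextClassLoader.loadClass("org.graphframes.GraphFrame")
        println("GraphFrames IS visible on the driver classpath")
      } catch {
        case _: ClassNotFoundException => println("GraphFrames is NOT visible on the driver classpath")
      }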
      

      I can see my JAR in the Spark UI under `spark.yarn.dist.jars` and `spark.yarn.secondary.jars` in both cluster and client mode.

       

      In client mode `sc.jars` returns:

      res2: Seq[String] = List(file:/opt/zeppelin-0.8.2-new/interpreter/spark/spark-interpreter-0.8.2-SNAPSHOT.jar)

       However, in `cluster` mode the same command returns an empty list. I thought maybe there is something extra or missing in the Zeppelin Spark interpreter that prevents the JAR from being used in cluster mode.

       

      This is how Spark UI reports my JAR in `client` mode:

      spark.repl.local.jars   
      file:/tmp/spark-3aadfe3c-8821-4dfe-875b-744c2e35a95a/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar 
      
      spark.yarn.dist.jars   
      hdfs://hadoop-master-1:8020/user/mpanahi/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar 
      
      spark.yarn.secondary.jars 
      graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar 
      
      sun.java.command 
      org.apache.spark.deploy.SparkSubmit --master yarn --conf spark.executor.memory=5g --conf spark.driver.memory=8g --conf spark.driver.cores=4 --conf spark.yarn.isPython=true --conf spark.driver.extraClassPath=:/opt/zeppelin-0.8.2-new/interpreter/spark/:/opt/zeppelin-0.8.2-new/zeppelin-interpreter/target/lib/::/opt/zeppelin-0.8.2-new/zeppelin-interpreter/target/classes:/opt/zeppelin-0.8.2-new/zeppelin-interpreter/target/test-classes:/opt/zeppelin-0.8.2-new/zeppelin-zengine/target/test-classes:/opt/zeppelin-0.8.2-new/interpreter/spark/spark-interpreter-0.8.2-SNAPSHOT.jar --conf spark.useHiveContext=true --conf spark.app.name=Zeppelin --conf spark.executor.cores=5 --conf spark.submit.deployMode=client --conf spark.dynamicAllocation.maxExecutors=50 --conf spark.dynamicAllocation.initialExecutors=1 --conf spark.dynamicAllocation.enabled=true --conf spark.driver.extraJavaOptions= -Dfile.encoding=UTF-8 -Dlog4j.configuration=file:///opt/zeppelin-0.8.2-new/conf/log4j.properties -Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark-mpanahi-zeppelin-hadoop-gateway.log --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer --jars hdfs:///user/mpanahi/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
      

       

      This is how Spark UI reports my JAR in `cluster` mode (same configs as I mentioned above):

      spark.repl.local.jars 	This field does not exist in cluster mode
      
      spark.yarn.dist.jars 	
      hdfs://hadoop-master-1:8020/user/mpanahi/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
      
      spark.yarn.secondary.jars	
      graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
      
      sun.java.command	
      org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer --jar file:/opt/zeppelin-0.8.2-new/interpreter/spark/spark-interpreter-0.8.2-SNAPSHOT.jar --arg 134.158.74.122 --arg 46130 --arg : --properties-file /yarn/nm/usercache/mpanahi/appcache/application_1547731772080_0077/container_1547731772080_0077_01_000001/_spark_conf/spark_conf_.properties
      
      

        

      UPDATE: In Zeppelin 0.9.0, if I run this at the beginning, not only is this JAR accessible, but so are all the JARs passed via `--jars` inside `zeppelin-env.sh`! If I don't do this, it fails as I mentioned before.

      %spark.conf
      
      spark.app.name multivac
      spark.jars hdfs:///user/maziyar/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
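
      After running that `%spark.conf` paragraph, a quick way to confirm the setting was actually picked up (again assuming the Scala `%spark` interpreter):

      %spark
      // Sketch: confirm spark.jars was propagated before trying the import again.
      println("spark.jars = " + sc.getConf.get("spark.jars", "<unset>"))
      import org.graphframes._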
      

       

      I can sort of understand graphframes becoming available even though it was already in `--jars`, but the fact that the rest of my JARs also became available suggests there is something here that pushes the others into the cluster as well.

      Thank you.

       

            People

              Assignee: zjffdu Jeff Zhang
              Reporter: maziyar Maziyar PANAHI


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2.5h