ZEPPELIN-3986

Cannot access any JAR in yarn cluster mode


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.1, 0.8.2, 0.9.0
    • Fix Version/s: 0.8.2, 0.9.0
    • Component/s: Interpreters
    • Labels: None
    • Environment:

      Cloudera/CDH 6.1

      Spark 2.4

      Hadoop 3.0

      Zeppelin 0.8.2 (built from the latest merged pull request)

    Description

      Hello,

      YARN cluster mode was introduced in `0.8.0`, and the problem of not finding ZeppelinContext was fixed in `0.8.1`. However, I am having difficulty accessing any JAR in order to `import` it inside my notebook.

      I have a CDH cluster where everything works in deployMode `client`, but the moment I switch to `cluster`, so that the driver is no longer on the same machine as the Zeppelin server, it can't find the packages.

      Working configs

      Inside interpreter:

      master: yarn

      spark.submit.deployMode: client

      Inside `zeppelin-env.sh`:

      export ZEPPELIN_IMPERSONATE_SPARK_PROXY_USER=false
      export ZEPPELIN_IMPERSONATE_CMD='sudo -H -u ${ZEPPELIN_IMPERSONATE_USER} bash -c '
      
      export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
      export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
      export SPARK_CONF_DIR=$SPARK_HOME/conf
      export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf
      
      export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda/envs/py36/bin/python3
      export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/envs/py36/bin/python3
      export PYTHONPATH=/opt/cloudera/parcels/Anaconda/envs/py36/bin/python3
      
      export SPARK_SUBMIT_OPTIONS="--jars hdfs:///user/maziyar/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar"
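
      A minimal check, assuming the Scala `%spark` interpreter, to see which JARs the driver actually reports (the property names match the ones in the Spark UI dumps further down):

      %spark
      // Sketch: print the JAR-related settings as the driver sees them.
      // sc is the SparkContext that Zeppelin injects into the paragraph.
      println("sc.jars               = " + sc.jars.mkString(", "))
      println("spark.yarn.dist.jars  = " + sc.getConf.get("spark.yarn.dist.jars", "<unset>"))
      println("spark.repl.local.jars = " + sc.getConf.get("spark.repl.local.jars", "<unset>"))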
      

       

      Since the JAR is already on HDFS, switching to `cluster` should be as simple as changing `spark.submit.deployMode` to `cluster`. However, doing that results in:

      import org.graphframes._
      
      <console>:23: error: object graphframes is not a member of package org
             import org.graphframes._
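
      As a sanity check (my own diagnostic, assuming the GraphFrames assembly exposes `org.graphframes.GraphFrame`), loading the class by name from the same paragraph shows whether this is a driver classpath problem at all:

      %spark
      // Sketch: probe the driver classpath directly for the GraphFrames entry class.
      // Assumption: the thread context classloader is the one that should see jars added via --jars.
      try {
        Thread.currentThread().getContextClassLoader.loadClass("org.graphframes.GraphFrame")
        println("GraphFrames IS visible on the driver classpath")
      } catch {
        case _: ClassNotFoundException => println("GraphFrames is NOT visible on the driver classpath")
      }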
      

      I can see my JAR in the Spark UI under `spark.yarn.dist.jars` and `spark.yarn.secondary.jars` in both cluster and client mode.

       

      In client mode `sc.jars` returns:

      res2: Seq[String] = List(file:/opt/zeppelin-0.8.2-new/interpreter/spark/spark-interpreter-0.8.2-SNAPSHOT.jar)

       However, in `cluster` mode the same command returns an empty list. I thought maybe there is something extra or missing in the Zeppelin Spark interpreter that prevents the JAR from being used in cluster mode.

       

      This is how Spark UI reports my JAR in `client` mode:

      spark.repl.local.jars   
      file:/tmp/spark-3aadfe3c-8821-4dfe-875b-744c2e35a95a/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar 
      
      spark.yarn.dist.jars   
      hdfs://hadoop-master-1:8020/user/mpanahi/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar 
      
      spark.yarn.secondary.jars 
      graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar 
      
      sun.java.command 
      org.apache.spark.deploy.SparkSubmit --master yarn --conf spark.executor.memory=5g --conf spark.driver.memory=8g --conf spark.driver.cores=4 --conf spark.yarn.isPython=true --conf spark.driver.extraClassPath=:/opt/zeppelin-0.8.2-new/interpreter/spark/:/opt/zeppelin-0.8.2-new/zeppelin-interpreter/target/lib/::/opt/zeppelin-0.8.2-new/zeppelin-interpreter/target/classes:/opt/zeppelin-0.8.2-new/zeppelin-interpreter/target/test-classes:/opt/zeppelin-0.8.2-new/zeppelin-zengine/target/test-classes:/opt/zeppelin-0.8.2-new/interpreter/spark/spark-interpreter-0.8.2-SNAPSHOT.jar --conf spark.useHiveContext=true --conf spark.app.name=Zeppelin --conf spark.executor.cores=5 --conf spark.submit.deployMode=client --conf spark.dynamicAllocation.maxExecutors=50 --conf spark.dynamicAllocation.initialExecutors=1 --conf spark.dynamicAllocation.enabled=true --conf spark.driver.extraJavaOptions= -Dfile.encoding=UTF-8 -Dlog4j.configuration=file:///opt/zeppelin-0.8.2-new/conf/log4j.properties -Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark-mpanahi-zeppelin-hadoop-gateway.log --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer --jars hdfs:///user/mpanahi/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
      

       

      This is how Spark UI reports my JAR in `cluster` mode (same configs as I mentioned above):

      spark.repl.local.jars 	This field does not exist in cluster mode
      
      spark.yarn.dist.jars 	
      hdfs://hadoop-master-1:8020/user/mpanahi/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
      
      spark.yarn.secondary.jars	
      graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
      
      sun.java.command	
      org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer --jar file:/opt/zeppelin-0.8.2-new/interpreter/spark/spark-interpreter-0.8.2-SNAPSHOT.jar --arg 134.158.74.122 --arg 46130 --arg : --properties-file /yarn/nm/usercache/mpanahi/appcache/application_1547731772080_0077/container_1547731772080_0077_01_000001/_spark_conf/spark_conf_.properties
      
      

        

      UPDATE: In Zeppelin 0.9.0, if I run this at the beginning, not only is this JAR accessible, but so are all the JARs passed via `--jars` inside `zeppelin-env.sh`! If I don't do this, it fails as I mentioned before.

      %spark.conf
      
      spark.app.name multivac
      spark.jars hdfs:///user/maziyar/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
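
      After running that `%spark.conf` paragraph, a quick way to confirm the setting was actually picked up (again assuming the Scala `%spark` interpreter):

      %spark
      // Sketch: confirm spark.jars was propagated before trying the import again.
      println("spark.jars = " + sc.getConf.get("spark.jars", "<unset>"))
      import org.graphframes._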
      

       

      I can sort of understand graphframes becoming available even though it was already in `--jars`, but the fact that the rest of my JARs also became available suggests there is something here that pushes the others into the cluster as well.

      Thank you.

       

            People

              Assignee: zjffdu Jeff Zhang
              Reporter: maziyar Maziyar PANAHI


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2.5h