Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Not A Problem
- Affects Version: 0.14.1
- Fix Version: None
- Component: None
- Environment:
  - Hadoop 2.8.5
  - EMR 5.24.1
  - Python version: 3.7.4
  - skein version: 0.8.0
Description
I can't access HDFS through pyarrow from inside a YARN container created by skein.
This code works in a Jupyter notebook running on the master node, or in an ipython terminal on a worker when given the ARROW_LIBHDFS_DIR env var:

```python
import pyarrow
pyarrow.hdfs.connect()
```
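For reference, the working interactive setup is roughly the following (a sketch only; setting the env vars from Python is just for illustration, they normally come from the shell, and the paths are the EMR defaults used elsewhere in this report):

```python
import os
import pyarrow

# EMR default locations (same values as in the skein spec below).
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/lib/hadoop/lib/native"
os.environ["JAVA_HOME"] = "/etc/alternatives/jre"

fs = pyarrow.hdfs.connect()  # default namenode from the cluster's core-site.xml
print(fs.ls("/"))
```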
However, when running on YARN by submitting the following skein application, I get a Java error:
```yaml
name: test_conn
queue: default

master:
  env:
    ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native
    JAVA_HOME: /etc/alternatives/jre
  resources:
    vcores: 1
    memory: 10 GiB
  files:
    conda_env: /home/hadoop/environment.tar.gz
  script: |
    echo $HADOOP_HOME
    echo $JAVA_HOME
    echo $HADOOP_CLASSPATH
    echo $ARROW_LIBHDFS_DIR
    source conda_env/bin/activate
    python -c "import pyarrow; fs = pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"
    echo "Hello World!"
```
FYI, I tried with and without all of those extra env vars, to no effect. I also tried reconfiguring the EMR cluster with each of the following properties:
{{"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs"
"fs.AbstractFileSystem.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
"fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"}}
Setting fs.AbstractFileSystem.hdfs.impl gave a slightly different error: Hadoop could now resolve which class to use for the "hdfs://" scheme by name, namely org.apache.hadoop.hdfs.DistributedFileSystem, but it could not load that class.
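(These properties can also be passed per-connection through pyarrow's extra_conf parameter, visible in the traceback below, rather than cluster-wide; a sketch of that variant, which I have not verified to behave any differently:)

```python
import pyarrow

# Same property as one of the cluster-wide attempts above,
# passed per-connection instead of via EMR configuration.
fs = pyarrow.hdfs.connect(
    extra_conf={"fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"}
)
```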
Logs:
```
=========================================================================================
LogType:application.driver.log
Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
LogLength:2635
Log Contents:
/usr/lib/hadoop
/usr/lib/jvm/java-openjdk
:/usr/lib/hadoop-lzo/lib/:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/:/usr/lib/hadoop-lzo/lib/:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/
hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439)
at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414)
at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411)
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 215, in connect
extra_conf=extra_conf)
File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 40, in _init_
self._connect(host, port, user, kerb_ticket, driver, extra_conf)
File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
Hello World!
End of LogType:application.driver.log
LogType:application.master.log
Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
LogLength:1588
Log Contents:
19/08/29 20:51:55 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
19/08/29 20:51:55 INFO skein.ApplicationMaster: Running as user hadoop
19/08/29 20:51:55 INFO skein.ApplicationMaster: Application specification successfully loaded
19/08/29 20:51:56 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8030
19/08/29 20:51:56 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/08/29 20:51:56 INFO skein.ApplicationMaster: gRPC server started at IP.ec2.internal:39361
19/08/29 20:51:57 INFO skein.ApplicationMaster: WebUI server started at IP.ec2.internal:36511
19/08/29 20:51:57 INFO skein.ApplicationMaster: Registering application with resource manager
19/08/29 20:51:57 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8032
19/08/29 20:51:57 INFO skein.ApplicationMaster: Starting application driver
19/08/29 20:51:57 INFO skein.ApplicationMaster: Shutting down: Application driver completed successfully.
19/08/29 20:51:57 INFO skein.ApplicationMaster: Unregistering application with status SUCCEEDED
19/08/29 20:51:57 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
19/08/29 20:51:58 INFO skein.ApplicationMaster: Deleted application directory hdfs://IP.ec2.internal:8020/user/hadoop/.skein/application_1567110830725_0001
19/08/29 20:51:58 INFO skein.ApplicationMaster: WebUI server shut down
19/08/29 20:51:58 INFO skein.ApplicationMaster: gRPC server shut down
End of LogType:application.master.log
```