Apache Arrow / ARROW-6389

[Python] java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.14.1
    • Fix Version/s: None
    • Component/s: Java, Python
    • Labels: None
    • Environment: Hadoop 2.8.5
      EMR 5.24.1
      python version: 3.7.4
      skein version: 0.8.0

    Description

      I can't access HDFS through pyarrow from inside a YARN container created by skein.

      This code works in a Jupyter notebook running on the master node, and in an IPython terminal on a worker node when the ARROW_LIBHDFS_DIR env var is set:

      ```import pyarrow; pyarrow.hdfs.connect()```
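For reference, the working case above depends on ARROW_LIBHDFS_DIR pointing at a directory that contains libhdfs.so. A minimal sketch of setting it from Python before connecting is shown here; the helper names `find_libhdfs` and `connect_with_libhdfs` are hypothetical, and the default candidate directory is simply the path used elsewhere in this report.

```python
import os


def find_libhdfs(candidates):
    """Return the first directory in `candidates` that contains libhdfs.so, else None."""
    for directory in candidates:
        if os.path.isfile(os.path.join(directory, "libhdfs.so")):
            return directory
    return None


def connect_with_libhdfs(candidates=("/usr/lib/hadoop/lib/native",)):
    """Set ARROW_LIBHDFS_DIR, then open an HDFS connection (hypothetical helper)."""
    libhdfs_dir = find_libhdfs(candidates)
    if libhdfs_dir is None:
        raise FileNotFoundError("libhdfs.so not found in: %s" % (candidates,))
    # The variable must be set before pyarrow tries to load libhdfs.
    os.environ["ARROW_LIBHDFS_DIR"] = libhdfs_dir
    import pyarrow  # imported lazily so the lookup above works without pyarrow

    return pyarrow.hdfs.connect()
```

Calling `connect_with_libhdfs()` only covers locating the native library; it does not change which jars the JVM sees.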

       

      However, when I run it on YARN by submitting the following skein application, I get a Java error.

       

      {{name: test_conn
      queue: default

      master:
        env:
          ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native
          JAVA_HOME: /etc/alternatives/jre
        resources:
          vcores: 1
          memory: 10 GiB
        files:
          conda_env: /home/hadoop/environment.tar.gz
        script: |
          echo $HADOOP_HOME
          echo $JAVA_HOME
          echo $HADOOP_CLASSPATH
          echo $ARROW_LIBHDFS_DIR
          source conda_env/bin/activate
          python -c "import pyarrow; fs = pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"
          echo "Hello World!"}}

      FYI, I tried with and without each of those extra env vars, to no effect. I also tried modifying the EMR cluster configuration with each of the following settings:

       

      {{"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs"
      "fs.AbstractFileSystem.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
      "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"}}

      The fs.AbstractFileSystem.hdfs.impl setting gave a slightly different error: Hadoop resolved the class name to use for the "hdfs://" scheme, namely org.apache.hadoop.hdfs.DistributedFileSystem, but could not load that class.
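A class name that resolves but cannot be loaded usually points at the JVM classpath rather than at the configuration. As a rough diagnostic (a sketch, not a confirmed fix for this issue), the helper below checks whether a classpath string contains any jar whose file name starts with `hadoop-hdfs`. The function names and the jar-name prefix are assumptions made for illustration.

```python
import glob
import os


def jars_on_classpath(classpath):
    """List the jar files named by a classpath string.

    Entries ending in '*' are treated as Java-style wildcards (every .jar in
    that directory); entries ending in '.jar' are taken as-is; anything else,
    such as a plain directory, contributes no jars.
    """
    jars = []
    for entry in classpath.split(os.pathsep):
        if not entry:
            continue
        if entry.endswith("*"):
            pattern = os.path.join(os.path.dirname(entry), "*.jar")
            jars.extend(sorted(glob.glob(pattern)))
        elif entry.endswith(".jar"):
            jars.append(entry)
    return jars


def has_hdfs_jar(classpath):
    """Return True if any jar on the classpath looks like an HDFS jar (assumed prefix)."""
    return any(
        os.path.basename(jar).startswith("hadoop-hdfs")
        for jar in jars_on_classpath(classpath)
    )
```

Inside the container, `has_hdfs_jar(os.environ.get("CLASSPATH", ""))` gives a quick yes or no before pyarrow is imported at all.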

      Logs:

       

      {{=========================================================================================
      LogType:application.driver.log
      Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
      LogLength:2635
      Log Contents:
      /usr/lib/hadoop
      /usr/lib/jvm/java-openjdk
      :/usr/lib/hadoop-lzo/lib/:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/:/usr/lib/hadoop-lzo/lib/:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/

      hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
      java.io.IOException: No FileSystem for scheme: hdfs
      at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846)
      at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
      at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
      at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
      at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884)
      at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439)
      at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414)
      at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
      at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411)
      Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 215, in connect
      extra_conf=extra_conf)
      File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
      self._connect(host, port, user, kerb_ticket, driver, extra_conf)
      File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
      File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
      pyarrow.lib.ArrowIOError: HDFS connection failed
      Hello World!
      End of LogType:application.driver.log

      LogType:application.master.log
      Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
      LogLength:1588
      Log Contents:
      19/08/29 20:51:55 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
      19/08/29 20:51:55 INFO skein.ApplicationMaster: Running as user hadoop
      19/08/29 20:51:55 INFO skein.ApplicationMaster: Application specification successfully loaded
      19/08/29 20:51:56 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8030
      19/08/29 20:51:56 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
      19/08/29 20:51:56 INFO skein.ApplicationMaster: gRPC server started at IP.ec2.internal:39361
      19/08/29 20:51:57 INFO skein.ApplicationMaster: WebUI server started at IP.ec2.internal:36511
      19/08/29 20:51:57 INFO skein.ApplicationMaster: Registering application with resource manager
      19/08/29 20:51:57 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8032
      19/08/29 20:51:57 INFO skein.ApplicationMaster: Starting application driver
      19/08/29 20:51:57 INFO skein.ApplicationMaster: Shutting down: Application driver completed successfully.
      19/08/29 20:51:57 INFO skein.ApplicationMaster: Unregistering application with status SUCCEEDED
      19/08/29 20:51:57 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
      19/08/29 20:51:58 INFO skein.ApplicationMaster: Deleted application directory hdfs://IP.ec2.internal:8020/user/hadoop/.skein/application_1567110830725_0001
      19/08/29 20:51:58 INFO skein.ApplicationMaster: WebUI server shut down
      19/08/29 20:51:58 INFO skein.ApplicationMaster: gRPC server shut down
      End of LogType:application.master.log}}
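The driver log prints HADOOP_CLASSPATH but not CLASSPATH, which is the variable the JVM started by libhdfs reads. One thing worth trying, offered as a sketch under assumptions and not as the confirmed resolution of this issue, is to populate CLASSPATH from `hadoop classpath --glob` before connecting. The helper names are hypothetical, and the code assumes the `hadoop` executable is on PATH inside the container.

```python
import os
import subprocess


def hadoop_classpath(hadoop_bin="hadoop", run=subprocess.check_output):
    """Return the expanded Hadoop classpath as a single string.

    `run` is injectable so the function can be exercised without Hadoop installed.
    """
    output = run([hadoop_bin, "classpath", "--glob"])
    return output.decode().strip()


def export_classpath(env=os.environ, **kwargs):
    """Store the expanded Hadoop classpath in env['CLASSPATH'] and return it."""
    env["CLASSPATH"] = hadoop_classpath(**kwargs)
    return env["CLASSPATH"]
```

In the skein script this would run before the `pyarrow.hdfs.connect()` call, for example `export_classpath()` followed by the import and connect.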

          People

            Assignee: Unassigned
            Reporter: Ben Schreck (bschreck)