Hadoop Common / HADOOP-15559

Clarity on Spark compatibility with hadoop-aws


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: documentation, fs/s3
    • Labels: None

    Description

      I'm the maintainer of Flintrock, a command-line tool for launching Apache Spark clusters on AWS. One of the things I try to do for my users is make it straightforward to use Spark with s3a://. I do this by recommending that users start Spark with the hadoop-aws package.

      For example:

      pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
      

      I'm struggling, however, to understand what versions of hadoop-aws should work with what versions of Spark.

      Spark releases are built against Hadoop 2.7. At the same time, I've been told that newer versions of Hadoop and its client libraries should interoperate with Spark: for example, Spark built against Hadoop 2.7 should work alongside HDFS 2.8, with no need to rebuild Spark explicitly against Hadoop 2.8.
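      (As a quick sanity check of what Hadoop version a given Spark build was actually compiled against, I ask Hadoop's VersionInfo class from inside the PySpark shell. This is just a sketch; reaching through PySpark's py4j _jvm gateway is an implementation detail rather than a public API.)

      # inside a PySpark shell started from a Spark release prebuilt "for Hadoop 2.7"
      >>> sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
      # returns the Hadoop version on Spark's classpath, e.g. '2.7.3'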

      I'm having trouble translating this mental model into recommendations for how to pair Spark with hadoop-aws.

      For example, Spark 2.3.1 built against Hadoop 2.7 works with hadoop-aws:2.7.6 but not with hadoop-aws:2.8.4. Trying the latter yields the following error when I try to access files via s3a://.

      py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
      : java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
          at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
          at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
          at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
          at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
          at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
          at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
          at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
          at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
          at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
          at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
          at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
          at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
          at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
          at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
          at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:498)
          at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
          at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
          at py4j.Gateway.invoke(Gateway.java:282)
          at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
          at py4j.commands.CallCommand.execute(CallCommand.java:79)
          at py4j.GatewayConnection.run(GatewayConnection.java:238)
          at java.lang.Thread.run(Thread.java:748)

      So it would seem that hadoop-aws must be matched to the same MAJOR.MINOR release line as the Hadoop version Spark is built against. However, neither this page nor this one sheds any light on how to pair the correct version of hadoop-aws with Spark.
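      Concretely, the pairing that works for me today, assuming that matching the MAJOR.MINOR line really is the rule, looks like this:

      # Spark 2.3.1 prebuilt for Hadoop 2.7: stay on the hadoop-aws 2.7.x line
      pyspark --packages "org.apache.hadoop:hadoop-aws:2.7.6"

      # a Spark build compiled against Hadoop 2.8 should presumably pair with 2.8.x instead
      pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"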

      Would it be appropriate to add some guidance somewhere on what versions of hadoop-aws work with what versions and builds of Spark? It would help eliminate this kind of guesswork and slow spelunking.

            People

              Assignee: Unassigned
              Reporter: Nicholas Chammas (nchammas)
