  Spark / SPARK-28903

Fix AWS SDK version conflict that breaks PySpark Kinesis tests


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.3, 3.0.0
    • Fix Version/s: 2.4.5, 3.0.0
    • Component/s: Structured Streaming
    • Labels: None

      Description

      The PySpark Kinesis tests are failing, at least in master:

      ======================================================================
      ERROR: test_kinesis_stream (pyspark.streaming.tests.test_kinesis.KinesisStreamTests)
      ----------------------------------------------------------------------
      Traceback (most recent call last):
        File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/streaming/tests/test_kinesis.py", line 44, in test_kinesis_stream
          kinesisTestUtils = self.ssc._jvm.org.apache.spark.streaming.kinesis.KinesisTestUtils(2)
        File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1554, in __call__
          answer, self._gateway_client, None, self._fqn)
        File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
          format(target_id, ".", name), value)
      Py4JJavaError: An error occurred while calling None.org.apache.spark.streaming.kinesis.KinesisTestUtils.
      : java.lang.NoSuchMethodError: com.amazonaws.regions.Region.getAvailableEndpoints()Ljava/util/Collection;
      	at org.apache.spark.streaming.kinesis.KinesisTestUtils$.$anonfun$getRegionNameByEndpoint$1(KinesisTestUtils.scala:211)
      	at org.apache.spark.streaming.kinesis.KinesisTestUtils$.$anonfun$getRegionNameByEndpoint$1$adapted(KinesisTestUtils.scala:211)
      	at scala.collection.Iterator.find(Iterator.scala:993)
      	at scala.collection.Iterator.find$(Iterator.scala:990)
      	at scala.collection.AbstractIterator.find(Iterator.scala:1429)
      	at scala.collection.IterableLike.find(IterableLike.scala:81)
      	at scala.collection.IterableLike.find$(IterableLike.scala:80)
      	at scala.collection.AbstractIterable.find(Iterable.scala:56)
      	at org.apache.spark.streaming.kinesis.KinesisTestUtils$.getRegionNameByEndpoint(KinesisTestUtils.scala:211)
      	at org.apache.spark.streaming.kinesis.KinesisTestUtils.<init>(KinesisTestUtils.scala:46)
      ...
      

      The non-Python Kinesis tests pass, however. It turns out this is because the PySpark tests run against the output of the Spark assembly, which pulls in hadoop-cloud, which in turn pulls in an old AWS Java SDK. That older SDK's com.amazonaws.regions.Region class does not have the getAvailableEndpoints() method that KinesisTestUtils calls, hence the NoSuchMethodError above.

      Per Steve Loughran, it seems we can resolve this simply by excluding the aws-java-sdk dependency. See the attached PR for more detail about the debugging and the other options considered.
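
      A minimal sketch of what such an exclusion could look like, assuming the old SDK arrives via the hadoop-aws dependency in hadoop-cloud/pom.xml (the coordinates and version property below are assumptions for illustration, not the exact change made in the PR):

      <!-- hadoop-cloud/pom.xml (hypothetical sketch): exclude the old monolithic AWS SDK
           that hadoop-aws drags in, so the assembly keeps only the newer SDK used by
           the kinesis-asl module -->
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>${hadoop.version}</version>
        <exclusions>
          <exclusion>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk</artifactId>
          </exclusion>
        </exclusions>
      </dependency>

      With the old SDK excluded from the assembly, the newer AWS SDK pulled in by kinesis-asl is the only one on the test classpath, so com.amazonaws.regions.Region.getAvailableEndpoints() resolves again.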


              People

              • Assignee: Sean R. Owen (srowen)
              • Reporter: Sean R. Owen (srowen)
