  1. Spark
  2. SPARK-24736

--py-files not functional for non-local URLs. It appears to pass non-local URLs into PYTHONPATH directly.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: Kubernetes, PySpark, Spark Core
    • Labels: None
    • Environment: Recent 2.4.0 build from the master branch, submitted on Linux to a KOPS Kubernetes cluster created on AWS.

       

    Description

      My spark-submit command:
      bin/spark-submit \
              --master k8s://https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com \
              --deploy-mode cluster \
              --name pytest \
              --conf spark.kubernetes.container.image=412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest \
              --conf spark.kubernetes.driver.pod.name=spark-pi-driver \
              --conf spark.kubernetes.authenticate.submission.caCertFile=cluster.ca \
              --conf spark.kubernetes.authenticate.submission.oauthToken=$TOK \
              --conf spark.kubernetes.authenticate.driver.oauthToken=$TOK \
              --py-files "https://s3.amazonaws.com/maxar-ids-fids/screw.zip" \
              https://s3.amazonaws.com/maxar-ids-fids/it.py
       
      screw.zip is successfully downloaded and placed in the directory returned by SparkFiles.getRootDirectory():
      2018-07-01 07:33:43 INFO  SparkContext:54 - Added file https://s3.amazonaws.com/maxar-ids-fids/screw.zip at https://s3.amazonaws.com/maxar-ids-fids/screw.zip with timestamp 1530430423297
      2018-07-01 07:33:43 INFO  Utils:54 - Fetching https://s3.amazonaws.com/maxar-ids-fids/screw.zip to /var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240/fetchFileTemp1549645948768432992.tmp
      I print out the PYTHONPATH and PYSPARK_FILES environment variables from the driver script:
      PYTHONPATH /opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-0.10.7-src.zip:/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar:/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-.zip:https://s3.amazonaws.com/maxar-ids-fids/screw.zip*
      PYSPARK_FILES https://s3.amazonaws.com/maxar-ids-fids/screw.zip
       
      I print out sys.path:
      '/tmp/spark-fec3684b-8b63-4f43-91a4-2f2fa41a1914', u'/var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240', '/opt/spark/python/lib/pyspark.zip', '/opt/spark/python/lib/py4j-0.10.7-src.zip', '/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar', '/opt/spark/python/lib/py4j-.zip', '/opt/spark/work-dir/https', '//s3.amazonaws.com/maxar-ids-fids/screw.zip'
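The final two sys.path entries are what one would expect if a URL in PYTHONPATH were split at the colon after its scheme. The following is a minimal sketch of that behaviour, not the interpreter's actual startup code: the function name split_pythonpath is hypothetical, and the step that makes relative entries absolute against the working directory is an assumption modelled on the output above.

```python
import posixpath


def split_pythonpath(pythonpath, cwd):
    """Approximate how a PYTHONPATH string becomes sys.path entries on Linux.

    Assumption: entries are separated by ':' and relative entries are
    made absolute against the current working directory.
    """
    entries = []
    for entry in pythonpath.split(":"):
        if not entry:
            continue  # skip empty segments produced by doubled colons
        # join() keeps an absolute entry as-is; normpath() preserves a
        # leading '//' because POSIX treats exactly two slashes specially.
        entries.append(posixpath.normpath(posixpath.join(cwd, entry)))
    return entries


if __name__ == "__main__":
    value = ("/opt/spark/python/lib/pyspark.zip:"
             "https://s3.amazonaws.com/maxar-ids-fids/screw.zip")
    for path in split_pythonpath(value, "/opt/spark/work-dir"):
        print(path)
```

With the working directory set to /opt/spark/work-dir, the URL yields one entry ending in "https" and another starting with "//s3.amazonaws.com", matching the shape of the entries printed by the driver.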
       
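      The URL from PYSPARK_FILES is placed in PYTHONPATH verbatim. Python splits PYTHONPATH on ":", so the URL ends up in sys.path as the two unusable entries '/opt/spark/work-dir/https' and '//s3.amazonaws.com/maxar-ids-fids/screw.zip', and the downloaded screw.zip itself is never added.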
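Until the path handling is fixed, a driver script could add the already-downloaded archive to sys.path itself. The sketch below is one possible workaround, not the fix applied in Spark. It assumes each fetched file is stored in the SparkFiles root directory under the base name of its URI, which matches the log lines above but is not verified here. The helper names local_copies and prepend_to_sys_path are hypothetical.

```python
import posixpath
import sys
from urllib.parse import urlparse


def local_copies(py_file_uris, root_dir):
    """Map each URI to the path it would have inside root_dir.

    Assumption: a fetched file keeps the base name of its URI path.
    """
    paths = []
    for uri in py_file_uris:
        name = posixpath.basename(urlparse(uri).path)
        if name:  # ignore URIs that end in '/' and have no file name
            paths.append(posixpath.join(root_dir, name))
    return paths


def prepend_to_sys_path(paths):
    """Insert paths at the front of sys.path, preserving their order."""
    for path in reversed(paths):
        if path not in sys.path:
            sys.path.insert(0, path)
```

In a driver script this might be called as prepend_to_sys_path(local_copies(["https://s3.amazonaws.com/maxar-ids-fids/screw.zip"], SparkFiles.getRootDirectory())) after the SparkContext has been created. Calling SparkContext.addPyFile with the same URL may be a simpler alternative, since it is intended to both distribute the file and make it importable.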
       
      Dump of the Spark config from the container:
      [(u'spark.master', u'k8s://https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com'),
       (u'spark.kubernetes.authenticate.submission.oauthToken', u'<present_but_redacted>'),
       (u'spark.kubernetes.authenticate.driver.oauthToken', u'<present_but_redacted>'),
       (u'spark.kubernetes.executor.podNamePrefix', u'pytest-1530430411996'),
       (u'spark.kubernetes.memoryOverheadFactor', u'0.4'),
       (u'spark.driver.blockManager.port', u'7079'),
       (u'spark.app.id', u'spark-application-1530430424433'),
       (u'spark.app.name', u'pytest'),
       (u'spark.executor.id', u'driver'),
       (u'spark.driver.host', u'pytest-1530430411996-driver-svc.default.svc'),
       (u'spark.kubernetes.container.image', u'412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest'),
       (u'spark.driver.port', u'7078'),
       (u'spark.kubernetes.python.mainAppResource', u'https://s3.amazonaws.com/maxar-ids-fids/it.py'),
       (u'spark.kubernetes.authenticate.submission.caCertFile', u'cluster.ca'),
       (u'spark.rdd.compress', u'True'),
       (u'spark.driver.bindAddress', u'100.120.0.1'),
       (u'spark.kubernetes.driver.pod.name', u'spark-pi-driver'),
       (u'spark.serializer.objectStreamReset', u'100'),
       (u'spark.files', u'https://s3.amazonaws.com/maxar-ids-fids/it.py,https://s3.amazonaws.com/maxar-ids-fids/screw.zip'),
       (u'spark.kubernetes.python.pyFiles', u'https://s3.amazonaws.com/maxar-ids-fids/screw.zip'),
       (u'spark.kubernetes.authenticate.driver.mounted.oauthTokenFile', u'/mnt/secrets/spark-kubernetes-credentials/oauth-token'),
       (u'spark.submit.deployMode', u'client'),
       (u'spark.kubernetes.submitInDriver', u'true')]
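To spot settings that could trigger this problem, the comma-separated file lists in the config can be checked for entries that are not local paths. The sketch below is only an illustration: the helper name non_local_uris is hypothetical, and treating an empty scheme, "file" and "local" as the local schemes is an assumption rather than Spark's own classification.

```python
from urllib.parse import urlparse

# Assumption: these schemes denote files already reachable on the
# container's filesystem; anything else is treated as remote.
LOCAL_SCHEMES = {"", "file", "local"}


def non_local_uris(conf_value):
    """Return the entries of a comma-separated URI list that look remote."""
    uris = [u.strip() for u in conf_value.split(",") if u.strip()]
    return [u for u in uris if urlparse(u).scheme not in LOCAL_SCHEMES]


if __name__ == "__main__":
    py_files = "https://s3.amazonaws.com/maxar-ids-fids/screw.zip"
    for uri in non_local_uris(py_files):
        print("remote entry in spark.kubernetes.python.pyFiles:", uri)
```

Any entry reported by this check would contain a colon after its scheme and could therefore be split if it were appended to PYTHONPATH unchanged.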
       

            People

              Assignee: vanzin Marcelo Masiero Vanzin
              Reporter: jonathan.weaver Jonathan A Weaver
              Votes: 2
              Watchers: 7