SPARK-46314

If Hadoop is not installed and configured, can the Spark cluster read and write OBS in standalone mode?


Details

    • Type: IT Help
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.4.1
    • Fix Version/s: None
    • Component/s: Connect, Input/Output, PySpark
    • Labels: None
    • Environment: Python 3.8, pyspark 3.4.1, Ubuntu 20.04

    Description

      If Hadoop is not deployed, how can the PySpark APIs read data from an OBS bucket and convert it to an RDD?

      The following code reports the error: No FileSystem for scheme "obs". Can Spark read and write OBS without Hadoop installed and configured?

      Also, I'm not familiar with PySpark. Is the code wrong?

      from pyspark import SparkConf
      from pyspark.sql import SparkSession

      conf = SparkConf()
      conf.set("spark.app.name", "read and write OBS")
      conf.set("spark.security.credentials.hbase.enabled", "true")
      # ak / sk hold the OBS access key and secret key
      conf.set("spark.hadoop.fs.obs.access.key", ak)
      conf.set("spark.hadoop.fs.obs.secret.key", sk)
      conf.set("spark.hadoop.fs.obs.endpoint", "http://xxx")
      spark = SparkSession.builder.config(conf=conf).getOrCreate()

      df = spark.read.json('obs://bucket_name/xxx.json')
      df.coalesce(2).write.json("obs://bucket_name/", "overwrite")
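
      For context, Hadoop's "No FileSystem for scheme" error generally means that no FileSystem implementation class is registered for that URI scheme on the classpath. A minimal sketch of the extra settings that are typically needed, assuming the Huawei hadoop-huaweicloud (OBSA) connector and its `org.apache.hadoop.fs.obs.OBSFileSystem` class (these names are assumptions, not confirmed by this issue):

      ```properties
      # spark-defaults.conf sketch (assumed keys; verify against the OBSA connector docs)
      spark.hadoop.fs.obs.impl    org.apache.hadoop.fs.obs.OBSFileSystem
      # The connector jar must also be on the driver/executor classpath, e.g.:
      # spark.jars    /path/to/hadoop-huaweicloud-<version>.jar
      ```

      Note that prebuilt Spark distributions (e.g. spark-3.4.1-bin-hadoop3) already bundle the Hadoop client libraries, so reading obs:// paths should not require a full Hadoop deployment, only the connector jar plus settings like the above.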

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Yuqing Xue (xueice)
            Votes: 0
            Watchers: 1
