SPARK-46314

If Hadoop is not installed and configured, can the Spark cluster read and write OBS in standalone mode?


Details

    • Type: IT Help
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.4.1
    • Fix Version/s: None
    • Component/s: Connect, Input/Output, PySpark
    • Labels: None
    • Environment: Python 3.8, pyspark 3.4.1, Ubuntu 20.04

    Description

      If Hadoop is not deployed, how can the PySpark APIs read data from an OBS bucket and convert it to an RDD?

      The following code reports the error: No FileSystem for scheme "obs". Can Spark read and write OBS without Hadoop installed and configured?

      Also, I'm not familiar with PySpark. Is the code wrong?

      from pyspark import SparkConf
      from pyspark.sql import SparkSession

      conf = SparkConf()
      conf.set("spark.app.name", "read and write OBS")
      conf.set("spark.security.credentials.hbase.enabled", "true")
      # ak / sk hold the OBS access key and secret key
      conf.set("spark.hadoop.fs.obs.access.key", ak)
      conf.set("spark.hadoop.fs.obs.secret.key", sk)
      conf.set("spark.hadoop.fs.obs.endpoint", "http://xxx")
      spark = SparkSession.builder.config(conf=conf).getOrCreate()

      df = spark.read.json('obs://bucket_name/xxx.json')
      df.coalesce(2).write.json("obs://bucket_name/", "overwrite")
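
      For context, Hadoop's "No FileSystem for scheme" error generally means that no FileSystem implementation class is registered for that URI scheme on the classpath. A minimal sketch of the extra settings that are typically needed, assuming the Huawei hadoop-huaweicloud (OBSA) connector and its `org.apache.hadoop.fs.obs.OBSFileSystem` class (these names are assumptions, not confirmed by this issue):

      ```properties
      # spark-defaults.conf sketch (assumed keys; verify against the OBSA connector docs)
      spark.hadoop.fs.obs.impl    org.apache.hadoop.fs.obs.OBSFileSystem
      # The connector jar must also be on the driver/executor classpath, e.g.:
      # spark.jars    /path/to/hadoop-huaweicloud-<version>.jar
      ```

      Note that prebuilt Spark distributions (e.g. spark-3.4.1-bin-hadoop3) already bundle the Hadoop client libraries, so reading obs:// paths should not require a full Hadoop deployment, only the connector jar plus settings like the above.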

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Yuqing Xue (xueice)
            Votes: 0
            Watchers: 1
