Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Duplicate
-
3.2.0
-
None
-
None
Description
Spark 3.2.1-RC2
During write.json spark deletes columns with all Null as default.
Spark does have dropFieldIfAllNull false as default, according to https://spark.apache.org/docs/latest/sql-data-sources-json.html
from pyspark import pandas as ps import re import numpy as np import os import pandas as pd from pyspark import SparkContext, SparkConf from pyspark.sql import SparkSession from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr from pyspark.sql.types import StructType, StructField, StringType,IntegerType os.environ["PYARROW_IGNORE_TIMEZONE"]="1" def get_spark_session(app_name: str, conf: SparkConf): conf.setMaster('local[*]') conf \ .set('spark.driver.memory', '64g')\ .set("fs.s3a.access.key", "minio") \ .set("fs.s3a.secret.key", "") \ .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \ .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \ .set("spark.hadoop.fs.s3a.path.style.access", "true") \ .set("spark.sql.repl.eagerEval.enabled", "True") \ .set("spark.sql.adaptive.enabled", "True") \ .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \ .set("spark.sql.repl.eagerEval.maxNumRows", "10000") \ .set("sc.setLogLevel", "error") return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate() spark = get_spark_session("Falk", SparkConf()) d3 = spark.read.option("multiline","true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json") import pyspark def sparkShape(dataFrame): return (dataFrame.count(), len(dataFrame.columns)) pyspark.sql.dataframe.DataFrame.shape = sparkShape print(d3.shape()) (653610, 267) d3.write.json("d3.json") d3 = spark.read.json("d3.json/*.json") import pyspark def sparkShape(dataFrame): return (dataFrame.count(), len(dataFrame.columns)) pyspark.sql.dataframe.DataFrame.shape = sparkShape print(d3.shape()) (653610, 186)
Attachments
Attachments
Issue Links
- duplicates
-
SPARK-29444 Add configuration to support JacksonGenrator to keep fields with null values
- Resolved
- links to