SPARK-31177: DataFrameReader.csv incorrectly reads gzip-encoded CSV from S3 when it has a non-".gz" extension


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.4
    • Fix Version/s: None
    • Component/s: Input/Output

    Description

      I have large CSV files that are gzipped and uploaded to S3 with Content-Encoding=gzip. The files have the file extension ".csv", since most web clients will automatically decompress them based on the Content-Encoding response header. Reading these CSV files with pyspark does not mimic this behavior.
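
      For comparison, an ordinary HTTP client honors the header regardless of the file extension. A minimal illustration (the URL is a placeholder for a readable object):

      import requests

      # requests (via urllib3) transparently decompresses the response body
      # when it carries Content-Encoding: gzip, whatever the file extension.
      resp = requests.get('https://bucket.s3.amazonaws.com/large.csv')
      text = resp.text  # already-decompressed CSV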

      Works as expected:

      df = spark.read.csv('s3://bucket/large.csv.gz', header=True)
      

      Does not decompress, and instead tries to load the entire contents of the file as the first row:

      df = spark.read.csv('s3://bucket/large.csv', header=True)
      

      It looks like Spark relies on the file extension to determine whether the file is gzip-compressed. It would be great if S3 resources, and any other HTTP-based resources, could consult the Content-Encoding response header as well.
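
      In the meantime, here is a workaround sketch that sidesteps the extension check: read the raw bytes, gunzip them manually, and hand the decoded rows to the CSV reader. Paths are placeholders, and each file is materialized in executor memory, so this only suits files that fit:

      import gzip

      # binaryFiles yields (path, bytes) pairs with no codec applied.
      raw = spark.sparkContext.binaryFiles('s3://bucket/large.csv')

      # Decompress by hand and split into CSV rows.
      lines = raw.flatMap(lambda pair: gzip.decompress(pair[1]).decode('utf-8').splitlines())

      # DataFrameReader.csv also accepts an RDD of strings.
      df = spark.read.csv(lines, header=True)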

      I tried to find the code that determines this, but I'm not familiar with the code base. Any pointers would be helpful, and I can look into fixing it.
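
      My guess at the mechanism (an assumption, not confirmed against the source): the codec is chosen purely from the path suffix, roughly like this toy model, so "large.csv" falls through to the plain-text path:

      import gzip
      import io

      # Toy model of suffix-based codec selection; not the actual implementation.
      CODECS = {'.gz': lambda data: gzip.open(io.BytesIO(data), mode='rt')}

      def open_for_read(path, raw_bytes):
          for suffix, opener in CODECS.items():
              if path.endswith(suffix):
                  return opener(raw_bytes)
          # No matching suffix: the compressed bytes are read as-is, which is
          # why the whole gzip stream ends up in the first "row".
          return io.StringIO(raw_bytes.decode('utf-8', errors='replace'))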


          People

            Assignee: Unassigned
            Reporter: Mark Waddle
