SPARK-28558

DatasetWriter partitionBy is changing the group file permissions in 2.4 for parquets

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels:
      None
    • Environment:

      Hadoop 2.7

      Scala 2.11

      Tested:

      • Spark 2.3.3 - Works
      • Spark 2.4.x - All have the same issue

      Description

      When writing Parquet with partitionBy, the group permissions on the partition directories are changed, as shown below: the directories written by 2.4 show "S" instead of "s" in the group triplet, meaning the setgid bit is still set but group execute is missing. This causes members of the group to get "org.apache.hadoop.security.AccessControlException: Open failed for file.... error: Permission denied (13)"

      This worked in 2.3. As a workaround, setting "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" gives the correct behaviour.
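
      For reference, a minimal sketch of applying the workaround when building the session. The object name, app name, and the output path argument are hypothetical; only the configuration key and value come from this report. Note that committer algorithm version 2 also changes how task output is committed, so it may not suit every job.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; pass the output directory as the first argument.
object PartitionByWorkaround {
  def main(args: Array[String]): Unit = {
    val outputPath = args(0) // hypothetical: target directory for the Parquet output

    val spark = SparkSession.builder()
      .appName("partitionBy-permissions-workaround") // hypothetical name
      // Workaround from this report: use FileOutputCommitter algorithm version 2.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()
    import spark.implicits._

    // Same write as in the reproduction code of this report.
    Seq(("H", 1), ("I", 2))
      .toDF("Letter", "Number")
      .write
      .partitionBy("Letter")
      .parquet(outputPath)

    spark.stop()
  }
}
```

      The same setting can also be passed on the command line with --conf instead of being hard-coded.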

       

      Code I used to reproduce issue:

      // Run in spark-shell; a standalone application also needs: import spark.implicits._
      Seq(("H", 1), ("I", 2))
        .toDF("Letter", "Number")
        .write
        .partitionBy("Letter")
        .parquet(...)

       

      sparktesting$ tree -dp
      ├── [drwxrws---]  letter_testing2.3-defaults
      │   ├── [drwxrws---]  Letter=H
      │   └── [drwxrws---]  Letter=I
      ├── [drwxrws---]  letter_testing2.4-defaults
      │   ├── [drwxrwS---]  Letter=H
      │   └── [drwxrwS---]  Letter=I
      └── [drwxrws---]  letter_testing2.4-file-writer2
          ├── [drwxrws---]  Letter=H
          └── [drwxrws---]  Letter=I
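
      The listing above can also be checked programmatically. The following is a rough sketch, assuming the output directory is visible through a POSIX file system (as the use of tree suggests); the object and method names are hypothetical. It lists the immediate subdirectories that lack the group execute bit, which is what the "S" in drwxrwS--- indicates.

```scala
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.PosixFilePermission
import scala.collection.JavaConverters._

// Hypothetical helper, not part of Spark or Hadoop.
object CheckGroupExecute {
  // Returns the immediate subdirectories of `root` without group execute, sorted by path.
  def missingGroupExecute(root: String): Seq[String] = {
    val stream = Files.newDirectoryStream(Paths.get(root))
    try {
      stream.asScala.toList
        .filter(p => Files.isDirectory(p))
        .filterNot { p =>
          Files.getPosixFilePermissions(p).contains(PosixFilePermission.GROUP_EXECUTE)
        }
        .map(_.toString)
        .sorted
    } finally {
      stream.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Defaults to the current directory when no path is given.
    val root = args.headOption.getOrElse(".")
    missingGroupExecute(root).foreach(println)
  }
}
```

      Run against letter_testing2.4-defaults, this would be expected to print both Letter=H and Letter=I; for the other two directories it should print nothing.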

            People

            • Assignee:
              Unassigned
              Reporter:
              spearson Stephen Pearson
            • Votes:
              1
              Watchers:
              3
