Spark / SPARK-36121

Data loss on write caused by stage retry when the v2 FileOutputCommitter is enabled


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Auto Closed
    • Affects Version/s: 2.2.1, 3.0.1
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

    Description

      All of our ETL jobs are configured with mapreduce.fileoutputcommitter.algorithm.version=2. When a shuffle FetchFailed occurs, a stage retry is triggered, and the zombie stage and the retry stage may then run tasks that write the same part at the same time. Their task directories and file names are exactly the same, so conflicting delete and rename operations can cause a data part to be lost.

      For example, here is a data loss case I encountered recently. Stage 5.0 is a zombie stage caused by a shuffle FetchFailed, and stage 5.1 is its retry stage. Each has a task concurrently writing the same part file, part-00298.

      1. The task of stage 5.1 creates the part file part-00298 first and writes its data.
      2. While the task of stage 5.1 is committing, the task of stage 5.0 tries to create the same part file to write its data. Because the file already exists, it throws an exception and deletes the task's temporary directory.
      3. Stage 5.0 then starts commitTask, which traverses the subdirectories and executes the renames. Because the file has already been deleted, the commit moves nothing and raises no exception, which causes the data loss.
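
      The interleaving above can be reproduced with a small local-filesystem model. This is only a sketch under stated assumptions, not the Hadoop or Spark code: the helper names (create_part, abort_task, commit_task), the attempt ID string, and the directory layout are simplified stand-ins, and the concurrent steps are replayed sequentially so the outcome is deterministic.

```python
import os
import shutil
import tempfile


def task_attempt_dir(out_dir, attempt_id):
    # Simplified stand-in for the committer's task attempt directory.
    return os.path.join(out_dir, "_temporary", "0", "_temporary", attempt_id)


def create_part(task_dir, name, data):
    os.makedirs(task_dir, exist_ok=True)
    # Mode "x" fails if the file already exists (create without overwrite).
    with open(os.path.join(task_dir, name), "x") as f:
        f.write(data)


def abort_task(task_dir):
    # A failed task cleans up by deleting its whole temporary directory.
    shutil.rmtree(task_dir, ignore_errors=True)


def commit_task(task_dir, out_dir):
    # v2-style commit in this model: move each file straight into the
    # final output directory. A missing directory moves nothing and
    # raises no error.
    moved = []
    if os.path.isdir(task_dir):
        for name in sorted(os.listdir(task_dir)):
            os.replace(os.path.join(task_dir, name), os.path.join(out_dir, name))
            moved.append(name)
    return moved


def simulate():
    out_dir = tempfile.mkdtemp()
    try:
        # Hypothetical attempt ID: both stage attempts resolve to the same
        # directory because the stage attempt number is not part of it.
        shared = task_attempt_dir(out_dir, "attempt_0005_m_000298_0")

        # Step 1: the first writer creates part-00298 and writes data.
        create_part(shared, "part-00298", "rows")

        # Step 2: the second writer hits "file exists" and aborts,
        # deleting the directory that both writers share.
        try:
            create_part(shared, "part-00298", "rows")
        except FileExistsError:
            abort_task(shared)

        # Step 3: commitTask runs against the now-missing directory.
        moved = commit_task(shared, out_dir)
        final_parts = [n for n in os.listdir(out_dir) if n.startswith("part-")]
        return moved, final_parts
    finally:
        shutil.rmtree(out_dir, ignore_errors=True)
```

      In this model simulate() reports that nothing was moved and that no part file reached the output directory, even though the data was written successfully in step 1.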

       

      After reading this part of the code, I can currently think of two possible fixes:

      1. Add stageAttemptNumber to taskAttemptPath to avoid conflicts.
      2. Check the number of files after commitTask, and throw an exception immediately if any are found to be missing.
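
      Both ideas can be sketched briefly. Everything here is illustrative and assumed rather than taken from the Hadoop or Spark sources: the directory naming scheme, the "_s<n>" suffix, and the names task_attempt_path, MissingOutputError, and check_committed are hypothetical.

```python
import posixpath


def task_attempt_path(output_dir, stage_id, partition, task_attempt, stage_attempt=None):
    # Simplified naming scheme: without a stage attempt number, a zombie
    # stage and its retry map to the same directory.
    name = f"attempt_{stage_id:04d}_m_{partition:06d}_{task_attempt}"
    if stage_attempt is not None:
        # Idea 1: include the stage attempt number (hypothetical suffix).
        name = f"{name}_s{stage_attempt}"
    return posixpath.join(output_dir, "_temporary", "0", "_temporary", name)


class MissingOutputError(Exception):
    """Raised when commitTask moved fewer files than the task wrote."""


def check_committed(expected_count, committed_files):
    # Idea 2: fail loudly instead of silently committing nothing.
    if len(committed_files) < expected_count:
        raise MissingOutputError(
            f"expected {expected_count} file(s) after commitTask, "
            f"found {len(committed_files)}"
        )


# Stage 5.0 and stage 5.1, partition 298, task attempt 0.
zombie = task_attempt_path("/out", 5, 298, 0)
retry = task_attempt_path("/out", 5, 298, 0)
zombie_fixed = task_attempt_path("/out", 5, 298, 0, stage_attempt=0)
retry_fixed = task_attempt_path("/out", 5, 298, 0, stage_attempt=1)
```

      Here zombie and retry are the same path, while zombie_fixed and retry_fixed differ, so an abort in one stage attempt could no longer delete the other's files. The second idea does not prevent the conflict, but calling check_committed(1, []) raises MissingOutputError, which turns the silent loss into a task failure.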

       

       


          People

            Assignee: Unassigned
            Reporter: gaoyajun02
