  1. Hive
  2. HIVE-27985

Avoid duplicate files.


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0
    • Fix Version/s: None
    • Component/s: Tez
    • Labels: None

    Description

      1 Introduction
      Hive on Tez occasionally produces duplicate files, especially when speculative execution is enabled. Hive identifies and removes duplicate files through removeTempOrDuplicateFiles, but this logic often does not take effect. For example, a killed task attempt may commit files while this method is running, or the files under HIVE_UNION_SUBDIR_X may not be recognized during a union all. Many existing issues address these problems, mainly by improving how duplicate files are identified. This issue instead addresses the problem by avoiding the generation of duplicate files in the first place.

      2 How does Tez avoid duplicate files?

      After testing, I found that the Hadoop MapReduce examples and the Tez examples do not have this problem. With a properly designed OutputCommitter, duplicate files can be avoided. Let's analyze how Tez avoids them.

       Note: Compared with Tez, Hadoop MapReduce has one additional step, commitPending. It is not critical here, so only Tez is analyzed.

       

      Let's analyze each step of the commit flow (see the attachment how tez examples commit.png):

      • (1) process records: The task attempt processes its records.
      • (2) send canCommit request: After all records are processed, the task attempt sends a remote canCommit request to the AM.
      • (3) update commitAttempt: When the AM receives the canCommit request, it checks whether another task attempt of the same task has already called canCommit. If no other task attempt called canCommit first, it returns true; otherwise it returns false. This ensures that only one task attempt commits for each task.
      • (4) return canCommit response: The task attempt receives the AM's response. True means it may commit. False means another task attempt has already been granted the commit, so this one must not commit; it goes back to (2) and keeps calling canCommit until it is killed or the other task attempt fails.
      • (5) output.commit: The task attempt performs the commit, that is, it renames the generated temporary file to the final file.
      • (6) notify succeeded: Although the task attempt has produced the final file, the AM still needs to be told that the work is done, so the task attempt notifies the AM through a heartbeat that it has completed.
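
      The canCommit handshake in steps (2) to (4) can be sketched as follows. This is a minimal illustration, not Tez's actual implementation: the class names CommitArbiter and TaskAttempt, the attemptFailed method, and the bounded polling loop are all assumptions made for the example (a real task attempt polls until it is killed).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical AM-side bookkeeping: remembers which attempt of each task may commit. */
class CommitArbiter {
    // task id -> attempt number that was granted the commit
    private final Map<String, Integer> commitAttempt = new HashMap<>();

    /** Step (3): grant the commit to the first attempt that asks; refuse all others. */
    synchronized boolean canCommit(String taskId, int attemptNumber) {
        Integer granted = commitAttempt.putIfAbsent(taskId, attemptNumber);
        return granted == null || granted == attemptNumber;
    }

    /** Assumed hook: if the granted attempt fails, release the slot for another attempt. */
    synchronized void attemptFailed(String taskId, int attemptNumber) {
        commitAttempt.remove(taskId, attemptNumber);
    }
}

/** Hypothetical task-side helper for steps (2) and (4). */
class TaskAttempt {
    /** Polls canCommit up to maxPolls times; returns true once the AM grants the commit. */
    static boolean waitForCommit(CommitArbiter am, String taskId, int attempt, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (am.canCommit(taskId, attempt)) {
                return true; // proceed to step (5), output.commit
            }
        }
        return false; // never granted within the polling budget
    }
}
```

      With this arbiter, a speculative second attempt is refused for as long as the first attempt holds the grant, and only gets through once the first attempt is reported as failed.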

      There is a problem in the steps above. If an exception occurs in the task attempt after (5) and before (6), the AM does not know that the task attempt has completed, so it starts a new task attempt. The new task attempt generates a new file, which could cause duplication. I added code that randomly throws exceptions between (5) and (6), and found that the Tez examples still did not produce duplicate data. Why? Mainly because the final file name is the same no matter which task attempt generates it. When a new task attempt commits and finds that the final file already exists (generated by the previous task attempt), it first deletes that file and then renames its own. Regardless of whether the previous task attempt committed normally, the last successful task attempt clears the previous results.
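
      The delete-then-rename behavior can be illustrated with a small sketch. It assumes the local filesystem (java.nio.file) as a stand-in for HDFS, and the class IdempotentCommit and the "part-N" naming are hypothetical, chosen only so that the final name depends on the task and not on the attempt.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class IdempotentCommit {
    /** Hypothetical naming: the final name depends only on the task index. */
    static Path finalPath(Path outputDir, int taskIndex) {
        return outputDir.resolve("part-" + taskIndex);
    }

    /** Commit: delete any final file left by an earlier attempt, then rename the temp file. */
    static void commit(Path tempFile, Path outputDir, int taskIndex) {
        try {
            Path target = finalPath(outputDir, taskIndex);
            Files.deleteIfExists(target);
            Files.move(tempFile, target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Two attempts of task 0 commit in turn; returns "name: content" for each final file. */
    static List<String> demo() {
        try {
            Path out = Files.createTempDirectory("out");
            Path tmp = Files.createTempDirectory("tmp");
            for (int attempt = 0; attempt < 2; attempt++) {
                Path temp = tmp.resolve("attempt_" + attempt);
                Files.write(temp, ("rows from attempt " + attempt).getBytes(StandardCharsets.UTF_8));
                commit(temp, out, 0);
            }
            List<Path> finals;
            try (Stream<Path> files = Files.list(out)) {
                finals = files.collect(Collectors.toList());
            }
            List<String> result = new ArrayList<>();
            for (Path p : finals) {
                String content = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                result.add(p.getFileName() + ": " + content);
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

      Calling IdempotentCommit.demo() leaves a single final file holding the second attempt's rows, because the second commit replaces whatever the first attempt left behind.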

      To summarize, tez-examples uses two mechanisms to avoid duplicate files:

      • (1) It avoids repeated commits through canCommit. This is particularly effective for tasks with speculative execution turned on.
      • (2) Different task attempts of the same task generate the same final file name. Combined with canCommit, this guarantees that only one file is generated in the end, and that it comes from a successful task attempt.

      3 Why can't Hive on Tez avoid duplicate files?

      Hive on Tez has neither of the two mechanisms used by the Tez examples.
      First, Hive on Tez does not call canCommit. TezProcessor inherits from AbstractLogicalIOProcessor, whereas the canCommit logic in the Tez examples lives mainly in SimpleMRProcessor.
      Second, different task attempts in Hive on Tez do not generate the same file name. The file generated by the first attempt of a task is 000000_0, and the file generated by the second attempt is 000000_1.
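
      The effect of the attempt suffix can be shown with a short sketch. The 000000_0 and 000000_1 names come from the description above; the class AttemptFileNames, its methods, and the attempt-independent name 000000 are hypothetical and are not taken from Hive's code.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class AttemptFileNames {
    /** Naming described above for Hive on Tez: task index plus attempt number. */
    static String withAttemptSuffix(int taskIndex, int attempt) {
        return String.format(Locale.ROOT, "%06d_%d", taskIndex, attempt);
    }

    /** Hypothetical alternative: the name depends on the task index only. */
    static String attemptIndependent(int taskIndex) {
        return String.format(Locale.ROOT, "%06d", taskIndex);
    }

    /** Final names left in the output directory if every attempt of one task commits. */
    static Set<String> finalFiles(int taskIndex, int attempts, boolean attemptInName) {
        Set<String> dir = new TreeSet<>();
        for (int a = 0; a < attempts; a++) {
            dir.add(attemptInName
                    ? withAttemptSuffix(taskIndex, a)
                    : attemptIndependent(taskIndex));
        }
        return dir;
    }
}
```

      With the attempt number in the name, two committed attempts leave two distinct files, so a later commit cannot overwrite an earlier one. With an attempt-independent name, they collapse into a single entry.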

      4 How to improve?

      Use canCommit to ensure that speculative task attempts do not commit at the same time. (HIVE-27899)
      Make all task attempts of a task generate the same final file name. (HIVE-27986)

      Attachments

        1. how tez examples commit.png
          71 kB
          Chenyu Zheng

          People

            zhengchenyu Chenyu Zheng
            zhengchenyu Chenyu Zheng