Hive / HIVE-14258

Reduce task timed out because CommonJoinOperator.genUniqueJoinObject took too long to finish without reporting progress


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.1.0
    • Component/s: Query Processor
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      The reduce task timed out because CommonJoinOperator.genUniqueJoinObject took too long to finish without reporting progress.
      The timeout happened when reducer.close() was called in ReduceTask.java.
      CommonJoinOperator.genUniqueJoinObject(), called from reducer.close(), loops over every row in the AbstractRowContainer. This can take a long time when there are a large number of rows, and during this time it does not report progress. If it runs for longer than "mapreduce.task.timeout", the ApplicationMaster kills the task for failing to report progress.
      We configured "mapreduce.task.timeout" as 10 minutes. I captured stack traces during the 10 minutes before the AM killed the reduce task at 2016-07-15 07:19:11.
      The following three stack traces demonstrate this:
      at 2016-07-15 07:09:42:

      "main" prio=10 tid=0x00007f90ec017000 nid=0xd193 runnable [0x00007f90f62e5000]
         java.lang.Thread.State: RUNNABLE
              at java.io.FileInputStream.readBytes(Native Method)
              at java.io.FileInputStream.read(FileInputStream.java:272)
              at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:154)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
              at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
              - locked <0x00000007deecefb0> (a org.apache.hadoop.fs.BufferedFSInputStream)
              at java.io.DataInputStream.read(DataInputStream.java:149)
              at org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:436)
              at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:252)
              at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:276)
              at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:214)
              at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:232)
              at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196)
              - locked <0x00000007deecb978> (a org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker)
              at java.io.DataInputStream.readFully(DataInputStream.java:195)
              at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:70)
              at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:120)
              at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2359)
              - locked <0x00000007deec8f70> (a org.apache.hadoop.io.SequenceFile$Reader)
              at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2491)
              - locked <0x00000007deec8f70> (a org.apache.hadoop.io.SequenceFile$Reader)
              at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
              - locked <0x00000007deec82f0> (a org.apache.hadoop.mapred.SequenceFileRecordReader)
              at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.nextBlock(RowContainer.java:360)
              at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:267)
              at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:74)
              at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:644)
              at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:750)
              at org.apache.hadoop.hive.ql.exec.JoinOperator.endGroup(JoinOperator.java:256)
              at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:284)
              at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
              at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
              at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
      

      at 2016-07-15 07:15:35

      "main" prio=10 tid=0x00007f90ec017000 nid=0xd193 runnable [0x00007f90f62e5000]
         java.lang.Thread.State: RUNNABLE
              at java.util.zip.CRC32.updateBytes(Native Method)
              at java.util.zip.CRC32.update(CRC32.java:65)
              at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:316)
              at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:279)
              at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:214)
              at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:232)
              at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196)
              - locked <0x00000007d68db510> (a org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker)
              at java.io.DataInputStream.readFully(DataInputStream.java:195)
              at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:70)
              at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:120)
              at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2359)
              - locked <0x00000007d68d8b68> (a org.apache.hadoop.io.SequenceFile$Reader)
              at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2491)
              - locked <0x00000007d68d8b68> (a org.apache.hadoop.io.SequenceFile$Reader)
              at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
              - locked <0x00000007d68d7f08> (a org.apache.hadoop.mapred.SequenceFileRecordReader)
              at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.nextBlock(RowContainer.java:360)
              at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:267)
              at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:74)
              at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:644)
              at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:750)
              at org.apache.hadoop.hive.ql.exec.JoinOperator.endGroup(JoinOperator.java:256)
              at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:284)
              at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
              at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
              at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
              at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
      

      at 2016-07-15 07:19:10

      "main" prio=10 tid=0x00007f90ec017000 nid=0xd193 runnable [0x00007f90f62e5000]
         java.lang.Thread.State: RUNNABLE
              at java.io.FileInputStream.readBytes(Native Method)
              at java.io.FileInputStream.read(FileInputStream.java:272)
              at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:154)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
              at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
              - locked <0x00000007df731218> (a org.apache.hadoop.fs.BufferedFSInputStream)
              at java.io.DataInputStream.read(DataInputStream.java:149)
              at org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:436)
              at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:252)
              at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:276)
              at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:214)
              at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:232)
              at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196)
              - locked <0x00000007df72dc20> (a org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker)
              at java.io.DataInputStream.readFully(DataInputStream.java:195)
              at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:70)
              at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:120)
              at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2359)
              - locked <0x00000007df72b278> (a org.apache.hadoop.io.SequenceFile$Reader)
              at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2491)
              - locked <0x00000007df72b278> (a org.apache.hadoop.io.SequenceFile$Reader)
              at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
              - locked <0x00000007df72a618> (a org.apache.hadoop.mapred.SequenceFileRecordReader)
              at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.nextBlock(RowContainer.java:360)
              at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.nextBlock(RowContainer.java:373)
              at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:267)
              at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:74)
              at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:644)
              at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:750)
              at org.apache.hadoop.hive.ql.exec.JoinOperator.endGroup(JoinOperator.java:256)
              at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:284)
              at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
              at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
              at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
              at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
      

      All three stack traces show CommonJoinOperator.genUniqueJoinObject being called from ExecReducer.close(), still iterating over the row container the whole time.
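      The fix idea is to report progress periodically inside the row-iteration loop so the ApplicationMaster sees the task is alive. A minimal, self-contained sketch of that pattern follows; `ProgressReporter`, `drainWithProgress`, and the reporting interval are hypothetical stand-ins, not the actual Hive patch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ProgressLoopSketch {
    // Hypothetical stand-in for Hadoop's progress-reporting hook.
    interface ProgressReporter { void progress(); }

    // Drain an iterator of rows, reporting progress every `interval` rows
    // so a long loop cannot exceed mapreduce.task.timeout silently.
    static <T> long drainWithProgress(Iterator<T> rows, ProgressReporter rep, int interval) {
        long count = 0;
        while (rows.hasNext()) {
            rows.next();          // stand-in for forwarding one joined row
            count++;
            if (count % interval == 0) {
                rep.progress();   // heartbeat to the ApplicationMaster
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) rows.add(i);
        final long[] reports = {0};
        long n = drainWithProgress(rows.iterator(), () -> reports[0]++, 1000);
        System.out.println(n + " rows, " + reports[0] + " progress reports");
    }
}
```

      With 10,000 rows and an interval of 1,000 this reports progress 10 times; any interval short enough to fire well within the configured timeout would do.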

      Attachments

        1. HIVE-14258.patch
          0.7 kB
          Zhihai Xu


          People

            Assignee: Zhihai Xu (zxu)
            Reporter: Zhihai Xu (zxu)
            Votes: 0
            Watchers: 3
