Hive / HIVE-14344

Intermittent failures caused by leaking delegation tokens

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.2.1, 2.1.0
    • Fix Version/s: None
    • Component/s: Tez
    • Labels: None

    Description

      We have experienced intermittent job failures caused by leaking delegation tokens. A Tez child task fails because it attempts to read from the delegation token directory of a different (related) task.

      A failure produces a stack trace like the following:

      2016-07-21 16:57:18,061 [FATAL] [TezChild] |tez.ReduceRecordSource|: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) 
      	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
      	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
      	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:249)
      	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
      	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
      	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
      	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
      	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
      	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
      	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
      	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.io.IOException: Exception reading file:/grid/4/tmp/yarn-local/usercache/.../appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens
      	at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.first(RowContainer.java:237)
      	at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.first(RowContainer.java:74)
      	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:650)
      	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:756)
      	at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:316)
      	at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:279)
      	at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:272)
      	at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:258)
      	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
      	... 17 more
      Caused by: java.lang.RuntimeException: java.io.IOException: Exception reading file:/grid/4/tmp/yarn-local/usercache/.../appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens
      	at org.apache.hadoop.mapreduce.security.TokenCache.mergeBinaryTokens(TokenCache.java:141)
      	at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:119)
      	at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
      	at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
      	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
      	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
      	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
      	at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.first(RowContainer.java:222)
      	... 25 more
      Caused by: java.io.IOException: Exception reading file:/grid/4/tmp/yarn-local/usercache/.../appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens
      	at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:175)
      	at org.apache.hadoop.mapreduce.security.TokenCache.mergeBinaryTokens(TokenCache.java:136)
      	... 32 more
      Caused by: java.io.FileNotFoundException: File file:/grid/4/tmp/yarn-local/usercache/.../appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens does not exist
      	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
      	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
      	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
      	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
      	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
      	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
      	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
      	at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:170)
      	... 33 more
      

      The application that failed was application_1468602386465_489844, but the missing file it reports belongs to a different application, application_1468602386465_489814: appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens.
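
      The mismatch described above can be checked mechanically when triaging task logs. The following is a minimal Python sketch, not part of the attached patch: it scans log text for container_tokens paths and reports those whose application ID differs from the failing application's. The function name foreign_token_paths and the regular expressions are illustrative assumptions.

```python
import re

# Matches YARN application IDs such as application_1468602386465_489814.
APP_ID_RE = re.compile(r"application_\d+_\d+")
# Matches a local path from "file:" up to and including "container_tokens".
TOKEN_PATH_RE = re.compile(r"file:\S*container_tokens")


def foreign_token_paths(failing_app_id, log_text):
    """Return container_tokens paths in log_text owned by another application."""
    foreign = []
    for line in log_text.splitlines():
        match = TOKEN_PATH_RE.search(line)
        if match is None:
            continue
        path = match.group(0)
        owner = APP_ID_RE.search(path)
        # Keep each foreign path once, in order of first appearance.
        if owner and owner.group(0) != failing_app_id and path not in foreign:
            foreign.append(path)
    return foreign


# Sample line taken from the stack trace in this report (path elision as given).
sample = (
    "Caused by: java.io.FileNotFoundException: File file:/grid/4/tmp/yarn-local/"
    "usercache/.../appcache/application_1468602386465_489814/"
    "container_e02_1468602386465_489814_01_000001/container_tokens does not exist"
)
print(foreign_token_paths("application_1468602386465_489844", sample))
```

      An empty result means every token path in the log belongs to the failing application itself; a non-empty result matches the symptom reported here.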

      This seems to manifest only when Hive is run as a HiveAction through Oozie.

      Attachments

        1. HIVE-14344.patch (2 kB, Chris Drome)
        2. HIVE-14344-branch-1.patch (2 kB, Chris Drome)

          People

            Assignee: Chris Drome (cdrome)
            Reporter: Chris Drome (cdrome)
            Votes: 0
            Watchers: 1
