Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
2.1.0
-
None
Description
If you submit an app that is supposed to run for > 7 days (so using -\principal / --keytab in cluster mode), and there's a failure that causes the driver to restart after 7 days (that being the default token lifetime for HDFS), the new driver will fail with an error like the following:
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (lots of uninteresting token info) can't be found in cache at org.apache.hadoop.ipc.Client.call(Client.java:1472) at org.apache.hadoop.ipc.Client.call(Client.java:1409) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123) at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253) at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$6.apply(ApplicationMaster.scala:160)
Note: lines may not align with actual Apache code because that's our internal build.
This happens because as part of the app submission, the launcher provides delegation tokens to be used by the AM (=driver in this case), and those are expired at that point in time.
Attachments
Issue Links
- is duplicated by
-
SPARK-27515 [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired
- Resolved
- is related to
-
SPARK-27891 Long running spark jobs fail because of HDFS delegation token expires
- Resolved
-
SPARK-23781 Merge YARN and Mesos token renewal code
- Resolved
- links to