Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.7.2, 2.8.3, 2.7.5, 3.0.1, 3.1.1
None
None
Description
YARN task execution failed because an excessive number of files under the yarn.nodemanager.local-dirs path exhausted the available inodes, which caused the compute tasks to fail.
Checking the NM logs, we found that deletion of many localized dirs failed because the user was not found in the security system.
2018-12-21 06:06:40,723 | INFO | AsyncDispatcher event handler | Cache Size Before Clean: 240859897, Total Deleted: 85003, Public Deleted: 0, Private Deleted: 85003 | ResourceLocalizationService.java:522
2018-12-21 06:06:40,744 | ERROR | DeletionService #1 | DeleteAsUser for /srv/BigData/hadoop/data1/nm/localdir/usercache/odaeuser/filecache/48339 returned with exit code: 255 | LinuxContainerExecutor.java:565
ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:664)
at org.apache.hadoop.util.Shell.run(Shell.java:553)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:866)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:559)
at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:276)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-12-21 06:06:40,744 | ERROR | DeletionService #1 | Output from LinuxContainerExecutor's deleteAsUser follows: | LinuxContainerExecutor.java:567
2018-12-21 06:06:40,744 | INFO | DeletionService #1 | main : command provided 3 | ContainerExecutor.java:322
2018-12-21 06:06:40,744 | INFO | DeletionService #1 | main : run as user is odaeuser | ContainerExecutor.java:322
2018-12-21 06:06:40,744 | INFO | DeletionService #1 | main : requested yarn user is odaeuser | ContainerExecutor.java:322
2018-12-21 06:06:40,744 | INFO | DeletionService #1 | User odaeuser not found | ContainerExecutor.java:322
2018-12-21 06:06:40,745 | INFO | DeletionService #1 | Deleting absolute path : /srv/BigData/hadoop/data1/nm/localdir/usercache/odaeuser/filecache/48342 | LinuxContainerExecutor.java:543
2018-12-21 06:06:40,749 | ERROR | DeletionService #2 | DeleteAsUser for /srv/BigData/hadoop/data1/nm/localdir/usercache/odaeuser/filecache/48334 returned with exit code: 255 | LinuxContainerExecutor.java:565
ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:664)
at org.apache.hadoop.util.Shell.run(Shell.java:553)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:866)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:559)
at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:276)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-12-21 06:06:40,749 | ERROR | DeletionService #2 | Output from LinuxContainerExecutor's deleteAsUser follows: | LinuxContainerExecutor.java:567
2018-12-21 06:06:40,749 | INFO | DeletionService #2 | main : command provided 3 | ContainerExecutor.java:322
2018-12-21 06:06:40,749 | INFO | DeletionService #2 | main : run as user is odaeuser | ContainerExecutor.java:322
2018-12-21 06:06:40,749 | INFO | DeletionService #2 | main : requested yarn user is odaeuser | ContainerExecutor.java:322
2018-12-21 06:06:40,749 | INFO | DeletionService #2 | User odaeuser not found | ContainerExecutor.java:322
The local dir files actually total 4.4 GB, not the 240859897 B (about 230 MB) printed in the log.
The "user not found" error occurs because our user info is stored in an LDAP DB. When the LDAP service has a problem, the user lookup fails (not because the user was deleted). Once the LDAP server recovers, the user info can be retrieved again.
The problem is that even though the user info can be retrieved later, the dirs whose deletion failed are never deleted afterwards, because they have already been removed from the tracker list. This causes dirs to accumulate.
I think the NM ResourceLocalizationService should check whether the DeletionService thread actually deleted the file before removing the directory from the tracker list and LevelDB. If the deletion failed, the directory should be added back to the tracker list, and the cleaner should move on to the next dirs until the local dir size is below yarn.nodemanager.localizer.cache.target-size-mb.
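A minimal sketch of the proposed behavior follows. It is not the actual NodeManager code: the class TrackedCacheCleaner, its methods, and the synchronous deleter callback are all hypothetical, and the real DeletionService runs deletions asynchronously. It only illustrates the idea that an entry leaves the tracker once its deletion is confirmed, so a failed deletion can be retried on a later cleanup pass.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical stand-in for the NM cache tracker; names are illustrative only. */
class TrackedCacheCleaner {
    // path -> size in bytes, kept in insertion order (oldest entry first)
    private final Map<String, Long> tracker = new LinkedHashMap<>();
    private final long targetSizeBytes;

    TrackedCacheCleaner(long targetSizeBytes) {
        this.targetSizeBytes = targetSizeBytes;
    }

    void track(String path, long sizeBytes) {
        tracker.put(path, sizeBytes);
    }

    boolean isTracked(String path) {
        return tracker.containsKey(path);
    }

    long currentSize() {
        long total = 0;
        for (long size : tracker.values()) {
            total += size;
        }
        return total;
    }

    /**
     * Walks the tracked dirs oldest-first until the tracked size is at or
     * below the target. The deleter returns true only if the dir was really
     * removed. Dirs whose deletion failed stay tracked and are returned.
     */
    List<String> clean(Predicate<String> deleter) {
        List<String> failed = new ArrayList<>();
        long size = currentSize();
        Iterator<Map.Entry<String, Long>> it = tracker.entrySet().iterator();
        while (size > targetSizeBytes && it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (deleter.test(entry.getKey())) {
                size -= entry.getValue();
                it.remove();                 // untrack only after confirmed deletion
            } else {
                failed.add(entry.getKey());  // keep tracked, try the next dir
            }
        }
        return failed;
    }
}
```

With this shape, a pass that runs while the LDAP lookup is failing leaves the affected dirs in the tracker, and a later pass after LDAP recovers picks them up again instead of leaking them on disk.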