Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 4.0.0-alpha-1
Description
While investigating an issue where the HMS becomes unresponsive, we noticed many failed attempts by the compaction Cleaner thread to remove obsolete directories, with exceptions similar to the one below.
2022-06-16 05:48:24,819 ERROR org.apache.hadoop.hive.ql.txn.compactor.Cleaner: [Cleaner-executor-thread-0]: Caught exception when cleaning, unable to complete cleaning of id:4410976,dbname:my_database,tableName:my_table,partName:day=20220502,state:,type:MAJOR,enqueueTime:0,start:0,properties:null,runAs:some_user,tooManyAborts:false,hasOldAbort:false,highestWriteId:187502,errorMessage:null
java.io.IOException: Not enough history available for (187502,x). Oldest available base: hdfs://nameservice1/warehouse/tablespace/managed/hive/my_database.db/my_table/day=20220502/base_0188687_v4297872
    at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1432)
    at org.apache.hadoop.hive.ql.txn.compactor.Cleaner.removeFiles(Cleaner.java:261)
    at org.apache.hadoop.hive.ql.txn.compactor.Cleaner.access$000(Cleaner.java:71)
    at org.apache.hadoop.hive.ql.txn.compactor.Cleaner$1.run(Cleaner.java:203)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
    at org.apache.hadoop.hive.ql.txn.compactor.Cleaner.clean(Cleaner.java:200)
    at org.apache.hadoop.hive.ql.txn.compactor.Cleaner.lambda$run$0(Cleaner.java:105)
    at org.apache.hadoop.hive.ql.txn.compactor.CompactorUtil$ThrowingRunnable.lambda$unchecked$0(CompactorUtil.java:54)
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
In addition, the logs contained a large number of long JVM pauses, as shown below, and the HMS resident memory (RSZ) kept increasing at a rate of ~90 MB per hour.
2022-06-16 16:17:17,805 WARN org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 34346ms
2022-06-16 16:17:21,497 INFO org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 1690ms
2022-06-16 16:17:57,696 WARN org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 34697ms
2022-06-16 16:18:01,326 INFO org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 1628ms
2022-06-16 16:18:37,280 WARN org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 34453ms
2022-06-16 16:18:40,927 INFO org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 1646ms
2022-06-16 16:19:16,929 WARN org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 33997ms
2022-06-16 16:19:20,572 INFO org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 1637ms
2022-06-16 16:20:01,643 WARN org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 39329ms
2022-06-16 16:20:05,572 INFO org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 1927ms
We took a heap dump of the HMS around the time it became unresponsive and saw many Configuration objects (~40K) occupying more than 90% of the heap (~9 GB).
Class Name                                    | Objects     | Shallow Heap  | Retained Heap
----------------------------------------------------------------------------------------------
org.apache.hadoop.conf.Configuration          |      39,452 |     1,893,696 | >= 8,560,573,960
java.util.concurrent.ConcurrentHashMap        |     155,863 |     9,975,232 | >= 4,696,003,968
java.util.concurrent.ConcurrentHashMap$Node[] |     139,348 | 1,312,967,944 | >= 4,686,230,296
java.util.Properties                          |      87,119 |     4,181,712 | >= 4,193,638,904
java.util.Hashtable$Entry[]                   |      87,840 |   987,968,472 | >= 4,189,518,928
java.util.concurrent.ConcurrentHashMap$Node   |  99,097,078 | 3,171,106,496 | >= 3,375,319,552
java.util.Hashtable$Entry                     | 100,047,081 | 3,201,506,592 | >= 3,201,551,936
org.postgresql.jdbc.PgConnection              |       6,488 |       830,464 | >=   551,442,952
----------------------------------------------------------------------------------------------
It turns out that these Configuration objects are all referenced by entries of the static CACHE in org.apache.hadoop.fs.FileSystem, as shown below.
Class Name                                                                      | Shallow Heap | Retained Heap
------------------------------------------------------------------------------------------------------------------------
org.apache.hadoop.fs.FileSystem$Cache @ 0x45403fe70                             |           32 |   108,671,824
|- <class> class org.apache.hadoop.fs.FileSystem$Cache @ 0x45410c3e0            |            8 |           544
'- map java.util.HashMap @ 0x453ffb598                                          |           48 |    92,777,232
   |- <class> class java.util.HashMap @ 0x4520382c8 System Class                |           40 |           168
   |- entrySet java.util.HashMap$EntrySet @ 0x454077848                         |           16 |            16
   '- table java.util.HashMap$Node[32768] @ 0x463585b68                         |      131,088 |    92,777,168
      |- class java.util.HashMap$Node[] @ 0x4520b7790                           |            0 |             0
      '- [1786] java.util.HashMap$Node @ 0x451998ce0                            |           32 |         9,968
         |- <class> class java.util.HashMap$Node @ 0x4520b7728 System Class     |            8 |            32
         '- value org.apache.hadoop.hdfs.DistributedFileSystem @ 0x452990178    |           56 |         4,976
            |- <class> class org.apache.hadoop.hdfs.DistributedFileSystem @ 0x45402e290 |    8 |         4,664
            |- uri java.net.URI @ 0x451a05cd0 hdfs://nameservice1               |           80 |           432
            |- dfs org.apache.hadoop.hdfs.DFSClient @ 0x451f5d9b8               |          128 |         3,824
            '- conf org.apache.hadoop.hive.conf.HiveConf @ 0x453a34b38          |           80 |       250,160
------------------------------------------------------------------------------------------------------------------------
As long as they are in the CACHE they cannot be garbage collected, so this leads to a memory leak.
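For illustration, the snippet below is a minimal sketch, not the actual Hive code, of how the Hadoop FileSystem cache grows: FileSystem.get() keys cached instances by scheme, authority, and UGI; UGI equality is identity-based, so every freshly created proxy UGI is a distinct cache key; and each cached instance holds on to the Configuration it was created with. The class name, loop, and user name are made up for the example; the URI and user come from the log above.

import java.net.URI;
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class FsCacheGrowth {
  public static void main(String[] args) throws Exception {
    URI warehouse = URI.create("hdfs://nameservice1/");
    for (int i = 0; i < 3; i++) {
      // A fresh proxy UGI per task; UGI uses identity-based equality,
      // so each one is a brand-new FileSystem.Cache key.
      UserGroupInformation ugi = UserGroupInformation
          .createProxyUser("some_user", UserGroupInformation.getLoginUser());
      Configuration conf = new Configuration();
      // FileSystem.get() inserts a new DistributedFileSystem, which keeps a
      // reference to conf, into the static CACHE; nothing evicts the entry
      // unless FileSystem.closeAllForUGI(ugi) (or fs.close()) is called.
      ugi.doAs((PrivilegedExceptionAction<FileSystem>) () ->
          FileSystem.get(warehouse, conf));
    }
  }
}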
The memory leak seems to come from the fact that the compaction Cleaner attempts to remove the obsolete files and fails. The exception prevents the filesystem cleanup from taking place, so FileSystem entries, and with them their Configuration objects, are left behind in the CACHE.
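The sketch below is a hypothetical, simplified rendering of that flow; the method names loosely follow Cleaner.clean and Cleaner.removeFiles from the stack trace above, while the surrounding class and the stub body are illustrative. If closeAllForUGI is invoked only on the success path, the exception strands the cached FileSystem; running it in a finally block is one way to release the entry on the failure path as well.

import java.io.IOException;
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class CleanerFlowSketch {
  // Stub standing in for Cleaner.removeFiles(); in the reported case it threw
  // "java.io.IOException: Not enough history available for (187502,x)".
  static void removeFiles(String location) throws IOException {
    throw new IOException("Not enough history available");
  }

  static void clean(String runAs, String location) throws Exception {
    UserGroupInformation ugi = UserGroupInformation
        .createProxyUser(runAs, UserGroupInformation.getLoginUser());
    try {
      ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
        removeFiles(location);  // any FileSystem.get() inside populates the CACHE
        return null;
      });
    } finally {
      // If this ran only on the success path, then on the exception path the
      // FileSystem cached for the fresh proxy UGI, and the Configuration it
      // references, would stay in FileSystem$Cache forever.
      try {
        FileSystem.closeAllForUGI(ugi);
      } catch (IOException e) {
        // Log and continue; don't mask the original cleaning failure.
      }
    }
  }
}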
Although the HMS unresponsiveness in this use case may not be due to lack of memory, the leak needs to be addressed to avoid hitting an OOM.