Description
The extractObjectCache in UDFJson grows beyond its limit (CACHE_SIZE = 16) when multiple queries that use get_json_object or get_json_tuple run concurrently in HS2 local mode (not MR/Tez).
HS2 heap dump:
Object at 0x515ab18f8
instance of org.apache.hadoop.hive.ql.udf.UDFJson$HashCache@0x515ab18f8 (77 bytes)
Class: class org.apache.hadoop.hive.ql.udf.UDFJson$HashCache
Instance data members:
accessOrder (Z) : false
entrySet (L) : <null>
hashSeed (I) : 0
header (L) : java.util.LinkedHashMap$Entry@0x515a577d0 (60 bytes)
keySet (L) : <null>
loadFactor (F) : 0.6
modCount (I) : 4741146
size (I) : 2733158 <========== here!!
table (L) : [Ljava.util.HashMap$Entry;@0x7163d8b70 (67108880 bytes)
threshold (I) : 5033165
values (L) : <null>
References to this object:
I think this problem is caused by LinkedHashMap not being thread-safe. The LinkedHashMap Javadoc states:
 * <p><strong>Note that this implementation is not synchronized.</strong>
 * If multiple threads access a linked hash map concurrently, and at least
 * one of the threads modifies the map structurally, it <em>must</em> be
 * synchronized externally. This is typically accomplished by
 * synchronizing on some object that naturally encapsulates the map.
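To illustrate the external synchronization the Javadoc asks for, here is a minimal sketch. It is not the attached patch. It assumes the cache is a LinkedHashMap subclass that evicts through removeEldestEntry; the load factor 0.6 and insertion order mirror the heap dump above, while the initial capacity of 32, the class name BoundedCacheSketch, and the helpers newSynchronizedCache and fillConcurrently are hypothetical names chosen for this example.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedCacheSketch {
    static final int CACHE_SIZE = 16;

    // Insertion-ordered map that drops its eldest entry once it holds more
    // than CACHE_SIZE entries. Initial capacity 32 is an assumption.
    static class HashCache<K, V> extends LinkedHashMap<K, V> {
        private static final long serialVersionUID = 1L;

        HashCache() {
            super(32, 0.6f);
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > CACHE_SIZE;
        }
    }

    // Hypothetical helper: every call on the returned map is serialized on
    // one lock, so a put and the eviction it triggers cannot interleave
    // with another thread's put.
    static <K, V> Map<K, V> newSynchronizedCache() {
        return Collections.synchronizedMap(new HashCache<K, V>());
    }

    // Hypothetical helper: several threads insert distinct keys into the
    // shared cache; returns the size after all of them have finished.
    static int fillConcurrently(Map<String, Object> cache, int threads, int putsPerThread)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                for (int i = 0; i < putsPerThread; i++) {
                    cache.put("$.key" + id + "_" + i, Boolean.TRUE);
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return cache.size();
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Object> cache = newSynchronizedCache();
        int size = fillConcurrently(cache, 8, 100_000);
        System.out.println("final size = " + size + " (limit " + CACHE_SIZE + ")");
    }
}
```

With the synchronized wrapper the size stays at CACHE_SIZE however many threads insert. Passing an unwrapped HashCache to the same helper is the unsafe case described in this issue: its behavior is undefined and may include the unbounded growth seen in the heap dump.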
Steps to reproduce:
- Run multiple queries concurrently that use get_json_object on small input data (so that they execute in HS2 local mode)
- Take a JVM heap dump and analyze it
test scenario
Multiple queries run with get_json_object on small input data (so that they execute in HS2 local mode).

1.hql :
SELECT get_json_object(body, '$.fileSize'), get_json_object(body, '$.ps_totalTimeSeconds'), get_json_object(body, '$.totalTimeSeconds') FROM xxx.tttt WHERE part_hour='2016040105'

2.hql :
SELECT get_json_object(body, '$.fileSize'), get_json_object(body, '$.ps_totalTimeSeconds'), get_json_object(body, '$.totalTimeSeconds') FROM xxx.tttt WHERE part_hour='2016040106'

3.hql :
SELECT get_json_object(body, '$.fileSize'), get_json_object(body, '$.ps_totalTimeSeconds'), get_json_object(body, '$.totalTimeSeconds') FROM xxx.tttt WHERE part_hour='2016040107'

4.hql :
SELECT get_json_object(body, '$.fileSize'), get_json_object(body, '$.ps_totalTimeSeconds'), get_json_object(body, '$.totalTimeSeconds') FROM xxx.tttt WHERE part_hour='2016040108'

run.sh :
t_cnt=0
while true
do
  echo "query executing..."
  for i in 1 2 3 4
  do
    beeline -u jdbc:hive2://localhost:10000 -n hive --silent=true -f $i.hql > $i.log 2>&1 &
  done
  wait
  t_cnt=`expr $t_cnt + 1`
  echo "query count : $t_cnt"
  sleep 2
done

jvm heap dump & analyze :
jmap -dump:format=b,file=hive.dmp $PID
jhat -J-mx48000m -port 8080 hive.dmp &
Finally, I have attached our patch.
Attachments
Issue Links
- duplicates HIVE-16196 UDFJson having thread-safety issues (Resolved)