[SPARK-22264] History server will be unavailable if there is an event log file with large size - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.1.0
Fix Version/s: None
Component/s: Deploy
Labels: None

Description

The history server becomes unavailable when the event log directory contains a very large event log file.
"Large" here means that replaying the file takes too long.
We can fix this by adding a timeout for event log replay.
Here is an example:
No application submitted after the history server restart can be opened in the history UI.

In the event log directory we found an event log file larger than 130 GB:

hadoop *144149840801* 2017-08-29 14:03 /spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress

The jstack output and the server log show that the replay task is blocked on this event log.
Server log:

2017-10-12,16:00:12,151 INFO org.apache.spark.deploy.history.FsHistoryProvider: Replaying log path: hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: Begin to replay hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!

jstack:

"log-replay-executor-0" daemon prio=10 tid=0x00007f0f48014800 nid=0x6160 runnable [0x00007f0f4f6f5000]
   java.lang.Thread.State: RUNNABLE
        at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
        at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
        at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
        at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
        - locked <0x00000005f0096948> (a java.io.InputStreamReader)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:154)
        at java.io.BufferedReader.readLine(BufferedReader.java:317)
        - locked <0x00000005f0096948> (a java.io.InputStreamReader)
        at java.io.BufferedReader.readLine(BufferedReader.java:382)
        at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
        at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
        at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79)
        at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
        at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776)
        at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584)
        at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
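
The timeout proposed in the description could look roughly like the sketch below. It is only an illustration of the idea, not the change from the linked pull request and not Spark code: the class `ReplayTimeoutSketch`, the method `replayWithTimeout`, and the `-1` return value are all hypothetical names chosen for this example. The replay runs on a separate thread, the caller waits a bounded time, and the task is cancelled if it does not finish. The sketch assumes the replay loop can check the interrupt flag between events; a thread stuck inside a single native call would not react until that call returns.

```java
import java.util.Iterator;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: none of these names are Spark APIs.
final class ReplayTimeoutSketch {

    /**
     * Consumes the events on a worker thread and waits at most timeoutMillis.
     * Returns the number of events replayed, or -1 if the timeout was hit.
     */
    static long replayWithTimeout(Iterator<String> events, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        // Daemon thread, so an abandoned replay cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "log-replay-sketch");
            t.setDaemon(true);
            return t;
        });

        Callable<Long> replayTask = () -> {
            long replayed = 0;
            while (events.hasNext()) {
                // Cooperative cancellation: stop between events once interrupted.
                if (Thread.currentThread().isInterrupted()) {
                    throw new InterruptedException("replay cancelled");
                }
                events.next(); // a real replay would parse and post the event here
                replayed++;
            }
            return replayed;
        };

        Future<Long> future = executor.submit(replayTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Give up on this log so that other logs can still be processed.
            future.cancel(true);
            return -1L;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With a short list of events the call returns the event count; with an iterator that never ends it returns `-1` once the timeout expires, which is the behavior the history server would need in order to skip an oversized log instead of blocking on it.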

Attachments

not-found.png
12/Oct/17 10:01
12 kB
zhoukang

Issue Links

duplicates

SPARK-20656 Incremental parsing of event logs in SHS

Resolved

links to

[Github] Pull Request #19482 (caneGuy)

People

Assignee: Unassigned

Reporter: zhoukang

Votes: 0

Watchers: 2

Dates

Created: 12/Oct/17 09:57

Updated: 17/Oct/17 07:00

Resolved: 17/Oct/17 07:00