Spark / SPARK-22264

History server will be unavailable if there is an event log file with large size

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: Deploy
    • Labels: None

    Description

      The history server becomes unavailable when the event log directory contains a very large event log file.
      "Large" here means that replaying the file takes too long.
      One possible fix is to add a timeout for event log replay.
      Here is an example:
      After the history server was restarted, the history UI could not be opened for any application submitted after the restart.

      In the event log directory, we found an event log file larger than 130 GB:

      hadoop *144149840801* 2017-08-29 14:03 /spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
      

      The jstack output and the server log show that the replay task is stuck on this event log.
      Server log:

      2017-10-12,16:00:12,151 INFO org.apache.spark.deploy.history.FsHistoryProvider: Replaying log path: hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
      2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: Begin to replay hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
      

      jstack output:

      "log-replay-executor-0" daemon prio=10 tid=0x00007f0f48014800 nid=0x6160 runnable [0x00007f0f4f6f5000]
         java.lang.Thread.State: RUNNABLE
              at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
              at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
              at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
              at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
              at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
              at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
              at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
              - locked <0x00000005f0096948> (a java.io.InputStreamReader)
              at java.io.InputStreamReader.read(InputStreamReader.java:184)
              at java.io.BufferedReader.fill(BufferedReader.java:154)
              at java.io.BufferedReader.readLine(BufferedReader.java:317)
              - locked <0x00000005f0096948> (a java.io.InputStreamReader)
              at java.io.BufferedReader.readLine(BufferedReader.java:382)
              at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
              at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
              at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
              at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79)
              at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
              at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776)
              at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584)
              at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
              at java.util.concurrent.FutureTask.run(FutureTask.java:262)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:745)
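
The timeout proposed in the description could look roughly like the following. This is a minimal sketch, not Spark code: `ReplayTimeoutSketch`, `ReplayTask`, and `replayWithTimeout` are hypothetical names introduced here for illustration, and `Thread.sleep` stands in for the actual replay work. It only shows the general pattern of bounding a task with `Future.get(timeout, unit)` and cancelling it when the deadline passes.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ReplayTimeoutSketch {

    /** Hypothetical stand-in for replaying one event log; not a Spark API. */
    public interface ReplayTask {
        void replay() throws Exception;
    }

    /**
     * Submits the task to the executor and waits at most timeoutMs for it.
     * Returns true if the task finished in time, false if it was cancelled
     * because the timeout expired.
     */
    public static boolean replayWithTimeout(ExecutorService executor,
                                            ReplayTask task,
                                            long timeoutMs)
            throws InterruptedException, ExecutionException {
        // The lambda returns a value, so it is submitted as a Callable,
        // which is allowed to throw checked exceptions.
        Future<?> future = executor.submit(() -> {
            task.replay();
            return null;
        });
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // cancel(true) interrupts the worker thread. The task must react
            // to interruption for the thread to actually be released.
            future.cancel(true);
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Simulated small log: finishes well within the timeout.
            boolean small = replayWithTimeout(executor, () -> Thread.sleep(10), 1000);
            // Simulated huge log: exceeds the timeout and is cancelled.
            boolean huge = replayWithTimeout(executor, () -> Thread.sleep(60_000), 100);
            System.out.println("small log replayed: " + small);
            System.out.println("huge log replayed: " + huge);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

One caveat with this approach: `Future.cancel(true)` only delivers an interrupt. A replay loop that never checks the interrupt flag, such as one spending its time in decompression as in the stack trace above, would keep running after the timeout, so a real fix would presumably also need the loop itself to check for interruption between events.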
      

      Attachments

        1. not-found.png
          12 kB
          zhoukang

            People

              Assignee: Unassigned
              Reporter: cane zhoukang
              Votes: 0
              Watchers: 2
