Spark / SPARK-47759

Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate time string


Details

    • Patch, Important

    Description

      Symptom

      We have observed that our Spark apps occasionally get stuck with an unexpected stack trace when reading/parsing a legitimate time string. Note that after we manually killed the stuck app instances, the retry went through on the same cluster without requiring any app code change.

       

      [Stack Trace 1] The stack trace doesn't make sense, since 120s is a legitimate time string. The app runs on emr-7.0.0 with the Spark 3.5.0 runtime.

      Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time must be specified as seconds (s), milliseconds (ms), microseconds (us), minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
      Failed to parse time string: 120s
          at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
          at org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
          at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
          at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
          at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
          at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33)
          at org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
          at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
          at org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
          at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
          at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
          at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
          at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
          at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
          at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
          at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
          at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
          at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
          at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
          at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
          at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
          at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
          at org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
          at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
          at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
          at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
          at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
          at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
          at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
          at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
          at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
          at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
          at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.base/java.lang.Thread.run(Thread.java:840) 

       

      [Stack Trace 2] The stack trace is slightly different, but it also originates from reading/parsing a time string, on the line "timeout = { Utils.timeStringAsSeconds(finalProp._2).seconds }" in RpcTimeout.scala. The app runs on emr-7.0.0 with the Spark 3.5.0 runtime.

      24/04/02 14:20:14 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
      java.lang.NullPointerException: Cannot invoke "scala.concurrent.duration.Duration$.apply(long, java.util.concurrent.TimeUnit)" because "scala.concurrent.duration.Duration$.MODULE$" is null
          at scala.concurrent.duration.package$DurationLong$.durationIn$extension(package.scala:60) ~[scala-library-2.12.17.jar:?]
          at scala.concurrent.duration.package$DurationLong.durationIn(package.scala:60) ~[scala-library-2.12.17.jar:?]
          at scala.concurrent.duration.DurationConversions.seconds(DurationConversions.scala:37) ~[scala-library-2.12.17.jar:?]
          at scala.concurrent.duration.DurationConversions.seconds$(DurationConversions.scala:37) ~[scala-library-2.12.17.jar:?]
          at scala.concurrent.duration.package$DurationLong.seconds(package.scala:59) ~[scala-library-2.12.17.jar:?]
          at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:689) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:66) ~[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:274) [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111) [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140) [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53) [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) [netty-handler-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102) [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192) [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.96.Final.jar:4.1.96.Final]
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.96.Final.jar:4.1.96.Final]
          at java.lang.Thread.run(Thread.java:840) [?:?] 

       

      [Stack Trace 3] The stack trace doesn't make sense either, since 3s is a legitimate time string. The app runs on emr-6.9.0 with the Spark 3.3.0 runtime. Given that RpcUtils$.retryWaitMs has since been refactored away in this commit, this case is out of scope for the PR.

      24/03/01 21:04:24 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 7358995660930182634
      java.lang.NumberFormatException: Time must be specified as seconds (s), milliseconds (ms), microseconds (us), minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
      Failed to parse time string: 3s
          at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:252) ~[spark-network-common_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
          at org.apache.spark.internal.config.ConfigHelpers$.timeFromString(ConfigBuilder.scala:56) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
          at org.apache.spark.internal.config.ConfigBuilder.$anonfun$timeConf$1(ConfigBuilder.scala:256) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
          at org.apache.spark.internal.config.ConfigBuilder.$anonfun$timeConf$1$adapted(ConfigBuilder.scala:256) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
          at org.apache.spark.internal.config.ConfigEntryWithDefaultString.readFrom(ConfigEntry.scala:206) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
          at org.apache.spark.SparkConf.get(SparkConf.scala:261) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
          at org.apache.spark.util.RpcUtils$.retryWaitMs(RpcUtils.scala:46) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
          at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:34) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
          at org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
          at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
          at org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
          at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
          at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163) ~[spark-network-common_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
      
          ...
      
          at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
          at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
          at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
          at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_402] 

      What's going on?

      "Failed to parse time string" indicates that JavaUtils.timeStringAs throws the following exception even when the input is a legitimate time string such as 120s or 3s:

      Matcher m = TIME_PATTERN.matcher(lower);
      if (!m.matches()) {
        throw new NumberFormatException("Failed to parse time string: " + str);
      } 
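To confirm that such inputs are well-formed, here is a minimal, self-contained sketch of the documented parsing behavior. It is not Spark's actual code (the pattern and method names below are illustrative stand-ins for JavaUtils.timeStringAs); it only demonstrates that strings like "120s" and "3s" match a pattern of the documented shape, so the failures above cannot be explained by the inputs themselves:

```java
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative re-implementation sketch, NOT Spark's JavaUtils.
public class TimeStringSketch {
  // Mirrors the documented format: an integer followed by an optional unit suffix.
  private static final Pattern TIME_PATTERN = Pattern.compile("(-?[0-9]+)([a-z]+)?");

  // Parses a time string into seconds; "120s" -> 120, "3s" -> 3.
  public static long parseSeconds(String str) {
    Matcher m = TIME_PATTERN.matcher(str.trim().toLowerCase());
    if (!m.matches()) {
      throw new NumberFormatException("Failed to parse time string: " + str);
    }
    long val = Long.parseLong(m.group(1));
    String suffix = m.group(2);
    if (suffix == null || suffix.equals("s")) {
      return val;
    } else if (suffix.equals("ms")) {
      return TimeUnit.MILLISECONDS.toSeconds(val);
    }
    throw new NumberFormatException("Unsupported time suffix: " + suffix);
  }
}
```

Since the same strings parse deterministically here, the intermittent failures in the stack traces point to a problem in the runtime state of the stuck JVM (note Stack Trace 2's NullPointerException on scala.concurrent.duration.Duration$.MODULE$ being null), not to the inputs.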

      Proposed mitigation

      • Wherever a caller can fall back to a default value in a given time unit (e.g., seconds, milliseconds), we should swallow runtime exceptions thrown by the regex-matching and unit-conversion logic, especially on critical paths such as RpcUtils$.askRpcTimeout in the stack traces above.
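The mitigation above can be sketched as a defensive wrapper. This is a hedged illustration, not the proposed patch: the class and method names are hypothetical, and the inline parsing is a stand-in for a call to JavaUtils.timeStringAsSec:

```java
// Hypothetical sketch of the proposed fallback behavior.
public class SafeTimeParse {
  // Returns the parsed value in seconds, or defaultSeconds if parsing throws.
  public static long parseSecondsOrDefault(String str, long defaultSeconds) {
    try {
      // Stand-in for JavaUtils.timeStringAsSec(str): accept "<n>" or "<n>s".
      String lower = str.trim().toLowerCase();
      if (lower.endsWith("s")) {
        return Long.parseLong(lower.substring(0, lower.length() - 1));
      }
      return Long.parseLong(lower);
    } catch (RuntimeException e) {
      // NumberFormatException and NullPointerException are both
      // RuntimeExceptions, so both observed failure modes fall through
      // to the caller-supplied default instead of killing the RPC path.
      return defaultSeconds;
    }
  }
}
```

With this shape, a transient parsing failure on a path like RpcUtils$.askRpcTimeout would degrade to the configured default timeout rather than leaving the app stuck.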


          People

            Assignee: Bo Xiong (xiongbo)
            Reporter: Bo Xiong (xiongbo)


              Time Tracking

                Original Estimate: 4h
                Remaining Estimate: 4h
                Time Spent: Not Specified
