SPARK-47959

Improve GET_JSON_OBJECT performance on executors running multiple tasks

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      We have a Spark executor that runs 32 worker threads in parallel. The query is a simple SELECT with several `GET_JSON_OBJECT` UDF calls.

      We noticed that more than 80% of the worker threads are blocked with the following stack trace:

       

      com.fasterxml.jackson.core.util.InternCache.intern(InternCache.java:50) - blocked on java.lang.Object@7529fde1
      com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.addName(ByteQuadsCanonicalizer.java:947)
      com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2482)
      com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:2339)
      com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1870)
      com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1825)
      com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
      com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:240)
      org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:383)
      org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:287)
      org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4(jsonExpressions.scala:198)
      org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4$adapted(jsonExpressions.scala:196)
      org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase$$Lambda$8585/1316745697.apply(Unknown Source)
      ...
      

       

      Apparently jackson-core has had this performance bug in versions 2.3 through 2.15, and it is not fixed until version 2.18 (unreleased). The affected code clears the cache inside a synchronized block: https://github.com/FasterXML/jackson-core/blob/fc51d1e13f4ba62a25a739f26be9e05aaad88c3e/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L50 

                  synchronized (lock) {
                      if (size() >= MAX_ENTRIES) {
                          clear();
                      }
                  }
      

       
      The 2.18 code uses a non-blocking tryLock instead: https://github.com/FasterXML/jackson-core/blob/8b87cc1a96f649a7e7872c5baa8cf97909cabf6b/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L59 

                  /* As of 2.18, the limit is not strictly enforced, but we do try to
                   * clear entries if we have reached the limit. We do not expect to
                   * go too much over the limit, and if we do, it's not a huge problem.
                   * If some other thread has the lock, we will not clear but the lock should
                   * not be held for long, so another thread should be able to clear in the near future.
                   */
                  if (lock.tryLock()) {
                      try {
                          if (size() >= DEFAULT_MAX_ENTRIES) {
                              clear();
                          }
                      } finally {
                          lock.unlock();
                      }
                  }   
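
      To make the difference between the two excerpts concrete, here is a minimal, self-contained sketch of a bounded intern cache that supports both clearing strategies. It is not Jackson's actual InternCache: the class name BoundedInternCache, the blocking flag, and the limit of 100 entries are assumptions made for illustration only. In blocking mode every thread that sees a full cache waits on the same monitor; in tryLock mode a thread that cannot get the lock skips the clear and moves on, so the cache may briefly exceed the limit.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for an intern cache; NOT Jackson's real class. */
class BoundedInternCache {
    // Illustrative limit; the real library's value may differ.
    static final int MAX_ENTRIES = 100;

    private final ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
    private final Object monitor = new Object();             // used in blocking mode
    private final ReentrantLock lock = new ReentrantLock();  // used in tryLock mode
    private final boolean blocking;

    BoundedInternCache(boolean blocking) {
        this.blocking = blocking;
    }

    String intern(String input) {
        String cached = map.get(input);
        if (cached != null) {
            return cached;
        }
        if (map.size() >= MAX_ENTRIES) {
            if (blocking) {
                // Older style: every thread that finds the cache full queues here.
                synchronized (monitor) {
                    if (map.size() >= MAX_ENTRIES) {
                        map.clear();
                    }
                }
            } else if (lock.tryLock()) {
                // Newer style: only one thread clears; the others do not wait.
                try {
                    if (map.size() >= MAX_ENTRIES) {
                        map.clear();
                    }
                } finally {
                    lock.unlock();
                }
            }
        }
        String result = input.intern();
        map.put(result, result);
        return result;
    }

    int size() {
        return map.size();
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedInternCache cache = new BoundedInternCache(false);
        Thread[] workers = new Thread[8];
        for (int t = 0; t < workers.length; t++) {
            final int id = t;
            // Each worker interns many distinct names, forcing frequent clears.
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) {
                    cache.intern("field_" + id + "_" + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println("entries after run: " + cache.size());
    }
}
```

      Passing true to the constructor switches the sketch to the synchronized variant, which is the shape that would produce blocked threads like those in the stack trace above when many tasks parse JSON with many distinct field names.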

       

      Potential fixes:

      1. Upgrade to jackson-core 2.18 when it is released.
      2. Follow https://github.com/FasterXML/jackson-core/issues/998 (I don't fully understand the options suggested in that thread yet).
      3. Introduce a new UDF that doesn't depend on jackson-core.

          People

            Assignee: Unassigned
            Reporter: Zheng Shao (zshao)