SPARK-34015: SparkR partition timing summary reports input time correctly


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.2, 3.0.1
    • Fix Version/s: 3.1.1, 3.2.0
    • Component/s: SparkR
    • Labels: None
    • Environment: Observed on CentOS-7 running Spark 2.3.1 and on my Mac running master
    • Flags: Patch

    Description

      When SparkR is run at log level INFO, a summary of how the worker spent its time processing the partition is printed. There is a logic error that over-reports the time spent reading input rows.

      In detail: the variable inputElap is used in the wider context to mark the beginning of reading rows, but in the part changed here it was also reused as a local variable for measuring compute time. As a result, whenever a partition contains more than one group, compute time gets counted as read-input time (see the sketch below); the error is not observable when there is only one group per partition, which is what unit tests produce.
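      The sketch below is illustrative only: it is a simplified reconstruction of the behaviour described above, not the actual worker.R source, and the helper names (elapsedSecs, readGroup, computeGroup, the accumulators) are invented for the example. It shows how reusing inputElap as the per-group compute-start marker leaves it pointing at the start of the last group's compute when the loop ends, so a summary that still reads it as "input reading finished here" over-reports read-input whenever the partition has more than one group.

        # Illustrative sketch only -- a simplified reconstruction, not the actual
        # SparkR worker.R code. readGroup()/computeGroup() and the accumulators
        # are invented names for the example.

        elapsedSecs <- function() proc.time()[["elapsed"]]

        runPartition <- function(nGroups, readGroup, computeGroup) {
          broadcastElap <- elapsedSecs()               # end of broadcast handling
          data <- lapply(seq_len(nGroups), readGroup)  # read all groups' rows
          inputElap <- elapsedSecs()                   # wider meaning: reading done

          computeSecs <- 0
          for (d in data) {
            # Bug: inputElap is reused as the local start-of-compute marker.
            inputElap <- elapsedSecs()
            computeGroup(d)
            computeElap <- elapsedSecs()
            computeSecs <- computeSecs + (computeElap - inputElap)
          }

          # The summary still treats inputElap as "when input reading finished",
          # but after the loop it holds the start of the last group's compute.
          # With several groups, the earlier groups' compute time is therefore
          # counted as read-input; with a single group the figures come out right.
          c(`read-input` = inputElap - broadcastElap, compute = computeSecs)
        }

      A fix along these lines gives the loop its own local marker so that inputElap keeps its wider meaning.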

      For our application, here's what a log entry looks like before these changes were applied:

      20/10/09 04:08:58 WARN RRunner: Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output = 0.020 s, total = 1021.546 s

      This indicates that we are spending more time reading rows than operating on them.

      After these changes, it looks like this:

      20/12/15 06:43:29 WARN RRunner: Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output = 0.045 s, total = 1812.553 s


          People

            Assignee: Tom Howland (WamBamBoozle)
            Reporter: Tom Howland (WamBamBoozle)
            hyukjin.kwon
            Votes: 0
            Watchers: 2
