Beam / BEAM-8838

Apache Beam Metrics Counter giving incorrect count using SparkRunner

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P3
    • Resolution: Won't Fix
    • Affects Version/s: 2.13.0, 2.14.0, 2.16.0
    • Fix Version/s: Missing
    • Component/s: runner-spark
    • Labels: None
    • Environment: Cloudera Express 6.2.0
      Java Version: 1.8.0_181
      Spark 2.4.0-cdh6.2.0
      1 master node and 3 data nodes (64 cores, 128 GB RAM)
      --driver-memory "2g" --num-executors "6" --executor-cores "3"
    • Flags: Important

    Description

      I have source and target CSV files with 10 million records and 250 columns. I am running an Apache Beam pipeline that joins all columns from the source and target files. When I run it on a Spark cluster, the pipeline completes with no exceptions, but the Beam metrics counter for the join returns double the expected count when the following Spark property is used: --executor-memory "2g". When I increase --executor-memory to 11g, it returns the correct count.
      The count doubles only when I write the results to a file; if I do not write them out, the counts are correct.
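
The report does not identify a root cause. One hypothesis that would fit the symptoms, and that is not confirmed anywhere in this issue, is that with less executor memory the join output is recomputed instead of reused when the results are also written to a file, so the counting code runs twice per record. The sketch below illustrates only that general effect. It is plain Java and does not use Beam or Spark APIs; the class, method, and variable names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Plain-Java illustration only; it does not call Beam or Spark.
class CounterRecomputeSketch {

    // Simulates a step whose output has two downstream consumers
    // (for example, a count and a file write) and returns the counter value.
    static long countWithTwoConsumers(List<String> records, boolean cached) {
        // Stand-in for a metrics counter incremented inside the join step.
        AtomicLong joinedRecords = new AtomicLong();

        // The "join" step: increments the counter once per record it processes.
        Supplier<List<String>> joinStep = () -> records.stream()
                .map(r -> {
                    joinedRecords.incrementAndGet();
                    return r + ":joined";
                })
                .collect(Collectors.toList());

        if (cached) {
            // The output is materialized once and reused by both consumers.
            List<String> output = joinStep.get();
            consume(output);
            consume(output);
        } else {
            // The output is not retained, so each consumer re-runs the step.
            consume(joinStep.get());
            consume(joinStep.get());
        }
        return joinedRecords.get();
    }

    // Hypothetical downstream consumer; it only reads the output.
    static int consume(List<String> output) {
        return output.size();
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println("cached:     " + countWithTwoConsumers(records, true));
        System.out.println("recomputed: " + countWithTwoConsumers(records, false));
    }
}
```

With five input records, the cached path reports 5 and the recomputed path reports 10. Whether this is what happens in the pipeline described here would need to be checked against the Spark stage and task information for the job.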

      Note : https://stackoverflow.com/questions/59032734/apache-beam-metrics-counter-giving-incorrect-count-using-sparkrunner?noredirect=1#comment104344657_59032734

          People

            Assignee: Unassigned
            Reporter: ghosh kunal
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: