Spark / SPARK-28843

Set OMP_NUM_THREADS to executor cores to reduce Python memory consumption


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.3, 2.4.3, 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark
    • Release Note:
      PySpark workers now set the environment variable OMP_NUM_THREADS (if not already set) to the number of cores used by an executor (spark.executor.cores). When unset, it defaulted to the total number of VM cores. This avoids excessively large OpenMP thread pools when using, for example, numpy.
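
      A minimal sketch of the worker-side behavior described above; the function name and the way the executor core count is passed in are illustrative, not Spark's actual implementation:

          import os

          def configure_omp_threads(executor_cores):
              """Illustrative sketch: cap OpenMP parallelism for a Python worker.

              Respects an OMP_NUM_THREADS value that is already present and
              otherwise falls back to the executor's core count
              (spark.executor.cores), so the OpenMP pool no longer defaults to
              one thread per VM core.
              """
              if executor_cores and "OMP_NUM_THREADS" not in os.environ:
                  os.environ["OMP_NUM_THREADS"] = str(executor_cores)

      Because an existing value is not overridden, a different limit can still be supplied explicitly, for example by setting the variable in the executor environment (e.g. via spark.executorEnv.OMP_NUM_THREADS).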

    Description

      While testing hardware with more cores, we found that the amount of memory required by PySpark applications increased and tracked the problem to importing numpy. The numpy issue is https://github.com/numpy/numpy/issues/10455

      NumPy uses OpenMP, which starts a thread pool with one thread per core on the machine (and does not respect cgroups). When we set this limit lower, we see a significant reduction in memory consumption.

      This parallelism setting should therefore match the number of cores allocated to the executor, not the total number of cores available on the machine.
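
      As a rough illustration of the effect outside Spark (with 4 standing in for spark.executor.cores), the variable has to be in the environment before numpy is first imported, because the OpenMP runtime pulled in by numpy's BLAS sizes its thread pool at initialization:

          import os

          # Must be set before numpy is imported: the OpenMP runtime reads
          # OMP_NUM_THREADS when it initializes its thread pool.
          os.environ.setdefault("OMP_NUM_THREADS", "4")  # illustrative value

          import numpy as np

          # BLAS-backed work now uses at most 4 OpenMP threads rather than one
          # per machine core, which also shrinks the per-thread memory footprint.
          a = np.random.rand(2000, 2000)
          _ = a @ a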


People

    Assignee: Ryan Blue (rdblue)
    Reporter: Ryan Blue (rdblue)
    Votes: 0
    Watchers: 1

Dates

    Created:
    Updated:
    Resolved: