SPARK-32141: Repartition leads to out of memory


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.4.4
    • Fix Version/s: None
    • Component/s: EC2
    • Labels: None

    Description

      We have an application that aggregates on 7 columns. To avoid an extra shuffle, we repartition on those same 7 columns before aggregating. This works well for 1 to 4 TB of data, but above 4 TB it fails with OOM or runs out of disk space.

      Is there a better approach to reduce the shuffle? For our biggest dataset, the Spark job has never completed with repartition. We are out of options.

      We have a 24-node cluster of r5.24xlarge machines with 1 TB of disk. Our shuffle partition count (spark.sql.shuffle.partitions) is set to 6912.
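
      For reference, a minimal Scala sketch of the pattern described above. The input/output paths, the column names c1 through c7, and the count aggregate are placeholders (the report does not name them); the shuffle-partition count mirrors the 6912 mentioned above.

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.{col, count, lit}

        object SevenColumnAggregation {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("seven-column-aggregation")
              .config("spark.sql.shuffle.partitions", "6912") // value from the report
              .getOrCreate()

            // Placeholder input path and grouping columns.
            val df = spark.read.parquet("s3://bucket/input/")
            val keys = Seq("c1", "c2", "c3", "c4", "c5", "c6", "c7").map(col)

            // Repartitioning by the grouping keys shuffles the data once; the
            // aggregation then reuses that partitioning instead of shuffling again.
            // This single full shuffle grows with input size and is the step that
            // can exhaust executor memory or local disk at multi-TB scale.
            val result = df
              .repartition(6912, keys: _*)
              .groupBy(keys: _*)
              .agg(count(lit(1)).as("cnt")) // placeholder aggregate

            result.write.parquet("s3://bucket/output/")
            spark.stop()
          }
        }

      Note that the planner only reuses the repartitioning when the repartition expressions match the groupBy columns exactly; if they differ, Spark inserts a second exchange.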

    Attachments

    Activity


    People

        Assignee: Unassigned
        Reporter: Lekshmi Nair (lekshminair24)

