Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 2.4.4
- Fix Version/s: None
- Component/s: None
Description
We have an application that aggregates on 7 columns. To avoid shuffles, we tried repartitioning on those same 7 columns before the aggregation. This works well with 1 to 4 TB of data, but once the input exceeds 4 TB the job fails with OOM errors or runs out of disk space.
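A minimal sketch of the pattern we are using; the paths and column names (c1..c7, value) are hypothetical stand-ins:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object AggregationJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("seven-key-aggregation").getOrCreate()

    // Hypothetical input path and key/value column names.
    val df = spark.read.parquet("s3://bucket/input")
    val keys = (1 to 7).map(i => col(s"c$i"))

    val aggregated = df
      .repartition(keys: _*)         // hash-partition rows by the 7 grouping keys
      .groupBy(keys: _*)             // the groupBy can then reuse that partitioning
      .agg(sum("value").as("total")) // hypothetical aggregate

    aggregated.write.parquet("s3://bucket/output")
  }
}
```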
Is there a better approach to reducing the shuffle? For our biggest dataset, the Spark job has never completed with repartitioning. We are out of options.
We have a 24-node cluster of r5.24xlarge machines with 1 TB of disk each, and spark.sql.shuffle.partitions is set to 6912.
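For reference, the shuffle partition count is applied as below; only the value 6912 comes from our setup, the surrounding code is illustrative:

```scala
// 6912 = 3 x 2304 total vCPUs (24 nodes x 96 vCPUs per r5.24xlarge);
// reading it as a multiple of cores is our assumption, only 6912 is the configured value.
spark.conf.set("spark.sql.shuffle.partitions", "6912")
```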