Details
Question
Status: Resolved
Trivial
Resolution: Invalid
3.3.1
None
None
Description
Does caching benefit a Spark job that contains only a single action? IIRC, Spark already optimizes shuffles by persisting their output to disk.
I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself.
FWIW, I am asking specifically in the context of the DataFrame API. The StorageLevel allowed in my case is DISK_ONLY, i.e. I am not looking to speed things up by caching data in memory.
To rephrase: is DISK_ONLY caching better than, or the same as, shuffle checkpointing in the context of a single action?