[SPARK-43408] Spark caching in the context of a single job

Details

    • Type: Question
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Invalid
    • Affects Version/s: 3.3.1
    • Fix Version/s: None
    • Component/s: Shuffle
    • Labels: None

    Description

      Does caching benefit a Spark job that contains only a single action? IIRC, Spark already optimizes shuffles by persisting their output to disk.

      I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself.

      FWIW, I am asking specifically about the DataFrame API. The only StorageLevel allowed in my case is DISK_ONLY, i.e. I am not looking to speed things up by caching data in memory.

      To rephrase: in the context of a single action, is DISK_ONLY caching better than, or the same as, shuffle checkpointing?

          People

            Assignee: Unassigned
            Reporter: Faiz Halde (haldefaiz)
            Votes: 0
            Watchers: 2
