Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-21942

DiskBlockManager crashing when a root local folder has been externally deleted by OS

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Won't Fix
    • 1.6.1, 1.6.2, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.2.0
    • None
    • Spark Core

    Description

      DiskBlockManager has a notion of a "scratch" local folder(s), which can be configured via spark.local.dir option, and which defaults to the system's /tmp. The hierarchy is two-level, e.g. /blockmgr-XXX.../YY, where the YY part is a hash bit, to spread files evenly.

      Function DiskBlockManager.getFile expects the top level directories (blockmgr-XXX...) to always exist (they get created once, when the spark context is first created), otherwise it would fail with a message like:

      ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY
      

      However, this may not always be the case.

      In particular, if it's the default /tmp folder, there can be different strategies of automatically removing files from it, depending on the OS:

      • on the boot time
      • on a regular basis (e.g. once per day via a system cron job)
      • based on the file age

      The symptom is that after the process (in our case, a service) using spark is running for a while (a few days), it may not be able to load files anymore, since the top-level scratch directories are not there and DiskBlockManager.getFile crashes.

      Please note that this is different from people arbitrarily removing files manually.
      We have both the facts that /tmp is the default in the spark config and that the system has the right to tamper with its contents, and will do it with a high probability, after some period of time.

      Attachments

        Activity

          People

            Unassigned Unassigned
            rshest Ruslan Shestopalyuk
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: