SPARK-28025: HDFSBackedStateStoreProvider should not leak .crc files


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.3
    • Fix Version/s: 2.4.4, 3.0.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      The HDFSBackedStateStoreProvider, when used with the default CheckpointFileManager, leaves '.crc' files behind. One .crc file is created for each `atomicFile` operation of the CheckpointFileManager.

      Over time, the number of files becomes very large. The state store file system keeps growing in size and, in our case, file system performance deteriorates.
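
One way to check whether the .crc files on a volume are leftovers is to look for checksum files whose data file no longer exists. The following is a minimal sketch, not part of the original report. It assumes the sidecar files follow Hadoop's checksum naming (`.<name>.crc` next to `<name>`); the function name `find_orphan_crc` is hypothetical.

```shell
# Hypothetical helper: list .crc files whose data file no longer exists.
# Assumption: sidecars are named ".<name>.crc" and sit next to "<name>".
find_orphan_crc() {
  local root="${1:-.}"
  find "$root" -type f -name ".*.crc" | while IFS= read -r crc; do
    dir=$(dirname "$crc")
    base=$(basename "$crc")
    # Strip the leading "." and the trailing ".crc" to recover the data file name.
    data="${base#.}"
    data="${data%.crc}"
    # Print the checksum file only when its data file is missing.
    [ -e "$dir/$data" ] || printf '%s\n' "$crc"
  done
}
```

Running `find_orphan_crc . | wc -l` from the volume root gives the number of orphaned checksum files under these assumptions.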

      Here's a sample from one of our Spark storage volumes after 2 days of execution (4 stateful streaming jobs, each in a different sub-directory):

        # Total files in PVC (used for checkpoints and state store)
        $ find . | wc -l
        431796

        # .crc files
        $ find . -name "*.crc" | wc -l
        418053
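
The same counts can be broken down per job to see where the files accumulate. This is a minimal sketch, assuming each job's checkpoint lives in a direct sub-directory of the volume root, as described above; the function name `crc_count_by_dir` is hypothetical.

```shell
# Hypothetical helper: for each direct sub-directory of the given root
# (default: current directory), print "<crc files> <all files> <dir>".
crc_count_by_dir() {
  local root="${1:-.}"
  local dir crc total
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    # Arithmetic expansion strips any padding that wc adds on some platforms.
    crc=$(( $(find "$dir" -type f -name "*.crc" | wc -l) ))
    total=$(( $(find "$dir" -type f | wc -l) ))
    printf '%d %d %s\n' "$crc" "$total" "${dir%/}"
  done
}
```

Unlike the plain `find . | wc -l` above, this counts regular files only, so the totals will be slightly lower than a count that includes directories.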

      With each .crc file taking one storage block, the used storage runs into gigabytes.
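
As a rough check of that estimate, the arithmetic can be worked through with the counts from the sample. The block size is an assumption: the report does not state it, so 4 KiB is used here purely for illustration.

```shell
# Counts taken from the sample in the report.
total_files=431796
crc_files=418053
# Assumption: 4 KiB per storage block (not stated in the report).
block_bytes=4096

crc_bytes=$(( crc_files * block_bytes ))
# Integer arithmetic only: share in per mille, size in MiB (both rounded down).
crc_share_permille=$(( crc_files * 1000 / total_files ))
crc_mib=$(( crc_bytes / 1024 / 1024 ))

echo "crc share: ${crc_share_permille} per mille of all files"
echo "crc storage: ${crc_mib} MiB at ${block_bytes} bytes per file"
```

Under that assumption, the .crc files make up about 96.8% of all entries and occupy roughly 1.6 GiB; a larger block size scales the storage figure proportionally.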

      These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, shows serious performance deterioration with this large number of files:

      DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms

       


            People

              Assignee: kabhwan Jungtaek Lim
              Reporter: gmaas Gerard Maas