Apache NiFi / NIFI-5157

ListSFTP for Massive Folders (without freezing)

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: Core Framework

    Description

      Currently, if ListSFTP is pointed at a folder containing tens of millions of files and the Primary Node has only 32 GB of RAM, creating that many flowfiles (above, say, 40 million) can freeze the ListSFTP threads, and the Primary Node then has to be restarted.

      This happens when, say, another system sends files to your system and a backlog of tens of millions of files eventually builds up. Recursion doesn't help either, even if the files are separated by folder. Otherwise you'd need some sort of ControlRate-like processor that passes flowfiles into ListSFTP, so that ListSFTP knows when to get files (as opposed to triggering on its own).

      Maybe ListSFTP could get only 500,000 flowfiles at a time from SFTP, or check RAM so that it doesn't try to pull in more than some formula based on available heap memory allows. I found it interesting that GetFile has a lot of these properties while ListFile/ListSFTP/ListHDFS don't.
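The batching idea above can be sketched outside NiFi in a few lines of Python. This is only an illustration, not how ListSFTP works today: `batch_cap`, `list_in_batches`, and every parameter (`bytes_per_entry`, `heap_fraction`, `hard_limit`) are hypothetical names, and the sketch assumes the remote listing can be consumed lazily as an iterator.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_cap(free_heap_bytes: int, bytes_per_entry: int = 1024,
              heap_fraction: float = 0.25, hard_limit: int = 500_000) -> int:
    """Return how many listing entries to hold in memory at once.

    All parameters are hypothetical knobs, not existing ListSFTP properties.
    The cap is the smaller of a fixed limit and what fits into a fraction
    of the free heap, and is never less than 1.
    """
    budget = int(free_heap_bytes * heap_fraction)
    return max(1, min(hard_limit, budget // bytes_per_entry))


def list_in_batches(entries: Iterable[str], cap: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `cap` entries from a lazy listing."""
    it = iter(entries)
    while True:
        batch = list(islice(it, cap))
        if not batch:
            return
        yield batch
```

With 4 GiB of free heap and the default values, the fixed limit of 500,000 wins; with 100 MiB free, the cap drops to 25,600 entries per batch.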

      Also, there seem to be situations where NiFi more or less assumes a stable environment. In unstable ones, with memory hardware failures, SFTP transmission problems, or internet outages, it becomes difficult to recover an ingest or to know where you left off (which might be useful for ListSFTP):

      Batch processing usually requires a system to, say, separate things out into X files/folders that can fit into the RAM of the Primary Node. We may need some kind of feature like SQL transactions, with "commit" and "rollback in case of error". There needs to be an efficient way for small systems to take in large volumes of data without crashing. If crashes are inevitable, some sort of batch transaction should tell you where the ingest left off, so that you don't have to pull the same folder again but only, say, files after File Age = some-number. Also, when you press "view state", I find it funny that you can't edit the state, only delete it.
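A minimal sketch of the commit/rollback idea above, again in plain Python rather than NiFi code. It assumes entries arrive in ascending modification-time order and that a small JSON file stands in for processor state; `load_checkpoint`, `save_checkpoint`, `ingest`, and the `last_modified` key are all hypothetical names chosen for illustration.

```python
import json
import os
import tempfile
from itertools import islice
from typing import Callable, Iterable, List, Tuple

Entry = Tuple[str, float]  # (path, last-modified timestamp in seconds)


def load_checkpoint(path: str) -> float:
    """Return the last committed timestamp, or 0.0 if none was saved."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return float(json.load(fh)["last_modified"])
    except FileNotFoundError:
        return 0.0


def save_checkpoint(path: str, last_modified: float) -> None:
    """Write the checkpoint atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"last_modified": last_modified}, fh)
    os.replace(tmp, path)


def ingest(entries: Iterable[Entry], checkpoint_path: str,
           process_batch: Callable[[List[Entry]], None],
           batch_size: int) -> float:
    """Process entries newer than the checkpoint, one batch at a time.

    Assumes `entries` is sorted by ascending timestamp. The checkpoint is
    advanced only after a batch succeeds ("commit"); if `process_batch`
    raises, the last saved checkpoint stays as it was ("rollback").
    """
    resume_from = load_checkpoint(checkpoint_path)
    committed = resume_from
    newer = (e for e in entries if e[1] > resume_from)
    while True:
        batch = list(islice(newer, batch_size))
        if not batch:
            return committed
        process_batch(batch)
        committed = batch[-1][1]
        save_checkpoint(checkpoint_path, committed)
```

After a crash, calling `ingest` again with the same checkpoint file skips everything at or before the last committed timestamp. One known gap in this sketch: files that share the exact timestamp of the last committed entry but fell into the next batch would be skipped on resume, so a real implementation would also need to track paths at the boundary timestamp.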

      I should be able to log in tomorrow and say, "Oh, my ingest totally collapsed, but at least I know roughly where it left off." This matters especially when WAL recovery is impossible because socket connection issues between nodes (or active site-to-site connections) cause some NiFi nodes to refuse to load or recover their state.

      I would like to be able to customize ListSFTP with properties so that it tracks progress better, even during disaster recovery of the NiFi cluster. Perhaps ListSFTP could accept inputs that use the Expression Language for folder timestamps.

      I always have to place a ControlRate after ListSFTP, but I can never do rate control within ListSFTP itself, where it could be vital. Perhaps I have to write a custom processor?
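As a rough picture of what rate control inside the listing step could look like, here is a fixed-window throttle written as a Python generator. It is a sketch under assumptions, not a NiFi feature: `throttled`, `max_per_interval`, and `interval_seconds` are hypothetical names, and the `clock` and `sleep` parameters exist only so the pause logic can be swapped out or tested.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(entries: Iterable[str], max_per_interval: int,
              interval_seconds: float,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield entries, emitting at most `max_per_interval` per time window."""
    window_start = clock()
    emitted = 0
    for entry in entries:
        if emitted >= max_per_interval:
            # Window is full: wait out the rest of it, then start a new one.
            remaining = interval_seconds - (clock() - window_start)
            if remaining > 0:
                sleep(remaining)
            window_start = clock()
            emitted = 0
        yield entry
        emitted += 1
```

Because the generator pulls from the listing only as fast as it emits, the throttle also limits how quickly entries are read, which is the part a downstream ControlRate cannot do.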

          People

            Assignee: Unassigned
            Reporter: VictoriaAutMors B O
            Votes: 1
            Watchers: 2
