Apache NiFi / NIFI-12241

Add Processors Supporting Efficient Parquet Splitting

Details

    Description

      Add a SplitParquet processor that expects a FlowFile with Parquet content as input and takes the number of records per split as its configuration parameter.

      The processor would generate one flow file per split, each with unmodified content, and would add attributes holding the offsets required to read that split's group of rows from the flow file's content.
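      As an illustration of the split planning described above, here is a minimal sketch. It assumes each split is described by a record offset and a record count; the attribute names `record.offset` and `record.count` and the function `plan_splits` are hypothetical, chosen for this example only.

```python
from typing import Dict, List

# Hypothetical attribute names, used for this sketch only.
OFFSET_ATTR = "record.offset"
COUNT_ATTR = "record.count"


def plan_splits(total_records: int, records_per_split: int) -> List[Dict[str, str]]:
    """Return one attribute map per split; the file content is never copied."""
    if records_per_split <= 0:
        raise ValueError("records_per_split must be positive")
    splits = []
    for offset in range(0, total_records, records_per_split):
        # The last split may hold fewer records than the configured size.
        count = min(records_per_split, total_records - offset)
        # Flow file attributes are strings, so the numbers are stored as text.
        splits.append({OFFSET_ATTR: str(offset), COUNT_ATTR: str(count)})
    return splits


plan = plan_splits(total_records=10, records_per_split=4)
```

      With 10 records and 4 records per split, the plan holds three attribute maps, the last one covering the 2 remaining records.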

      The Parquet Reader would then be improved to accept optional flow file attributes containing this information, so that the reader reads only the required part of the data.
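      The reader-side behaviour could be sketched as follows. This is a simplified stand-in, not the actual reader: it works on an in-memory iterable of records, and the attribute names `record.offset` and `record.count` and the function `read_records` are assumptions made for this example. A real Parquet reader could additionally skip whole row groups instead of iterating past unwanted records.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, Optional


def read_records(
    records: Iterable[dict],
    attributes: Optional[Dict[str, str]] = None,
) -> Iterator[dict]:
    """Yield only the records selected by the optional offset/count attributes."""
    attributes = attributes or {}
    # Both attributes are optional: without them the whole input is read.
    offset = int(attributes.get("record.offset", "0"))
    count_text = attributes.get("record.count")
    stop = None if count_text is None else offset + int(count_text)
    return islice(records, offset, stop)


rows = [{"id": i} for i in range(10)]
selected = list(read_records(rows, {"record.offset": "4", "record.count": "4"}))
everything = list(read_records(rows))
```

      Keeping the attributes optional means existing flows that pass no offsets keep reading the full content.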

      Instead of having something like

      X -> SplitRecord (Parquet / JSON) -> ...

      It'd be something like

      X -> SplitParquet -> ConvertRecord (Parquet / JSON) -> ...

      The goal is to increase the overall efficiency of this operation for extremely large Parquet files (hundreds of GBs). With the second approach, the flow could leverage multi-threading to process a single file, because each split can be converted independently.
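      To illustrate why independent splits allow multi-threading, here is a small sketch that converts several record ranges of one input concurrently. The list `ROWS`, the function `convert_split` and the `(offset, count)` tuples are hypothetical stand-ins; in NiFi the parallelism would come from the downstream processor's concurrent tasks rather than from code like this.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Stand-in for the records of a single large file.
ROWS = list(range(10))


def convert_split(split: Tuple[int, int]) -> List[str]:
    """Convert only the records in the given (offset, count) range."""
    offset, count = split
    return [f"row-{value}" for value in ROWS[offset:offset + count]]


splits = [(0, 4), (4, 4), (8, 2)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns the results in the order of the input splits.
    converted = list(pool.map(convert_split, splits))

flat = [item for part in converted for item in part]
```

      Each worker touches only its own range, so no coordination between splits is needed.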

      The SplitParquet processor should also have a boolean (true/false) property to write zero-content flow files. The existing FetchParquet processor should be enhanced to accept the flow file attributes that provide the offsets. The flow would then look something like

      X -> SplitParquet -> FetchParquet (JSON Writer) -> ...

      This way, a load-balanced connection could be used between SplitParquet and FetchParquet to distribute the work across the nodes of the cluster without transferring a lot of data between them, since only the zero-content flow files and their attributes would be sent.

            People

              takraj Rajmund Takacs

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h 40m