Apache NiFi / NIFI-12843

If record count is set, ParquetRecordReader does not read the whole file

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.25.0, 2.0.0-M2
    • Fix Version/s: 1.26.0, 2.0.0-M3
    • Component/s: Extensions
    • Labels: None

    Description

      Previously, ParquetRecordReader ignored the record.count attribute of the incoming FlowFile. NIFI-12241 changed this, so the reader now reads only the specified number of rows from the record set. If the Parquet file was not produced by a record writer, this attribute is normally not set, and the reader reads the whole file. However, processors that produce a Parquet file while processing record sets may set this attribute to describe the record set the Parquet file was taken from rather than the file's actual content. This leads to incorrect behavior.

      For example, if ConsumeKafka produces a FlowFile from a single Kafka record whose content is a Parquet file with 1000 rows, record.count is set to 1 instead of 1000, because it refers to the Kafka record set. As a result, ParquetRecordReader now reads only the first record of the Parquet file.

      The sole reason the reader was changed to take record.count into account is that the CalculateParquetOffsets processors generate FlowFiles with the same content but different offset and count attributes, each representing a slice of the original, large input. The Parquet reader then treats the large FlowFile as if it were a small one containing only that slice, which makes processing more efficient. There is no need to support files that have a limit (count) but no offset, so changing the reader to take record.count into account only if an offset is also present could be a reasonable fix.
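
      The proposed rule can be sketched roughly as follows. This is only an illustration and not the actual NiFi patch: the class RecordLimitSketch and the method effectiveLimit are hypothetical, and the offset attribute name record.offset is an assumption (only record.count is named in this issue).

```java
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical helper, not part of NiFi: decides how many rows a reader should return.
class RecordLimitSketch {
    static final String RECORD_COUNT = "record.count";
    // Assumed attribute name for the slice offset; the issue does not spell it out.
    static final String RECORD_OFFSET = "record.offset";

    // Returns a row limit only when the FlowFile describes a slice (offset AND count).
    // A count without an offset is ignored, so the whole file is read.
    static OptionalLong effectiveLimit(Map<String, String> attributes) {
        String offset = attributes.get(RECORD_OFFSET);
        String count = attributes.get(RECORD_COUNT);
        if (offset == null || count == null) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Long.parseLong(count));
    }

    public static void main(String[] args) {
        // ConsumeKafka case: count refers to the Kafka record set, no offset present.
        Map<String, String> fromKafka = Map.of(RECORD_COUNT, "1");
        // Slice case: both offset and count are present.
        Map<String, String> slice = Map.of(RECORD_OFFSET, "500", RECORD_COUNT, "250");

        System.out.println("Kafka FlowFile limited: " + effectiveLimit(fromKafka).isPresent());
        System.out.println("Slice FlowFile limited: " + effectiveLimit(slice).isPresent());
    }
}
```

      Under this rule, the ConsumeKafka FlowFile from the example above would be read in full, while FlowFiles carrying both an offset and a count would still be read as slices.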

      Attachments

        1. parquet_reader_usecases.json
          32 kB
          Rajmund Takacs

            People

              Assignee: Rajmund Takacs (takraj)
              Reporter: Rajmund Takacs (takraj)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 50m