Apache Storm / STORM-2218

When using Block Compression in the SequenceFileBolt, some Tuples may be acked before the data is flushed to HDFS


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: storm-hdfs

    Description

      In AbstractHDFSBolt, tuples are acked after calling syncAllWriters(), which basically ends up calling doSync() on every writer. In the case of the SequenceFileWriter, that is the same as calling the hsync() method of SequenceFile.Writer:

      https://github.com/apache/storm/blob/master/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/common/SequenceFileWriter.java#L52
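
      For reference, the doSync() at that link amounts to the following (a paraphrase based on the description above, not copied from the repo):

      // Inside SequenceFileWriter; 'writer' is the wrapped SequenceFile.Writer.
      @Override
      protected void doSync() throws IOException {
          // Current behavior: only hsync() the underlying stream; AbstractHDFSBolt
          // then acks the pending tuples on the assumption the data is durable.
          this.writer.hsync();
      }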

      The problem in the block compression case is that an open compression block is not flushed by hsync(). Instead it is necessary to call the sync() method, which adds a sync marker, compresses the block, and writes it to the output stream that hsync() flushes. This also happens automatically once the compression block reaches a certain size, but we cannot be certain the data has been flushed until we call sync() and then hsync():

      https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/SequenceFile.java#L1549
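
      A minimal Hadoop-side sketch of that behavior (the path, key/value classes, and codec are illustrative assumptions, not taken from the issue): a record appended after the last sync point sits in an in-memory compression block, so only sync() followed by hsync() guarantees it reaches HDFS.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.compress.DefaultCodec;

      public class BlockCompressionSyncDemo {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                      SequenceFile.Writer.file(new Path("hdfs://namenode:8020/tmp/demo.seq")),
                      SequenceFile.Writer.keyClass(LongWritable.class),
                      SequenceFile.Writer.valueClass(Text.class),
                      SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
                  writer.append(new LongWritable(1L), new Text("buffered in the open compression block"));
                  writer.hsync(); // does not flush the still-open compression block
                  writer.sync();  // writes the sync marker and the compressed block to the stream
                  writer.hsync(); // now the appended record is actually flushed to HDFS
              }
          }
      }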

      The easy fix is just to add a call to sync() when the writer is using Block Compression. I'm concerned about the impact that would have on the block size, but I think it is the only way of writing the data reliably in this case.
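
      A minimal sketch of that fix in SequenceFileWriter.doSync(), assuming the writer is told the configured compression type (the compressionType field below is hypothetical, e.g. handed over by SequenceFileBolt when it creates the SequenceFile.Writer):

      // Inside SequenceFileWriter; 'compressionType' is an assumed field, not part of the current class.
      @Override
      protected void doSync() throws IOException {
          if (compressionType == SequenceFile.CompressionType.BLOCK) {
              // Add a sync marker, compress the open block, and write it to the
              // output stream; hsync() alone does not do this.
              this.writer.sync();
          }
          // Flush the stream so the data is durable before the tuples are acked.
          this.writer.hsync();
      }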


          People

            Assignee: Unassigned
            Reporter: Yoel Cabo Lopez
            Votes: 0
            Watchers: 1
