Spark / SPARK-34536

zstd-jni can cause less shuffle data to be read

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.0, 2.4.7
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      Background

      I found a rare case in which some partitions read less shuffle data than they should when zstd compression is used.

      Detail

      I saved both the correct and the incorrect shuffle data and found that the incorrect data was only the leading part of the correct data. I also found that zstd-jni tag 1.3.3-2 has a problem: it can read only the leading part of the data and then exit normally, without reporting an error.

      The ZstdInputStream in zstd-jni (tag 1.3.3-2) may return 0 from a read call. This violates the InputStream contract, under which read returns 0 only when the requested length is 0. Spark wraps the ZstdInputStream in a BufferedInputStream; when the ZstdInputStream read call returns 0, the BufferedInputStream treats the 0 as the end of the stream and stops reading, which can lead to data loss.
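
      The interaction described above can be reproduced without zstd at all. The following is a minimal sketch, assuming a hypothetical ZeroOnceStream that stands in for the faulty ZstdInputStream: it returns 0 from its second bulk read even though data remains. The class names, the 8-byte buffer, and the 4-byte chunk size are illustrative choices, not anything taken from Spark or zstd-jni.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ZeroReadDemo {
    /** Hypothetical stand-in for a stream that breaks the InputStream contract. */
    static class ZeroOnceStream extends InputStream {
        private final InputStream in;
        private int calls = 0;

        ZeroOnceStream(byte[] data) { this.in = new ByteArrayInputStream(data); }

        @Override
        public int read() throws IOException { return in.read(); }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            if (calls == 2) {
                return 0; // contract violation: len > 0, data remains, yet 0 is returned
            }
            return in.read(b, off, Math.min(len, 4)); // hand out at most 4 bytes per call
        }
    }

    /** Counts the bytes that a BufferedInputStream delivers before it reports end of stream. */
    static int readThroughBuffer(byte[] data) {
        try (InputStream s = new BufferedInputStream(new ZeroOnceStream(data), 8)) {
            int n = 0;
            while (s.read() != -1) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[10];
        // The buffer is refilled after 4 bytes; the refill gets 0 and is taken as end of stream.
        System.out.println("read " + readThroughBuffer(data) + " of " + data.length + " bytes");
    }
}
```

      In this sketch the reader sees only the first 4 of 10 bytes and no exception is raised, which matches the symptom in the report: a truncated prefix of the data and a normal exit.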

      zstd-jni issues:

      https://github.com/luben/zstd-jni/issues/159

      zstd-jni commits:
      https://github.com/luben/zstd-jni/commit/7eec5581b8ccb0d98350ad5ba422337eebbbe70e

      zstd-jni fixed this problem in tag 1.4.4-3; the relevant code was attached as a screenshot (see Attachments).
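
      The actual zstd-jni change is in the commit linked above and is not reproduced here. As a generic illustration of how a stream can honor the InputStream contract, the sketch below uses a hypothetical NonZeroReadStream wrapper that retries whenever the underlying read returns 0 for a non-zero length. The wrapper, the ZeroOnceStream test double, and the buffer sizes are all assumptions made for this example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class NonZeroReadDemo {
    /** Hypothetical faulty stream: its second bulk read returns 0 although data remains. */
    static class ZeroOnceStream extends InputStream {
        private final InputStream in;
        private int calls = 0;

        ZeroOnceStream(byte[] data) { this.in = new ByteArrayInputStream(data); }

        @Override
        public int read() throws IOException { return in.read(); }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            if (calls == 2) {
                return 0;
            }
            return in.read(b, off, Math.min(len, 4));
        }
    }

    /** Hypothetical wrapper: never returns 0 when len > 0. */
    static class NonZeroReadStream extends FilterInputStream {
        NonZeroReadStream(InputStream in) { super(in); }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            if (len == 0) {
                return 0; // the only case in which 0 is a legal result
            }
            int n;
            do {
                // Retry until bytes arrive or -1 signals the real end of stream.
                // Caveat: this spins forever if the source keeps returning 0.
                n = in.read(b, off, len);
            } while (n == 0);
            return n;
        }
    }

    /** Counts bytes delivered through a BufferedInputStream, with or without the wrapper. */
    static int readAll(byte[] data, boolean guarded) {
        InputStream source = new ZeroOnceStream(data);
        if (guarded) {
            source = new NonZeroReadStream(source);
        }
        try (InputStream s = new BufferedInputStream(source, 8)) {
            int n = 0;
            while (s.read() != -1) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[10];
        System.out.println("unguarded: " + readAll(data, false) + " bytes");
        System.out.println("guarded:   " + readAll(data, true) + " bytes");
    }
}
```

      With the wrapper in place the BufferedInputStream receives all 10 bytes; without it, the read stops after 4. Fixing the behavior inside the decompression stream itself, as the zstd-jni upgrade does, avoids the need for a wrapper of this kind.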

      So I think it is necessary to upgrade zstd-jni to 1.4.4-3 in Spark 2.4, because Spark 2.4 is widely used in production.

       

      The relevant BufferedInputStream code was attached as a screenshot (see Attachments).

      Attachments

        1. image-2021-02-25-17-51-49-998.png
          325 kB
          haiyangyu
        2. image-2021-02-25-17-50-49-427.png
          398 kB
          haiyangyu

            People

              Assignee: Unassigned
              Reporter: yuhaiyang haiyangyu
              Votes: 0
              Watchers: 2