Hadoop Common / HADOOP-7550

Need for Integrity Validation of RPC


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ipc
    • Labels: RPC CRC

    Description

      Some recent investigation of network packet corruption has shown a need for Hadoop RPC integrity validation beyond the assurances already provided by the 802.3 link-layer CRC and the TCP 16-bit checksum.

      During an unusual occurrence on a 4k-node cluster, we have seen as many as 4 TCP anomalies per second on a single node, sustained over an hour (roughly 14k per hour). A TCP anomaly here is an escaped link-layer packet that results in a TCP checksum failure, a TCP packet out of sequence, or a TCP packet size error.

      According to this paper[*]: http://tinyurl.com/3aue72r
      the TCP 16-bit checksum has an effective undetected-error rate of about 1 in 2^10: roughly 1 in 1024 corrupt packets may escape detection. In fact, what originally alerted us to this issue was seeing failures caused by bit errors in Hadoop traffic. Extrapolating from that paper, one might expect about 14 escaped packet errors per hour on that single node (14,400 anomalies per hour / 1024 ≈ 14). While the error rate above was unusually high due to a broadband aggregate switch issue, the lack of an integrity check on Hadoop RPC makes such corruption hard to discover and makes it difficult to limit any data damage caused by acting on a corrupt RPC message.

      ------
      [*] In case this jira outlives that tinyurl, the IEEE paper cited is: "Performance of Checksums and CRCs over Real Data" by Jonathan Stone, Michael Greenwald, Craig Partridge, Jim Hughes.
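
      As a rough illustration of the kind of end-to-end check being proposed here (and not Hadoop's actual IPC code), the sketch below appends a CRC32 trailer to a serialized RPC payload on the sender and verifies it on the receiver. The class and method names are hypothetical, and java.util.zip.CRC32 simply stands in for whatever checksum an implementation would ultimately choose.

      {code:java}
      import java.nio.ByteBuffer;
      import java.util.zip.CRC32;

      // Illustrative sketch only: wraps a serialized RPC payload with a CRC32
      // trailer so the receiver can detect corruption that escaped the 802.3
      // link-layer CRC and the TCP checksum. Not Hadoop's actual IPC classes.
      public class RpcChecksumSketch {

        // Sender side: append a 4-byte CRC32 of the payload.
        public static byte[] wrap(byte[] payload) {
          CRC32 crc = new CRC32();
          crc.update(payload, 0, payload.length);
          ByteBuffer framed = ByteBuffer.allocate(payload.length + 4);
          framed.put(payload);
          framed.putInt((int) crc.getValue());   // low 32 bits of the CRC
          return framed.array();
        }

        // Receiver side: recompute the CRC and compare it with the trailer.
        public static byte[] unwrap(byte[] framed) {
          if (framed.length < 4) {
            throw new IllegalArgumentException("frame too short to hold a CRC32 trailer");
          }
          int payloadLen = framed.length - 4;
          CRC32 crc = new CRC32();
          crc.update(framed, 0, payloadLen);
          int expected = ByteBuffer.wrap(framed, payloadLen, 4).getInt();
          if ((int) crc.getValue() != expected) {
            // A real implementation would surface this as an RPC-level error so
            // the caller can retry instead of acting on a corrupt message.
            throw new IllegalStateException("RPC payload failed CRC32 integrity check");
          }
          byte[] payload = new byte[payloadLen];
          System.arraycopy(framed, 0, payload, 0, payloadLen);
          return payload;
        }
      }
      {code}

      In practice such a check would more likely live in the IPC connection framing so that every call and response is covered transparently; the snippet only shows the idea of an application-level integrity trailer computed end to end.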

Attachments

Activity

People

    Assignee: Dave Thompson (davet)
    Reporter: Dave Thompson (davet)
    Votes: 0
    Watchers: 11

Dates

    Created:
    Updated: