  1. Hadoop HDFS
  2. HDFS-17281

Add support for reporting RPC round-trip time at the NN.

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: hdfs

    Description

      We have come across a few cases where HDFS clients report very high latencies while the NN-side metrics show no similar trend; from the NN side, the latency metrics look normal. I attached a screenshot that we took during an internal investigation at LinkedIn. In that case, a token management service was reporting an average latency of 1 sec when fetching delegation tokens from our NN, but on the NN side we did not see anything abnormal. The OverallRpcProcessingTime metric we recently added in HDFS-17042 was not sufficient to identify or signal such cases.

      We propose extending the Hadoop IPC header to carry the client-side call creation time to IPC servers, so that the server can compute the round-trip time of each RPC call.
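
      The idea can be sketched roughly as follows. This is only an illustration under stated assumptions: CallHeader, clientCreateTimeMillis and the method names are hypothetical and are not the actual Hadoop IPC classes or header fields, and the timestamps in main are made up. It also assumes that client and server clocks are roughly in sync.

```java
/**
 * Illustrative sketch only. CallHeader, clientCreateTimeMillis and the method
 * names are hypothetical and are not the actual Hadoop IPC classes or fields.
 */
final class RpcRoundTripSketch {

    /** Stand-in for the RPC request header, extended with one extra field. */
    static final class CallHeader {
        final int callId;
        final long clientCreateTimeMillis; // assumed new field, set by the client

        CallHeader(int callId, long clientCreateTimeMillis) {
            this.callId = callId;
            this.clientCreateTimeMillis = clientCreateTimeMillis;
        }
    }

    /** Client side: stamp the header at the moment the call is created. */
    static CallHeader newCall(int callId, long nowMillis) {
        return new CallHeader(callId, nowMillis);
    }

    /** What a metric that starts when the reader thread reads the call would report. */
    static long processingMillis(long readByReaderMillis, long responseSentMillis) {
        return responseSentMillis - readByReaderMillis;
    }

    /**
     * Server side: time from client call creation to response sent.
     * Assumes client and server clocks are roughly in sync; a negative
     * difference caused by clock skew is clamped to zero.
     */
    static long roundTripMillis(CallHeader header, long responseSentMillis) {
        return Math.max(0L, responseSentMillis - header.clientCreateTimeMillis);
    }

    public static void main(String[] args) {
        // Made-up timeline: the call waits 900 ms before a reader picks it up.
        CallHeader header = newCall(1, 10_000L);
        long readByReader = 10_900L;
        long responseSent = 11_000L;
        System.out.println("processing ms: " + processingMillis(readByReader, responseSent)); // 100
        System.out.println("round trip ms: " + roundTripMillis(header, responseSent));        // 1000
    }
}
```

      With this made-up timeline, a metric measured from the reader thread onward reports 100 ms, while the value derived from the client timestamp reports 1000 ms, which is the kind of gap described above.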

       

      Why is OverallRpcProcessingTime not sufficient?

      OverallRpcProcessingTime captures the time from when the reader thread reads the call from the socket to when the response is sent back to the client. As a result, it does not capture the time it takes to transmit the call from the client to the server. Besides, we only have a couple of reader threads to monitor a large number of open connections. It is possible that many connections become ready to read at the same time. The reader thread then has to read each call sequentially, which adds a wait time for many RPC calls. We have also hit the case where the callQueue becomes full (with a total of 25600 requests), so reader threads are blocked from adding new calls to the callQueue. This leads to longer latency for all connections/calls that are ready and waiting to be read by reader threads.
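
      The two effects above can be illustrated with a small, self-contained sketch. It is not Hadoop code: the class and method names are hypothetical, the queue is a 2-entry stand-in for the 25600-entry callQueue, and the 50 µs per-call read time is an arbitrary assumption. A timed offer is used here only so the example terminates; a real reader would stay blocked until a handler frees a slot.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; names and numbers are assumptions, not Hadoop code. */
final class ReaderBacklogSketch {

    /**
     * Wait before the call at the given 1-based position is read, when a single
     * reader reads all ready calls one after another.
     */
    static long waitBeforeReadMicros(int position, long perCallReadMicros) {
        return (long) (position - 1) * perCallReadMicros;
    }

    /**
     * Reader side: try to hand a call to the bounded call queue.
     * Returns false if the queue stayed full for the whole timeout,
     * i.e. the reader was stalled and could not read further calls.
     */
    static boolean tryEnqueue(BlockingQueue<String> callQueue, String call, long timeoutMillis) {
        try {
            return callQueue.offer(call, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> callQueue = new ArrayBlockingQueue<>(2); // tiny stand-in queue
        tryEnqueue(callQueue, "call-1", 10);
        tryEnqueue(callQueue, "call-2", 10);

        // Queue is full: the reader stalls and call-3 is not accepted.
        System.out.println("call-3 enqueued: " + tryEnqueue(callQueue, "call-3", 10)); // false

        // A handler takes one call, freeing a slot for the reader.
        callQueue.poll();
        System.out.println("call-3 enqueued: " + tryEnqueue(callQueue, "call-3", 10)); // true

        // 1000 calls ready at once, 50 us to read each: the last one waits 49950 us.
        System.out.println("wait us: " + waitBeforeReadMicros(1000, 50)); // 49950
    }
}
```

      In both cases the waiting happens before the reader thread has read the call, so it falls outside the window that OverallRpcProcessingTime measures.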

      Ideally, we want to measure the time between when a socket/call becomes ready to read and when the reader thread actually reads it. This would tell us how long a call waits before being read. However, after some searching, we could not find a way to obtain this.

            People

              Assignee: xinglin Xing Lin
              Reporter: xinglin Xing Lin
              Votes: 0
              Watchers: 3
