[THRIFT-5705] C++ TSSLSocket accumulates receive retry counts across read() calls - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.16.0
Fix Version/s: None
Component/s: C++ - Library
Labels:
None

Description

C++ TSocket's max receive retries has existed for 15+ years. In today's TSocket, the retry count starts at zero for a given read() call (or peek() and a couple other locations), and gets incremented if the read hits THRIFT_EAGAIN or THRIFT_EINTR. If it hits max receive retries, then it will throw an exception indicating resources unavailable. The count starts from zero for each read() call.

In ~~THRIFT-4276~~ (https://github.com/apache/thrift/commit/808d143245f4f5c30600fab31cf9db854cbf5b48), the TSSLSocket code changed and it now keeps a readRetryCount_ field. This gets incremented when read() is non-successful on the underlying SSL socket. It get zeroed when there is a successful read. The problem is that this accumulates across read() calls. If multiple read() calls time out, then readRetryCount_ continues to accumulate and can hit the limit, causing future calls to fail. This is a behavior change, and the behavior of today's TSSLSocket doesn't match the behavior from today's TSocket. This created a bug in Apache Impala (~~IMPALA-12114~~) in code that had been running smoothly for years. I think this behavior could be a mistake, and I wanted to discuss it.

Impala Use Case:

For Apache Impala, if a client is idle, we want to be able to time out and tear down the connection. We implement this by having the server side of the connection set a 30 second timeout on the socket, then sit in a loop, waking up periodically, checking whether this session has reached its timeout (which is usually much larger than 30 seconds, 30 seconds is just the granularity of the sleep), then waiting on the socket some more. Here is some pseudocode:

// The transport is a socket, and we set a timeout on the socket
socket->setRecvTimeout(30 seconds);
while (true) {
  try {
    bytes_pending = transport->peek();
  } catch (const TTransportException& exc) {
    // If this is a timeout exception, check idle time
    if (this is a timeout exception) {
       if (idle limit exceeded) {
          break;
       }
    } else {
      rethrow;
    }
  }
}
return bytes_pending;
// If this returns false, the connection is torn down.

We ran into a bug recently when the transport is TBufferedTransport with an internal TSSLSocket. The loop does 5 peek() calls, and then the next peek() call fails even though no timeout was hit. We found that this is a special interaction of TBufferedTransport and TSSLSocket. TBufferedTransport implements peek() with a call to read() on the underlying socket. In this case, each read() call is timing out, and that means that TSSLSocket's readRetryCount_ is accumulating. There is no successful call to zero it out, and it eventually hits the limit and causes subsequent calls to fail.

I don't think having readRetryCount_ accumulate across read() calls makes sense for regular blocking connections.

Attachments

Issue Links

is caused by

THRIFT-4276 Add SSL support to the C++ Nonblocking Server

Closed

is related to

IMPALA-12114 SSL Thrift connections disconnect if idle more than ~150 seconds

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Joe McDonnell

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 09/May/23 04:27

Updated:: 16/Jun/23 17:11