Cassandra / CASSANDRA-8188

don't block SocketThread for MessagingService


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • 2.0.12, 2.1.2
    • None
    • None

    Description

      We have two datacenters, A and B.
      The node in A cannot complete the version handshake with the nodes in B. The logs in A show:

        INFO [HANDSHAKE-/B] 2014-10-24 04:29:49,075 OutboundTcpConnection.java (line 395) Cannot handshake version with B
        TRACE [WRITE-/B] 2014-10-24 11:02:49,044 OutboundTcpConnection.java (line 368) unable to connect to /B
        java.net.ConnectException: Connection refused
            at sun.nio.ch.Net.connect0(Native Method)
            at sun.nio.ch.Net.connect(Net.java:364)
            at sun.nio.ch.Net.connect(Net.java:356)
            at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:623)
            at java.nio.channels.SocketChannel.open(SocketChannel.java:184)
            at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:134)
            at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:119)
            at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:299)
            at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:150)

      The jstack output from the nodes in B shows that SocketThread is blocked in inputStream.readInt(), so it no longer accepts new sockets:

        java.lang.Thread.State: RUNNABLE
            at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
            at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
            at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
            at sun.nio.ch.IOUtil.read(IOUtil.java:197)
            at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
            - locked <0x00000007963747e8> (a java.lang.Object)
            at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:203)
            - locked <0x0000000796374848> (a java.lang.Object)
            at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
            - locked <0x00000007a5c7ca88> (a sun.nio.ch.SocketAdaptor$SocketInputStream)
            at java.io.InputStream.read(InputStream.java:101)
            at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:81)
            - locked <0x00000007a5c7ca88> (a sun.nio.ch.SocketAdaptor$SocketInputStream)
            at java.io.DataInputStream.readInt(DataInputStream.java:387)
            at org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:879)
      

      On the nodes in B, tcpdump shows SYN-ACK retransmissions during the TCP three-way handshake: when the listen backlog queue is full, the TCP implementation drops the client's final ACK, so the server keeps retransmitting its SYN-ACK.

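      On the nodes in B, ss -tl shows "Recv-Q 51 Send-Q 50" for the listening socket. For a listening socket, Recv-Q is the number of connections waiting to be accepted and Send-Q is the backlog limit, so the accept queue is full.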

      On the nodes in B, netstat -s shows that the “SYNs to LISTEN sockets dropped” and “times the listen queue of a socket overflowed” counters are both increasing.
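      The failure mode can be reproduced in miniature. The following is a hypothetical sketch, not Cassandra's code: a single thread accepts a connection and reads a 4-byte header from it (standing in for the readInt() that MessagingService$SocketThread performs) before accepting the next one, with no read timeout. BlockingAcceptDemo and headersReadWithin are made-up names, the header value 42 is arbitrary, and the backlog of 50 is only chosen to mirror the ss output above. One client connects and sends nothing; a second client sends its header, which is never read, because the loop is stuck on the first socket. Any further connections would pile up in the listen backlog until it overflows.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicInteger;

public class BlockingAcceptDemo {
    /**
     * Hypothetical reproduction: a single-threaded accept loop that reads a 4-byte
     * header inline, with no read timeout. A silent client connects first, then a
     * well-behaved client sends its header. Returns how many headers the server
     * read within waitMillis.
     */
    public static int headersReadWithin(long waitMillis) {
        AtomicInteger headersRead = new AtomicInteger();
        // Backlog of 50 mirrors the Send-Q value reported by ss in this issue.
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            Thread acceptLoop = new Thread(() -> {
                while (!server.isClosed()) {
                    try {
                        Socket s = server.accept();
                        DataInputStream in = new DataInputStream(s.getInputStream());
                        in.readInt();          // blocks indefinitely if the peer sends nothing
                        headersRead.incrementAndGet();
                        s.close();
                    } catch (IOException e) {
                        return;                // server closed, or peer closed/reset
                    }
                }
            });
            acceptLoop.setDaemon(true);
            acceptLoop.start();

            // The silent client's connect completes first, so it is accepted first.
            try (Socket silent = new Socket(server.getInetAddress(), server.getLocalPort());
                 Socket good = new Socket(server.getInetAddress(), server.getLocalPort())) {
                new DataOutputStream(good.getOutputStream()).writeInt(42);
                Thread.sleep(waitMillis);
                return headersRead.get();      // stays 0: the loop is stuck on the silent socket
            }
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

      With BlockingAcceptDemo.headersReadWithin(500), the well-behaved client's header should not be read at all, even though it was sent immediately.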

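      The attached patch sets a read timeout of 2 * OutboundTcpConnection.WAIT_FOR_VERSION_MAX_TIME on each accepted socket, so a peer that connects but never sends its header can no longer block SocketThread indefinitely.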
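      A minimal sketch of the same idea, again hypothetical and not the patch itself: before reading the header, the loop calls Socket.setSoTimeout on the accepted socket, so a silent peer costs at most one timeout instead of blocking the loop forever. TimedAcceptDemo, goodHeaderReadWithin and readTimeoutMillis are made-up names; readTimeoutMillis stands in for 2 * OutboundTcpConnection.WAIT_FOR_VERSION_MAX_TIME and its value here is arbitrary.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class TimedAcceptDemo {
    /**
     * Hypothetical sketch: the same single-threaded accept loop, but every accepted
     * socket gets a read timeout before the header is read. A silent client connects
     * first, then a well-behaved client sends its header. Returns true if that header
     * is read within waitMillis.
     */
    public static boolean goodHeaderReadWithin(int readTimeoutMillis, long waitMillis) {
        CountDownLatch goodHeaderSeen = new CountDownLatch(1);
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            Thread acceptLoop = new Thread(() -> {
                while (!server.isClosed()) {
                    Socket s;
                    try {
                        s = server.accept();
                    } catch (IOException e) {
                        return;                                   // server socket closed
                    }
                    try (Socket accepted = s) {
                        accepted.setSoTimeout(readTimeoutMillis); // bound the blocking read
                        new DataInputStream(accepted.getInputStream()).readInt();
                        goodHeaderSeen.countDown();
                    } catch (SocketTimeoutException e) {
                        // Silent peer: drop it and go back to accept().
                    } catch (IOException e) {
                        // Peer closed or reset before sending a header: also move on.
                    }
                }
            });
            acceptLoop.setDaemon(true);
            acceptLoop.start();

            try (Socket silent = new Socket(server.getInetAddress(), server.getLocalPort());
                 Socket good = new Socket(server.getInetAddress(), server.getLocalPort())) {
                new DataOutputStream(good.getOutputStream()).writeInt(42);
                return goodHeaderSeen.await(waitMillis, TimeUnit.MILLISECONDS);
            }
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

      With TimedAcceptDemo.goodHeaderReadWithin(200, 3000), the silent socket should time out after about 200 ms and the second client's header should then be read. A timeout bounds the stall but keeps the loop serial: several silent peers in a row still delay accept() by one timeout each. Handing each accepted socket to its own thread before reading from it is the other common way to keep an accept loop responsive.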

      Attachments

        1. handshake.stack.txt
          300 kB
          Chris Burroughs
        2. 0001-don-t-block-SocketThread-for-MessagingService.patch
          2 kB
          Wei Yang


          People

            Wei Yang (wy96f)
            Brandon Williams