Solr / SOLR-11881

Retry update requests from leaders to replicas


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 7.5, 8.0

    Description

      A connection reset can trigger LIR (leader-initiated recovery).

      If a leader -> replica update hits a connection reset like the one below, the leader will initiate LIR:

      2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX r:core_node56 collection_shardX_replicaY] o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on replica https://host08.domain:8985/solr/collection_shardX_replicaY/
      java.net.SocketException: Connection reset
              at java.net.SocketInputStream.read(SocketInputStream.java:210)
              at java.net.SocketInputStream.read(SocketInputStream.java:141)
              at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
              at sun.security.ssl.InputRecord.read(InputRecord.java:503)
              at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
              at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
              at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
              at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
              at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
              at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
              at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
              at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
              at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
              at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
              at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
              at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
              at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
              at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
              at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
              at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      

      In https://issues.apache.org/jira/browse/SOLR-6931, Mark says: "On a heavy working SolrCloud cluster, even a rare response like this from a replica can cause a recovery and heavy cluster disruption."

      In SOLR-6931 we added an HTTP retry handler, but it only retries GET requests. Updates are sent as POST requests (see ConcurrentUpdateSolrClient#sendUpdateStream), so they are never retried.
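
To make the gap concrete, here is a minimal sketch in plain Java of a GET-only retry rule next to one that also admits leader-to-replica updates. It is not Solr's or HttpClient's actual handler: the class `UpdateRetrySketch`, its method names, and the `leaderToReplicaUpdate` flag are all hypothetical, and it assumes the caller knows whether a POST is a leader-to-replica update.

```java
import java.net.SocketException;
import java.util.concurrent.Callable;

/** Hypothetical sketch; not Solr's or HttpClient's actual retry handler. */
final class UpdateRetrySketch {

    /** GET-only rule: any other HTTP method is never retried. */
    static boolean getOnlyPolicy(String method, int attempt, int maxRetries) {
        return attempt <= maxRetries && "GET".equals(method);
    }

    /** Assumed alternative: also retry POSTs flagged as leader-to-replica updates. */
    static boolean updateAwarePolicy(String method, boolean leaderToReplicaUpdate,
                                     int attempt, int maxRetries) {
        if (attempt > maxRetries) {
            return false;
        }
        return "GET".equals(method) || ("POST".equals(method) && leaderToReplicaUpdate);
    }

    /** Runs the request, retrying on SocketException while the policy allows it. */
    static <T> T sendWithRetry(Callable<T> request, String method,
                               boolean leaderToReplicaUpdate, int maxRetries) throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return request.call();
            } catch (SocketException e) {
                attempt++;
                if (!updateAwarePolicy(method, leaderToReplicaUpdate, attempt, maxRetries)) {
                    throw e; // out of retries, or not a retry-safe request
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated request: the first attempt fails with a connection reset.
        String result = sendWithRetry(() -> {
            if (calls[0]++ == 0) {
                throw new SocketException("Connection reset");
            }
            return "ok";
        }, "POST", true, 1);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Under the GET-only rule the simulated POST would surface the `SocketException` on the first failure, which is the situation that leads the leader to start LIR.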

      Update requests from the leader to a replica should be safe to retry, since they have already been versioned.
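
The reasoning behind "versioned means retry-able" can be illustrated with a simplified model. This is only a sketch, assuming the replica compares the leader-assigned version with the one it has stored and drops anything that is not newer; `VersionedReplicaSketch` and its methods are hypothetical names, not Solr classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of a replica applying versioned updates; not Solr's code. */
final class VersionedReplicaSketch {
    private final Map<String, Long> versions = new HashMap<>();
    private final Map<String, String> docs = new HashMap<>();

    /** Applies the update only if its version is newer than the stored one. */
    boolean apply(String id, long version, String body) {
        Long current = versions.get(id);
        if (current != null && current >= version) {
            return false; // duplicate or reordered update: drop it
        }
        versions.put(id, version);
        docs.put(id, body);
        return true;
    }

    String get(String id) {
        return docs.get(id);
    }

    public static void main(String[] args) {
        VersionedReplicaSketch replica = new VersionedReplicaSketch();
        boolean first = replica.apply("doc1", 100L, "v100");
        boolean retry = replica.apply("doc1", 100L, "v100"); // same update, resent
        System.out.println(first + " " + retry + " " + replica.get("doc1"));
    }
}
```

In this model a resent update is a no-op on the replica, so retrying after a connection reset cannot apply the same change twice.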

      Attachments

        1. SOLR-11881.patch
          2 kB
          Varun Thacker
        2. SOLR-11881.patch
          2 kB
          Varun Thacker
        3. SOLR-11881-SolrCmdDistributor.patch
          17 kB
          Tomas Eduardo Fernandez Lobbe
        4. SOLR-11881.patch
          47 kB
          Tomas Eduardo Fernandez Lobbe


          People

            Assignee: Tomas Eduardo Fernandez Lobbe (tflobbe)
            Reporter: Varun Thacker (varun)
            Votes: 0
            Watchers: 6
