  1. Solr
  2. SOLR-17143

Streaming with multiple shards can trigger unexpected IdleTimeout

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 9.4.1
    • Fix Version/s: None
    • Component/s: SolrCloud
    • Labels: None

    Description

      With the newly submitted test case, we reproduced a streaming issue that we see in our production cloud environment.

      The test case creates a collection with 2 shards and indexes 20k docs into it. 10k docs have ids with the routing prefix `a`, and the other 10k have the prefix `c`. The two prefixes hash to different shards, producing 2 shards of 10k docs each.

      Now, if we stream sorted by id, both shards send back some data initially. However, because of the sorted iteration, only one shard (the one hosting prefix `a`) has continued traffic. The other shard eventually throws an IdleTimeout because its stream sits pending without network activity.
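
To illustrate why the second shard goes quiet, here is a minimal simulation of a sorted merge over two shard streams. It is only a sketch, not Solr's actual merge code: it uses Python's `heapq.merge` as a stand-in for the aggregator, the names `shard_stream`, `longest_idle_gap`, and `simulate` are made up for illustration, and "idle" is measured in documents pulled from other shards rather than in seconds.

```python
import heapq


def shard_stream(prefix, count, log):
    """Yield sorted ids for one shard, recording every pull in `log`."""
    for i in range(count):
        log.append(prefix)  # one entry per document read from this shard
        yield f"{prefix}!{i:05d}"


def longest_idle_gap(log, shard):
    """Longest run of pulls from other shards between two pulls from `shard`."""
    longest = 0
    current = 0
    seen = False
    for entry in log:
        if entry == shard:
            if seen:
                longest = max(longest, current)
            seen = True
            current = 0
        elif seen:
            current += 1
    return longest


def simulate(docs_per_shard):
    """Merge two shard streams by id and return the idle gaps for (a, c)."""
    log = []
    merged = heapq.merge(
        shard_stream("a", docs_per_shard, log),
        shard_stream("c", docs_per_shard, log),
    )
    for _ in merged:
        pass  # the client consumes every document in sorted order
    return longest_idle_gap(log, "a"), longest_idle_gap(log, "c")


if __name__ == "__main__":
    gap_a, gap_c = simulate(10_000)
    print(f"shard a longest idle gap: {gap_a} docs")
    print(f"shard c longest idle gap: {gap_c} docs")
```

With 10k docs per shard, the `c` stream is read once at the start and then not again until 9,999 further docs have been pulled from the `a` stream. If the client takes longer than the idle timeout to consume those docs, the connection to the `c` shard sees no traffic for that whole period.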

      If we change `SHARD_COUNT` in the test case from 2 to 1, the test case runs fine.

      In our environment, the Jetty HTTP connector timeout is set to 120 secs, yet we still run into this occasionally. The client does consume the data at a reasonable rate. However, with up to 1024 shards per collection, it is quite easy for some shards to have no data streamed within 120 secs, which triggers the timeout mentioned above.
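
As a rough back-of-the-envelope check, the idle time for a shard is about the number of docs that sort ahead of its next doc divided by the client's consumption rate. The sketch below assumes a constant consumption rate; the function names and the example numbers are hypothetical, and only the 120 sec timeout comes from our setup.

```python
def idle_seconds(docs_ahead, docs_per_second):
    """Estimated time a shard waits while `docs_ahead` docs from other shards are consumed."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return docs_ahead / docs_per_second


def will_time_out(docs_ahead, docs_per_second, idle_timeout_s=120.0):
    """True if the estimated wait exceeds the connector idle timeout."""
    return idle_seconds(docs_ahead, docs_per_second) > idle_timeout_s


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 docs sort ahead, client reads 50 docs/sec.
    print(idle_seconds(10_000, 50), will_time_out(10_000, 50))
```

Under this estimate, the more shards hold docs that sort ahead of a given shard's next doc, the longer that shard waits, so the chance of crossing the timeout grows with shard count.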

      We assume this kind of streaming issue is not uncommon in distributed systems, and we are wondering what could be done to fix or mitigate it.

      Several ideas that we have:
      1. If possible, we might want to stream per shard instead of per collection. However, there are cases where we do want to stream the whole collection in sorted order.
      2. Is there any low-level "keep-alive" already built in? I couldn't find one so far.
      3. Keep the stream alive by pushing a small amount of dummy data from the aggregator (the Solr node that distributes the stream request as /export to other nodes). However, this got very hacky and is still not working. I didn't dig too deep, as I wish to surface this issue to the Solr community and gather some thoughts first!

          People

            Assignee: Unassigned
            Reporter: Patson Luk (patson)