Solr / SOLR-12658

Extend support for more than 4 fields in 'partitionKeys' in ParallelStream after SOLR-11598


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: streaming expressions
    • Labels: None

    Description

      SOLR-11598 extended the Export handler to support sorting on more than 4 fields.

      Since streaming expressions leverage the Export handler, ParallelStream should follow suit; today it allows a maximum of 4 fields in "partitionKeys" and silently ignores the remaining fields if more than 4 are specified.

      HashQParserPlugin.CompositeHash (line 347):

        private static class CompositeHash implements HashKey {
      
          private HashKey key1;
          private HashKey key2;
          private HashKey key3;
          private HashKey key4;
      
          public CompositeHash(HashKey[] hashKeys) {
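            // only the first four keys are consulted; any additional partitionKeys are silently dropped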
            key1 = hashKeys[0];
            key2 = hashKeys[1];
            key3 = (hashKeys.length > 2) ? hashKeys[2] : new ZeroHash();
            key4 = (hashKeys.length > 3) ? hashKeys[3] : new ZeroHash();
          }
      
          public void setNextReader(LeafReaderContext context) throws IOException {
            key1.setNextReader(context);
            key2.setNextReader(context);
            key3.setNextReader(context);
            key4.setNextReader(context);
          }
      
          public long hashCode(int doc) throws IOException {
            return key1.hashCode(doc)+key2.hashCode(doc)+key3.hashCode(doc)+key4.hashCode(doc);
          }
        }
      

      To make sure documents are distributed across workers when a streaming expression is executed in parallel, all of the fields specified in 'partitionKeys' should be taken into account when deciding which worker a given document goes to for further processing (see the sketch below).

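      One way to lift the limit, sketched here purely for illustration (the attached SOLR-12658.patch may take a different approach), is to let CompositeHash hold the full array of keys and iterate over it instead of four hard-coded fields. The names mirror the existing HashQParserPlugin code:

        private static class CompositeHash implements HashKey {

          // all partition keys, however many were specified
          private final HashKey[] keys;

          public CompositeHash(HashKey[] hashKeys) {
            this.keys = hashKeys;
          }

          public void setNextReader(LeafReaderContext context) throws IOException {
            for (HashKey key : keys) {
              key.setNextReader(context);
            }
          }

          public long hashCode(int doc) throws IOException {
            long hash = 0;
            for (HashKey key : keys) {
              hash += key.hashCode(doc);
            }
            return hash;
          }
        }

      Summing the per-key hashes mirrors what the existing four-key implementation already does, so the change would be confined to the constructor and the two loops.
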
      A use-case where this flexibility would be beneficial:

      parallel(workerCollection,
               search(collection1, q=*:*, fl="id, org, dept, year, month, date, hour",
                      sort="org desc, dept desc, year desc, month desc, date desc, hour desc",
                      partitionKeys="org, dept, year, month"),
               workers="6",
               zkHost="localhost:9983",
               sort="year desc")
      

      In this case, we are partitioning on "org, dept, year, month".
      Now look at the data:
      org   dept   year  month  date  hour

      org1  dept1  1991  jan    24    11
      org1  dept1  1991  jan    24    12
      org1  dept1  1991  jan    24    13
      ....................
      ....................
      org2  dept1  1991  jan    24    11
      

      For the data to be distributed equally across the stated "6" workers, 6 corresponding subsets need to be created in the first place.
      As we can see in the data, the specified partition keys produce only two unique combinations

      {"org1 dept1 1991 jan", "org2 dept2 1991 jan"}

      so only 2 of the 6 workers will be used.
      Also, the data contains far more documents for "org1" than for "org2", leaving one worker doing much more work than the other; a better partitioning of the data could have made document processing more efficient.
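
      To make the distribution problem concrete, here is a small self-contained sketch (hypothetical class and helper names, with a plain string hash standing in for the per-document HashKey machinery) that counts how many of the 6 workers actually receive documents for the two choices of partition keys:

        import java.util.Arrays;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Set;

        public class PartitionKeyDemo {

          // naive stand-in for the per-document hash: join the key values and mod by the worker count
          static int workerFor(List<String> keyValues, int numWorkers) {
            return Math.floorMod(String.join("|", keyValues).hashCode(), numWorkers);
          }

          public static void main(String[] args) {
            int numWorkers = 6;
            // columns: org, dept, year, month, date, hour
            String[][] docs = {
                {"org1", "dept1", "1991", "jan", "24", "11"},
                {"org1", "dept1", "1991", "jan", "24", "12"},
                {"org1", "dept1", "1991", "jan", "24", "13"},
                {"org2", "dept1", "1991", "jan", "24", "11"},
            };

            // partitioning on the first 4 fields: only two unique key combinations,
            // so at most 2 of the 6 workers can ever receive documents
            Set<Integer> workersWith4Keys = new HashSet<>();
            // partitioning on all 6 fields: each distinct (date, hour) combination
            // can hash to a different worker
            Set<Integer> workersWith6Keys = new HashSet<>();

            for (String[] doc : docs) {
              workersWith4Keys.add(workerFor(Arrays.asList(doc).subList(0, 4), numWorkers));
              workersWith6Keys.add(workerFor(Arrays.asList(doc), numWorkers));
            }

            System.out.println("workers used with 4 partition keys: " + workersWith4Keys.size());
            System.out.println("workers used with 6 partition keys: " + workersWith6Keys.size());
          }
        }

      With only 4 partition keys the first set can never grow beyond 2 regardless of how many workers are configured, while with all 6 keys the documents have a chance to spread across every worker.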

      Attachments

        1. SOLR-12658.patch
          2 kB
          Amrit Sarkar


          People

            Assignee: Unassigned
            Reporter: Amrit Sarkar (sarkaramrit2@gmail.com)
            Votes: 0
            Watchers: 1
