Solr / SOLR-12658

Extend support for more than 4 fields in 'partitionKeys' in ParallelStream after SOLR-11598


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: streaming expressions
    • Labels: None

    Description

      SOLR-11598 extended the Export handler to support sorting on more than 4 fields.

      Since streaming expressions leverage the Export handler, ParallelStream should follow suit; today it allows a maximum of 4 fields in "partitionKeys" and silently ignores the remaining fields if more than 4 are specified.

      HashQParserPlugin.CompositeHash (line 347):

        private static class CompositeHash implements HashKey {
      
          private HashKey key1;
          private HashKey key2;
          private HashKey key3;
          private HashKey key4;
      
          public CompositeHash(HashKey[] hashKeys) {
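            // only the first four keys are consulted; any additional partitionKeys are silently dropped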
            key1 = hashKeys[0];
            key2 = hashKeys[1];
            key3 = (hashKeys.length > 2) ? hashKeys[2] : new ZeroHash();
            key4 = (hashKeys.length > 3) ? hashKeys[3] : new ZeroHash();
          }
      
          public void setNextReader(LeafReaderContext context) throws IOException {
            key1.setNextReader(context);
            key2.setNextReader(context);
            key3.setNextReader(context);
            key4.setNextReader(context);
          }
      
          public long hashCode(int doc) throws IOException {
            return key1.hashCode(doc)+key2.hashCode(doc)+key3.hashCode(doc)+key4.hashCode(doc);
          }
        }
      

      To make sure documents are distributed across workers when a streaming expression is executed in parallel, all of the fields specified in 'partitionKeys' should be taken into account when deciding which worker a given document goes to for further processing (see the sketch below).

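      One way to lift the limit, sketched here purely for illustration (the attached SOLR-12658.patch may take a different approach), is to let CompositeHash hold the full array of keys and iterate over it instead of four hard-coded fields. The names mirror the existing HashQParserPlugin code:

        private static class CompositeHash implements HashKey {

          // all partition keys, however many were specified
          private final HashKey[] keys;

          public CompositeHash(HashKey[] hashKeys) {
            this.keys = hashKeys;
          }

          public void setNextReader(LeafReaderContext context) throws IOException {
            for (HashKey key : keys) {
              key.setNextReader(context);
            }
          }

          public long hashCode(int doc) throws IOException {
            long hash = 0;
            for (HashKey key : keys) {
              hash += key.hashCode(doc);
            }
            return hash;
          }
        }

      Summing the per-key hashes mirrors what the existing four-key implementation already does, so the change would be confined to the constructor and the two loops.
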
      A use-case where this flexibility would be beneficial:

      parallel(workerCollection,
               search(collection1, q=*:*, fl="id, org, dept, year, month, date, hour",
                      sort="org desc, dept desc, year desc, month desc, date desc, hour desc",
                      partitionKeys="org, dept, year, month"),
               workers="6",
               zkHost="localhost:9983",
               sort="year desc")
      

      In this case, we are partitioning on "org, dept, year, month".
      Now look at the data:
      org   dept   year  month  date  hour

      org1  dept1  1991  jan    24    11
      org1  dept1  1991  jan    24    12
      org1  dept1  1991  jan    24    13
      ....................
      ....................
      org2  dept1  1991  jan    24    11
      

      For the data to be distributed equally across the stated "6" workers, 6 corresponding subsets need to be created in the first place.
      As we can see in the data, the specified partition keys produce only two unique combinations

      {"org1 dept1 1991 jan", "org2 dept2 1991 jan"}

      so only 2 of the 6 workers will be used.
      Also, the data contains far more documents for "org1" than for "org2", leaving one worker doing much more work than the other; a better partitioning of the data could have made document processing more efficient.
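
      To make the distribution problem concrete, here is a small self-contained sketch (hypothetical class and helper names, with a plain string hash standing in for the per-document HashKey machinery) that counts how many of the 6 workers actually receive documents for the two choices of partition keys:

        import java.util.Arrays;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Set;

        public class PartitionKeyDemo {

          // naive stand-in for the per-document hash: join the key values and mod by the worker count
          static int workerFor(List<String> keyValues, int numWorkers) {
            return Math.floorMod(String.join("|", keyValues).hashCode(), numWorkers);
          }

          public static void main(String[] args) {
            int numWorkers = 6;
            // columns: org, dept, year, month, date, hour
            String[][] docs = {
                {"org1", "dept1", "1991", "jan", "24", "11"},
                {"org1", "dept1", "1991", "jan", "24", "12"},
                {"org1", "dept1", "1991", "jan", "24", "13"},
                {"org2", "dept1", "1991", "jan", "24", "11"},
            };

            // partitioning on the first 4 fields: only two unique key combinations,
            // so at most 2 of the 6 workers can ever receive documents
            Set<Integer> workersWith4Keys = new HashSet<>();
            // partitioning on all 6 fields: each distinct (date, hour) combination
            // can hash to a different worker
            Set<Integer> workersWith6Keys = new HashSet<>();

            for (String[] doc : docs) {
              workersWith4Keys.add(workerFor(Arrays.asList(doc).subList(0, 4), numWorkers));
              workersWith6Keys.add(workerFor(Arrays.asList(doc), numWorkers));
            }

            System.out.println("workers used with 4 partition keys: " + workersWith4Keys.size());
            System.out.println("workers used with 6 partition keys: " + workersWith6Keys.size());
          }
        }

      With only 4 partition keys the first set can never grow beyond 2 regardless of how many workers are configured, while with all 6 keys the documents have a chance to spread across every worker.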

      Attachments

        1. SOLR-12658.patch
          2 kB
          Amrit Sarkar


          People

            Assignee: Unassigned
            Reporter: Amrit Sarkar (sarkaramrit2@gmail.com)
            Votes: 0
            Watchers: 1
