Details
Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Description
SOLR-11598 extended the Export handler to support sorting on more than 4 fields.
Streaming expressions leverage the Export handler, but ParallelStream still allows a maximum of 4 fields in "partitionKeys" and silently ignores the rest if more than 4 are specified.
HashQParserPlugin:CompositeHash: 347
private static class CompositeHash implements HashKey {

  private HashKey key1;
  private HashKey key2;
  private HashKey key3;
  private HashKey key4;

  public CompositeHash(HashKey[] hashKeys) {
    key1 = hashKeys[0];
    key2 = hashKeys[1];
    key3 = (hashKeys.length > 2) ? hashKeys[2] : new ZeroHash();
    key4 = (hashKeys.length > 3) ? hashKeys[3] : new ZeroHash();
  }

  public void setNextReader(LeafReaderContext context) throws IOException {
    key1.setNextReader(context);
    key2.setNextReader(context);
    key3.setNextReader(context);
    key4.setNextReader(context);
  }

  public long hashCode(int doc) throws IOException {
    return key1.hashCode(doc)+key2.hashCode(doc)+key3.hashCode(doc)+key4.hashCode(doc);
  }
}
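One possible direction, shown only as a minimal sketch: hold the keys in an array so that every field listed in "partitionKeys" contributes to the hash. The names SketchHashKey and NKeyCompositeHash are hypothetical and not part of Solr. The interface is a simplified stand-in for HashKey that leaves out setNextReader(LeafReaderContext) so the example runs without Lucene on the classpath.

```java
// Simplified stand-in for Solr's HashKey (assumption: the real interface
// also declares setNextReader(LeafReaderContext), omitted here).
interface SketchHashKey {
    long hashCode(int doc);
}

// Hypothetical N-key variant of CompositeHash: like the original, it sums
// the per-key hashes, but it does so over however many keys were supplied.
final class NKeyCompositeHash implements SketchHashKey {
    private final SketchHashKey[] keys;

    NKeyCompositeHash(SketchHashKey... keys) {
        if (keys.length == 0) {
            throw new IllegalArgumentException("at least one hash key is required");
        }
        this.keys = keys.clone();
    }

    @Override
    public long hashCode(int doc) {
        long hash = 0;
        for (SketchHashKey key : keys) {
            hash += key.hashCode(doc); // no key is silently dropped
        }
        return hash;
    }
}
```

With this shape, a fifth or sixth partition key changes the resulting hash instead of being ignored.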
To make sure documents are distributed across workers when a streaming expression is executed in parallel, all the fields specified in 'partitionKeys' should be considered when calculating which worker a particular document should go to for further processing.
A use case where this flexibility would be beneficial:
parallel(workerCollection,
         search(collection1,
                q=*:*,
                fl="id, org, dept, year, month, date, hour",
                sort="org desc, dept desc, year desc, month desc, date desc, hour desc",
                partitionKeys="org, dept, year, month"),
         workers="6",
         zkHost="localhost:9983",
         sort="year desc")
In this case, we are partitioning on "org, dept, year, month".
Now look at the data:
org   dept   year  month  date  hour
org1  dept1  1991  jan    24    11
org1  dept1  1991  jan    24    12
org1  dept1  1991  jan    24    13
....................
....................
org2  dept1  1991  jan    24    11
For the data to be distributed equally across the stated "6" workers, 6 corresponding subsets need to be created in the first place.
As we can see in the data, the specified partition keys have only two unique combinations of values, so at most 2 of the 6 workers will be used.
Also, the data contains many more documents for "org1" than for "org2", so one worker does much more work than the other, whereas a better partitioning of the data could have optimised the processing of documents.
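The effect described above can be illustrated with a small, self-contained sketch. It assumes a worker is chosen as the non-negative hash modulo the worker count, and it uses String.hashCode() as a stand-in for Solr's per-field hashing. The class PartitionDemo and its methods are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class PartitionDemo {
    // Assumed selection rule: sum the per-field hashes, clear the sign bit,
    // then take the remainder modulo the number of workers.
    static int workerFor(int workers, String... keyValues) {
        long hash = 0;
        for (String value : keyValues) {
            hash += value.hashCode(); // stand-in for the real field hash
        }
        return (int) ((hash & 0x7FFFFFFF) % workers);
    }

    // Returns the distinct workers that receive at least one row when the
    // first keyCount columns of each row are used as partition keys.
    static Set<Integer> workersUsed(int workers, String[][] rows, int keyCount) {
        Set<Integer> used = new TreeSet<>();
        for (String[] row : rows) {
            used.add(workerFor(workers, Arrays.copyOf(row, keyCount)));
        }
        return used;
    }

    public static void main(String[] args) {
        // Columns: org, dept, year, month, date, hour
        String[][] rows = {
            {"org1", "dept1", "1991", "jan", "24", "11"},
            {"org1", "dept1", "1991", "jan", "24", "12"},
            {"org1", "dept1", "1991", "jan", "24", "13"},
            {"org2", "dept1", "1991", "jan", "24", "11"},
        };
        System.out.println("4 keys: workers used = " + workersUsed(6, rows, 4));
        System.out.println("6 keys: workers used = " + workersUsed(6, rows, 6));
    }
}
```

With 4 partition keys the sample rows contain only two distinct key combinations, so no more than two workers can ever be selected. Including "date" and "hour" gives each row its own combination and therefore a chance of landing on a different worker.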