Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Copy-pasting the reasoning mentioned on this discussion thread.
Let me state why I think "jobmanager.adaptive-batch-scheduler.default-source-parallelism" should not be bound by the "jobmanager.adaptive-batch-scheduler.max-parallelism".
- Source vertex is unique and does not have any upstream vertices
- Downstream vertices read shuffled data partitioned by key, which is not the case for the Source vertex
- Limiting source parallelism by downstream vertices' max parallelism is incorrect
- If we say that, for "semantic consistency", the source vertex parallelism has to be bound by the overall job's max parallelism, it can lead to the following issues:
  - A job with high filter selectivity and huge amounts of data to read needs high source parallelism even though its downstream vertices do not.
  - Setting a high "jobmanager.adaptive-batch-scheduler.max-parallelism" just so that source parallelism can be set higher can lead to small blocks and sub-optimal performance.
  - Setting a high "jobmanager.adaptive-batch-scheduler.max-parallelism" requires careful tuning of network buffer configurations, which is unnecessary work when the only reason for raising it is to allow high source parallelism.
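
To make the argument above concrete, here is a minimal sketch of the two behaviors being compared. It is not Flink code: the helper `effective_source_parallelism` and the numeric values are hypothetical and chosen only for illustration. Only the two configuration keys come from this issue.

```python
def effective_source_parallelism(default_source_parallelism: int,
                                 max_parallelism: int,
                                 bound_by_max: bool) -> int:
    """Hypothetical model of how the source parallelism could be resolved.

    bound_by_max=True models the behavior this issue objects to: the
    source parallelism is capped at the job-wide max parallelism.
    bound_by_max=False models the proposed behavior: the source
    parallelism is taken as configured.
    """
    if bound_by_max:
        return min(default_source_parallelism, max_parallelism)
    return default_source_parallelism


# Illustrative values only (assumption): a scan-heavy job whose filter
# discards most rows, so downstream vertices need little parallelism.
config = {
    "jobmanager.adaptive-batch-scheduler.default-source-parallelism": 2000,
    "jobmanager.adaptive-batch-scheduler.max-parallelism": 128,
}

source = config["jobmanager.adaptive-batch-scheduler.default-source-parallelism"]
maximum = config["jobmanager.adaptive-batch-scheduler.max-parallelism"]

bound = effective_source_parallelism(source, maximum, bound_by_max=True)
unbound = effective_source_parallelism(source, maximum, bound_by_max=False)

print(f"bound by max-parallelism: {bound}")
print(f"not bound by max-parallelism: {unbound}")
```

Under the bound model, the only way to reach the desired source parallelism is to raise "jobmanager.adaptive-batch-scheduler.max-parallelism" for the whole job, which is what triggers the small-block and network-buffer concerns listed above.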