Details
Description
Using the Java API, is there a way to efficiently retrieve a DataFrame as Arrow batches?
I have a fairly large dataset on my cluster, so I cannot collect it with collectAsList, which downloads everything at once and saturates my JVM memory.
Seeing that Arrow is becoming a standard for transferring large datasets, and that Spark already makes heavy use of Arrow, is there a way to transfer my Spark DataFrame as Arrow batches?
This would be ideal to process the data batch per batch and avoid saturating the memory.
I am looking for an API like this (in Java):

```java
var stream = dataframe.collectAsArrowStream();
while (stream.hasNextBatch()) {
    var batch = stream.getNextBatch();
    // do some stuff with the Arrow batch
}
```
It would be even better if I could split the DataFrame into several streams, so I can download and process them in parallel.
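For context, the `collectAsArrowStream` / `hasNextBatch` / `getNextBatch` names above are hypothetical; as far as I know, Spark's public Java API does not expose the collected result as Arrow batches (the Arrow conversion paths are internal). The closest public building block I'm aware of is `Dataset.toLocalIterator()`, which streams one partition at a time to the driver instead of materializing everything, though it yields rows rather than Arrow batches. A minimal sketch, assuming a `Dataset<Row>` obtained from an existing SparkSession:

```java
import java.util.Iterator;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class BatchedCollect {

    // Sketch: process a large Dataset without collecting it all at once.
    // toLocalIterator() fetches partitions lazily, so the driver only
    // holds one partition's worth of rows in memory at a time.
    static void processIncrementally(Dataset<Row> dataframe) {
        Iterator<Row> it = dataframe.toLocalIterator();
        while (it.hasNext()) {
            Row row = it.next();
            // do some stuff with each row (rows, not Arrow batches)
        }
    }
}
```

This only approximates the memory behavior I'm after; it does not give me actual Arrow record batches, which is why I'm asking whether such an API exists.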