Details
Improvement
Status: Open
Major
Resolution: Unresolved
3.1.0
None
None
Description
At the moment, PySpark reads binary files directly into a byte array. This means the full binary file is read into memory immediately, which 1) is memory-inefficient and 2) differs from the Scala implementation (see the PySpark source here: https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/context.html#SparkContext.binaryFiles).
In Scala, Spark returns a PortableDataStream, which means the application does not need to read the full content of the stream into memory to work on it (see https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext).
Hence, it is proposed to adapt the PySpark implementation to return something similar to Scala's PortableDataStream (e.g. BytesIO).
Reading binary files efficiently is crucial for many IoT applications, and potentially for other fields as well (e.g. disk image analysis in forensics).
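To illustrate what a stream-like return value would allow on the application side, here is a minimal sketch in plain Python. It is not PySpark code and does not reflect any existing or planned PySpark API. The helper names iter_chunks and sha256_of_stream are hypothetical. The io.BytesIO object only stands in for whatever file-like object PySpark might hand to the application. A BytesIO itself holds its whole content in memory, so the memory benefit would depend on the stream being backed by the underlying file.

```python
import hashlib
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        yield chunk


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 64 * 1024) -> str:
    """Hash a stream incrementally, holding only one chunk at a time."""
    digest = hashlib.sha256()
    for chunk in iter_chunks(stream, chunk_size):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a stream handed to the application (assumption: any object
# with a read(n) method would work the same way).
payload = bytes(range(256)) * 1024
stream = io.BytesIO(payload)
streamed_digest = sha256_of_stream(stream, chunk_size=4096)
```

The application code only depends on read(n), so it would work unchanged whether the stream is an in-memory buffer or a handle that fetches data lazily from the file system.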