Spark / SPARK-42471 Distributed ML <> Spark Connect / SPARK-43715

Add spark DataFrame binary file format writer


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Do
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      In the new distributed Spark ML module (designed to support Spark Connect and local inference),

      we need to save ML models to a Hadoop file system using a custom binary file format, for two reasons:

      • We often submit a Spark application to a Spark cluster to run the model-training job, and we need to save the trained model to the Hadoop file system before the Spark application completes.
      • We also want to support local model inference. If we save the model with the current Spark DataFrame writer (e.g. in Parquet format), loading it requires a running Spark service. We want to be able to load the model without one, so the model should be saved in its original binary format, which our ML code can handle directly.

      We already have a reader API for the "binaryFile" format; we need to add a matching writer API:

      Writer API:

      Suppose we have a DataFrame with the schema:

      [file_path: String, content: binary]

      We can then save the DataFrame to a Hadoop path, writing each row as a file under that path. The saved file path is {hadoop path}/{file_path}, where "file_path" may consist of multiple path components.
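      The proposed writer semantics can be sketched in plain Python. This is illustrative only: it writes to the local filesystem in place of a Hadoop-compatible one, and the function name `write_binary_files` is hypothetical, not an API from this ticket.

      ```python
      import os


      def write_binary_files(rows, output_dir):
          """Write (file_path, content) rows as individual binary files.

          Illustrative sketch of the proposed "binaryFile" writer semantics:
          each row is saved to {output_dir}/{file_path}, and file_path may
          contain multiple path components (subdirectories are created).
          """
          for file_path, content in rows:
              # e.g. file_path = "model/part-00000.bin"
              dest = os.path.join(output_dir, file_path)
              os.makedirs(os.path.dirname(dest), exist_ok=True)
              with open(dest, "wb") as f:
                  f.write(content)
      ```

      With such a writer in Spark, the call site would presumably look like `df.write.format("binaryFile").save(path)`, mirroring the existing `spark.read.format("binaryFile").load(path)` reader; the exact writer API was never finalized, since the issue was resolved as Won't Do.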


          People

            Assignee: weichenxu123 Weichen Xu
            Reporter: weichenxu123 Weichen Xu
