Spark / SPARK-42471 Distributed ML <> Spark Connect / SPARK-43715

Add spark DataFrame binary file format writer


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Do
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      In the new distributed Spark ML module (designed to support Spark Connect and local inference),

      we need to save ML models to a Hadoop file system using a custom binary file format, for two reasons:

      • We often submit a Spark application to a Spark cluster to run the model-training job, and we need to save the trained model to the Hadoop file system before the Spark application completes.
      • We also want to support local model inference. If we save the model with the current Spark DataFrame writer (e.g. in Parquet format), loading it requires a running Spark service. We want to be able to load the model without one, so the model should be saved in its original binary format, which our ML code can handle directly.

      We already have a reader API for the "binaryFile" format; we need to add a matching writer API:

      Writer API:

      Suppose we have a DataFrame with the schema:

      [file_path: String, content: binary]

      We can then save the DataFrame to a Hadoop path, writing each row as a file under that path. The saved file path is {hadoop path}/{file_path}, where "file_path" may consist of multiple path components.
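      The proposed writer semantics can be sketched in plain Python. This is illustrative only: it writes to the local filesystem in place of a Hadoop-compatible one, and the function name `write_binary_files` is hypothetical, not an API from this ticket.

      ```python
      import os


      def write_binary_files(rows, output_dir):
          """Write (file_path, content) rows as individual binary files.

          Illustrative sketch of the proposed "binaryFile" writer semantics:
          each row is saved to {output_dir}/{file_path}, and file_path may
          contain multiple path components (subdirectories are created).
          """
          for file_path, content in rows:
              # e.g. file_path = "model/part-00000.bin"
              dest = os.path.join(output_dir, file_path)
              os.makedirs(os.path.dirname(dest), exist_ok=True)
              with open(dest, "wb") as f:
                  f.write(content)
      ```

      With such a writer in Spark, the call site would presumably look like `df.write.format("binaryFile").save(path)`, mirroring the existing `spark.read.format("binaryFile").load(path)` reader; the exact writer API was never finalized, since the issue was resolved as Won't Do.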


          People

            Assignee: weichenxu123 Weichen Xu
            Reporter: weichenxu123 Weichen Xu
