Spark / SPARK-29272

dataframe.write.format("libsvm").save() takes too much time


Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      I have a PySpark DataFrame with about 10 thousand records. Writing the whole dataset with the PySpark API takes 10 seconds. When I use the filter API to select just 10 records and write that temp_df, it still takes 8 seconds. Why does it take so much time? How can I improve it? Thank you!

      from pyspark.mllib.util import MLUtils

      MLUtils.convertVectorColumnsToML(dataframe).write.format("libsvm").mode('overwrite').save('path')

      temp_df = dataframe.filter(dataframe['__index'].between(0, 10))
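
      For reference, a minimal end-to-end sketch of the reported workflow. The column names (label, features, __index), the output paths, and the in-memory sample data are assumptions for illustration, not taken from the original report:

      from pyspark.sql import SparkSession
      from pyspark.mllib.linalg import Vectors
      from pyspark.mllib.util import MLUtils

      spark = SparkSession.builder.appName("libsvm-write-sketch").getOrCreate()

      # Hypothetical stand-in for the ~10k-row DataFrame: a numeric label, an
      # old-style pyspark.mllib vector column, and a row index column "__index".
      rows = [(float(i % 2), Vectors.dense([float(i), i * 2.0]), i) for i in range(10000)]
      dataframe = spark.createDataFrame(rows, ["label", "features", "__index"])

      # Convert mllib vectors to ml vectors so the "libsvm" writer accepts them,
      # keep only the label/features columns the libsvm source expects, and write.
      full = MLUtils.convertVectorColumnsToML(dataframe).select("label", "features")
      full.write.format("libsvm").mode("overwrite").save("/tmp/full_libsvm")

      # Same write for a 10-row slice selected by index.
      temp_df = dataframe.filter(dataframe["__index"].between(0, 10))
      small = MLUtils.convertVectorColumnsToML(temp_df).select("label", "features")
      small.write.format("libsvm").mode("overwrite").save("/tmp/slice_libsvm")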

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: accelerator 张焕明
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved:
