Details
- Type: Question
- Status: Resolved
- Priority: Major
- Resolution: Invalid
- Affects Version: 2.2.0
- Fix Version: None
- Component: None
Description
I have a PySpark DataFrame with about 10,000 records. Dumping the whole dataset with the PySpark API takes 10 seconds. When I use the filter API to select just 10 records and dump temp_df again, it still takes 8 seconds. Why does it take so much time? How can I improve it? Thank you!
from pyspark.mllib.util import MLUtils

MLUtils.convertVectorColumnsToML(dataframe).write.format("libsvm").save('path', mode='overwrite')
temp_df = dataframe.filter(dataframe['__index'].between(0, 10))
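The issue itself contains no answer, but one likely explanation is that Spark transformations such as filter are lazy: they only record a plan, and the full scan of the source data happens when an action like write triggers execution, so selecting 10 records does not avoid reading the 10,000. A minimal pure-Python analogy using generators (the names here are illustrative, not part of the PySpark API):

```python
# Pure-Python sketch of lazy evaluation: the "filter" step costs nothing
# until a terminal action consumes the pipeline, which scans the whole source.
def scan_source(n, counter):
    """Yield n records, counting how many are actually read."""
    for i in range(n):
        counter[0] += 1
        yield i

counter = [0]
records = scan_source(10_000, counter)

# Lazy "transformation": no records have been read yet.
filtered = (r for r in records if 0 <= r <= 10)
assert counter[0] == 0

# "Action": consuming the pipeline still scans all 10,000 source records
# to find the 11 that match the predicate.
result = list(filtered)
print(len(result), counter[0])  # → 11 10000
```

In the same way, `dataframe.filter(...)` followed by a write can take nearly as long as writing the full dataset, because the write re-executes the whole plan over the source.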