Spark / SPARK-40063

pyspark.pandas .apply() changing rows ordering


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: Pandas API on Spark
    • Environment: Databricks Runtime 11.1

    Description

      When apply is used to run a function over a DataFrame column and the result is assigned back to that column, the column's values come back in a different row order.

      A command like the following triggers it:

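      # df is an existing pyspark.pandas DataFrame
      def example_func(df_col):
          # With axis=1 this receives a single cell value, not a whole column
          return df_col ** 2

      df['col_to_apply_function'] = df.apply(lambda row: example_func(row['col_to_apply_function']), axis=1)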

      A workaround is to assign the result to a new column instead of overwriting the original one. However, if the original column is then dropped, the rows are mixed up in the same way.
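Since example_func is element-wise, another thing to try is computing the new values as a column expression on the same DataFrame, so that no separate apply result has to be combined back into it. This is only a sketch: the report does not confirm that it avoids the reordering, and `square_column` is a hypothetical helper name, not part of pyspark.pandas.

```python
def square_column(frame, column):
    """Square `column` in place using an expression built from the same frame.

    Hypothetical helper for illustration. `frame` is assumed to be a
    pyspark.pandas (or pandas) DataFrame; only item access is used.
    """
    # frame[column] ** 2 stays anchored to `frame`, so the assignment does not
    # need to combine data coming from a different DataFrame.
    frame[column] = frame[column] ** 2
    return frame
```

With a pandas-on-Spark DataFrame this would be called as `df = square_column(df, 'col_to_apply_function')`.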

      Setting one of the columns as the index did not help either.
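To narrow the report down, it may help to separate "rows are returned in a different order" from "values are attached to the wrong rows". The sketch below keeps an untouched `id` column next to the transformed one and compares the two inside each row. The `id` column, the helper name `values_mispaired`, and the 5000-row size are assumptions made for illustration (the size is chosen on the assumption that small frames may take a pandas-only shortcut). Whether `compute.ops_on_diff_frames` has to be enabled depends on the Spark version.

```python
def example_func(value):
    # Same element-wise function as in the report.
    return value ** 2


def values_mispaired(n_rows=5000):
    """Return True if any squared value sits next to the wrong `id`.

    Assumption: a working Spark installation with pyspark.pandas available.
    """
    import pyspark.pandas as ps  # imported lazily; needs Spark at call time

    # Hypothetical data: `id` is left untouched, so every row carries the
    # information needed to compute its own expected result.
    values = list(range(n_rows))
    psdf = ps.DataFrame({"id": values, "col_to_apply_function": values})

    # The Series returned by apply() comes from a different frame; assigning
    # it back may require this option, depending on the version.
    with ps.option_context("compute.ops_on_diff_frames", True):
        psdf["col_to_apply_function"] = psdf.apply(
            lambda row: example_func(row["col_to_apply_function"]), axis=1
        )
        result = psdf.to_pandas()

    # Compare within each row, so a mere change in row order does not count.
    return bool((result["col_to_apply_function"] != result["id"] ** 2).any())
```

If `values_mispaired()` returns False while the displayed rows still look shuffled, the values are paired correctly and only the output order differs; sorting by `id` or by the index should then restore the expected view.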

          People

            Assignee: Unassigned
            Reporter: Marcelo Rossini Castro (marcelorossini)
