Spark / SPARK-34588

Support int64 buffer lengths in Java for pyspark Pandas UDF as buffer expanding

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • 3.0.2
    • 3.1.1
    • PySpark
    • None
    • Hadoop part:

      • spark 3.0.2
      • java 1.8.0_77
      • scala 2.12.10

      Python part:

      • cython 0.29.22
      • numpy 1.19.5
      • pandas 1.1.5
      • pyarrow 2.0.0

    Description

      This issue is an extension of an Arrow issue. Its goal is to make it possible to use PySpark Pandas UDFs on data larger than 2 GB per group.

      The situation is this: Arrow supports the long (int64) type for buffer lengths when serializing data between Java and Python, but Spark does not. This causes a lot of problems when applying a Pandas UDF to a dataset in which any group is larger than 2^31 - 1 bytes, which is roughly 2 GB. Solving this would allow far more data per Pandas UDF group, up to 2^63 - 1 bytes.
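
      To make the size limits concrete, here is a minimal arithmetic sketch. It assumes buffer lengths are signed integers, as in Java; the helper name `fits_int32_length` and the example group size are hypothetical and not taken from Spark or Arrow.

```python
# Largest lengths representable by signed 32-bit and 64-bit integers.
INT32_MAX = 2**31 - 1  # 2,147,483,647 bytes, just under 2 GiB
INT64_MAX = 2**63 - 1


def fits_int32_length(nbytes: int) -> bool:
    """Return True if a buffer of `nbytes` bytes has a length that fits in a signed int32."""
    return 0 <= nbytes <= INT32_MAX


# Hypothetical group: 300 million rows of a single float64 column (8 bytes per value).
rows = 300_000_000
group_bytes = rows * 8  # 2,400,000,000 bytes

print(fits_int32_length(group_bytes))  # False: the length overflows a signed int32
print(group_bytes <= INT64_MAX)        # True: the length fits easily in a signed int64
```

      A group of this size is the kind of case the issue describes: its buffer length cannot be expressed as a 32-bit signed integer, while a 64-bit length leaves ample headroom.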

            People

              Unassigned
              Dmitry Kravchuk (dishka_krauch)
              Micah Kornfield
              Votes: 0
              Watchers: 3