[SPARK-34588] Support int64 buffer lengths in Java for pyspark Pandas UDF as buffer expanding - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 3.0.2
Fix Version/s: 3.1.1
Component/s: PySpark
Labels: None
Environment:
Hadoop part:
- spark 3.0.2
- java 1.8.0_77
- scala 2.12.10
Python part:
- cython 0.29.22
- numpy 1.19.5
- pandas 1.1.5
- pyarrow 2.0.0

External issue URL:
https://issues.apache.org/jira/browse/ARROW-10957#

Description

This issue is an extension of the linked Arrow issue. The goal is to make PySpark Pandas UDFs usable when a single data group is larger than 2 GB.

Arrow supports 64-bit (long) buffer lengths when serializing data between Java and Python, but Spark does not. This causes problems whenever a Pandas UDF is applied to a dataset in which any group is larger than 2^31 - 1 bytes (about 2 GB), the largest length a signed 32-bit integer can hold. Supporting 64-bit lengths would raise the limit per Pandas UDF group to 2^63 - 1 bytes.
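To make the size limit concrete, here is a minimal arithmetic sketch. It assumes the limit comes from a signed 32-bit buffer length on the Java side, as the issue title suggests. The helper name `fits_in_int32_length` and the example group size are hypothetical and are not part of Spark or Arrow.

```python
# Largest value a signed 32-bit length field can hold (Java int).
INT32_MAX = 2**31 - 1
# Largest value a signed 64-bit length field can hold (Java long).
INT64_MAX = 2**63 - 1


def fits_in_int32_length(num_bytes: int) -> bool:
    """Hypothetical helper: True if a buffer of num_bytes bytes can be
    described by a signed 32-bit length."""
    return 0 <= num_bytes <= INT32_MAX


# Illustrative group (an assumption): 300 million rows of a single
# 8-byte column, ignoring validity bitmaps and other overhead.
rows = 300_000_000
group_bytes = rows * 8

print(INT32_MAX)                          # 2147483647
print(fits_in_int32_length(group_bytes))  # False
print(group_bytes <= INT64_MAX)           # True
```

A group of this size is too large for a 32-bit length but fits comfortably in a 64-bit one, which is the gap this issue asks Spark to close.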

Issue Links

duplicates: SPARK-33213 Upgrade Apache Arrow to 2.0.0 (Resolved)

People

Assignee: Unassigned

Reporter: Dmitry Kravchuk

Shepherd: Micah Kornfield

Votes: 0

Watchers: 3

Dates

Created: 02/Mar/21 06:12

Updated: 12/Dec/22 18:11

Resolved: 11/Mar/21 01:19