Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version: 3.0.2
- Fix Version: None
Hadoop part:
- spark 3.0.2
- java 1.8.0_77
- scala 2.12.10
Python part:
- cython 0.29.22
- numpy 1.19.5
- pandas 1.1.5
- pyarrow 2.0.0
Description
This issue is an extension of an Arrow issue, aimed at making it possible to use PySpark Pandas UDF functions on more than 2 GB of data per group.
Here is the deal: Arrow supports a long (64-bit) type for serializing data between Java and Python, but Spark does not use it. This causes problems whenever a Pandas UDF is applied to a dataset in which any single group exceeds 2^31 - 1 bytes, i.e. roughly 2 GB, because the whole group must fit in one Arrow transfer. Solving this problem would raise the per-group limit for Pandas UDFs to 2^63 - 1 bytes.
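To make the limit concrete, here is a minimal sketch of the kind of grouped-map Pandas UDF affected by this issue. The function body and column names (`id`, `v`) are hypothetical; the Spark registration is shown in comments and assumes a running `SparkSession` named `spark`. Each group is serialized to the Python worker as a single Arrow batch, which is where the ~2 GB per-group ceiling applies.

```python
import pandas as pd

# Grouped-map Pandas UDF body: receives one whole group as a pandas
# DataFrame.  With Arrow-based serialization, Spark ships the entire
# group in one transfer, so a group larger than 2^31 - 1 bytes
# (about 2 GB) fails before this function ever runs.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical transformation: center column "v" within the group.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# Sketch of how this would be applied in PySpark (not executed here):
#
#   df = spark.createDataFrame(
#       [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))
#   result = df.groupBy("id").applyInPandas(
#       subtract_mean, schema="id long, v double")
```

Because the UDF body is plain pandas, it can be exercised locally on a small DataFrame without a Spark cluster, which is useful when reproducing or testing around this limit.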
Issue Links
- duplicates SPARK-33213 "Upgrade Apache Arrow to 2.0.0" (Resolved)