[SPARK-21657] Spark has exponential time complexity to explode(array of structs) - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels:
- cache
- caching
- collections
- nested_types
- performance
- pyspark
- sparksql
- sql

Description

It can take up to half a day to explode a modest-sized nested collection (0.5m) on recent Xeon processors.

See the attached pyspark script, which reproduces this problem.

cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name).cache()
print(cached_df.count())

This script generates a number of tables, each with the same total number of records across all nested collections (see the `scaling` variable in the loops). The `scaling` variable scales up the number of nested elements in each record and scales down the number of records in the table by the same factor, so the total number of records stays the same.
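
The arithmetic behind `scaling` can be sketched in plain Python, without Spark. This is only an illustration of the invariant described above, not the attached script: `TOTAL_ELEMENTS` is a hypothetical value and `table_shape` is a hypothetical helper.

```python
# Hypothetical total number of nested elements across a whole table.
# The value used by the attached script is not reproduced here.
TOTAL_ELEMENTS = 1000000


def table_shape(scaling, total=TOTAL_ELEMENTS):
    """Return (records, elements_per_record) for one generated table.

    Scaling up the elements per record by `scaling` scales the record
    count down by the same factor, so the product stays constant.
    """
    records = total // scaling
    return records, scaling


# One table per scaling factor; each holds the same total number of elements.
shapes = {s: table_shape(s) for s in (1, 10, 100, 1000, 10000)}
for scaling, (records, per_record) in sorted(shapes.items()):
    print(scaling, records, per_record, records * per_record)
```

Because every table holds the same number of nested elements, any difference in explode time between tables comes from how the elements are grouped into records, not from the amount of data.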

Time grows exponentially (note the log-10 vertical axis scale in the attached ExponentialTimeGrowth.PNG).

At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to explode the nested collections (!) of 8k records.

Beyond 1,000 elements per nested collection, time grows exponentially.

Attachments

- nested-data-generator-and-test.py (3 kB, 07/Aug/17 18:22, Ruslan Dautkhanov)
- ExponentialTimeGrowth.PNG (54 kB, 07/Aug/17 18:21, Ruslan Dautkhanov)

Issue Links

is related to

SPARK-4502 Spark SQL reads unneccesary nested fields from Parquet

Resolved

relates to

SPARK-16998 select($"column1", explode($"column2")) is extremely slow

Resolved

SPARK-22330 Linear containsKey operation for serialized maps.

Resolved

SPARK-15214 Implement code generation for Generate

Resolved

SPARK-22385 MapObjects should not access list element by index

Resolved

links to

[Github] Pull Request #19683 (uzadude)

People

Assignee: Ohad Raviv

Reporter: Ruslan Dautkhanov

Votes: 4

Watchers: 16

Dates

Created: 07/Aug/17 18:21

Updated: 31/Jan/18 18:57

Resolved: 29/Dec/17 13:09