[SPARK-28547] Make it work for wide data (> 10K columns) - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Incomplete
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: Spark Core
Labels:
None
Environment:

Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per node, 32 cores (tried different configurations of executors)

Description

Spark is extremely slow on wide data (more than 15k columns and more than 15k rows). Most genomic and transcriptomic data is wide, because the number of genes is usually >20k, and often the number of samples as well. The very popular GTEX dataset is a good example (see, for instance, the RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where a gct file is just a .tsv file with two comment lines at the beginning). Everything done on wide tables (even a simple "describe" applied to all the gene columns) either takes hours or freezes because of lost executors, irrespective of memory and number of cores. The same operations run well and fast (in minutes) with pure pandas, without any Spark involved.
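To make the file layout and the operation concrete, here is a minimal sketch (not the reporter's code) of a "describe"-style summary over a gct-like file, using only the Python standard library. It assumes the layout described above: two leading lines to skip, then a tab-separated header. It also assumes the first two columns are row labels. The function name describe_gct and the tiny inline sample are hypothetical and not real GTEX data.

```python
import csv
import io
import statistics


def describe_gct(handle, skip_lines=2):
    """Per-column count/mean/stdev/min/max for a gct-like TSV stream."""
    for _ in range(skip_lines):
        next(handle)  # skip the two leading non-tabular lines
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumption: the first two columns are row labels, the rest are numeric.
    value_names = header[2:]
    columns = [[] for _ in value_names]
    for row in reader:
        for values, cell in zip(columns, row[2:]):
            values.append(float(cell))
    return {
        name: {
            "count": len(values),
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
        for name, values in zip(value_names, columns)
    }


# Tiny synthetic stand-in for a wide matrix (hypothetical values).
sample = io.StringIO(
    "#1.2\n"
    "3\t2\n"
    "Name\tDescription\tS1\tS2\n"
    "g1\tgeneA\t1.0\t4.0\n"
    "g2\tgeneB\t2.0\t5.0\n"
    "g3\tgeneC\t3.0\t6.0\n"
)
summary = describe_gct(sample)
print(summary["S1"])
```

A real file would have tens of thousands of value columns instead of two; the per-column work stays the same, which is why a single-machine pass of this kind can finish in minutes.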
People

Assignee: Unassigned

Reporter: antonkulaga

Votes: 0

Watchers: 5

Dates

Created: 28/Jul/19 09:23

Updated: 12/Dec/22 18:11

Resolved: 05/Sep/19 12:43