Spark / SPARK-30792

Dataframe .limit() performance improvements


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      It seems that .limit() is much less efficient than it could be, or than one would expect, when reading a large dataset from Parquet:

      val sample = spark.read.parquet("/Some/Large/Data.parquet").limit(1000)
      // Do something with sample ...

      This might take hours, depending on the size of the data.

      By comparison,

      spark.read.parquet("/Some/Large/Data.parquet").show(1000)

      is essentially instant.
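
      A possible workaround, sketched below, is not part of the original report and is only an assumption: since show() (and take()) execute the limit as a terminal action and typically stop scanning once enough rows have been found, one can collect a small sample with take() and rebuild a DataFrame from it on the driver, instead of keeping .limit() as a lazy transformation in the plan. The application name here is a placeholder.

      // Sketch of a possible workaround (an assumption, not from the report).
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("limit-sample").getOrCreate()

      val df = spark.read.parquet("/Some/Large/Data.parquet")

      // take(n) returns Array[Row]; like show(n), it launches incremental jobs
      // and typically stops reading once 1000 rows have been found.
      val rows = df.take(1000)

      // Rebuild a small DataFrame from the driver-side sample, reusing df's schema.
      val sample = spark.createDataFrame(
        spark.sparkContext.parallelize(rows.toSeq),
        df.schema
      )

      // Do something with sample ...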



          People

            Assignee: Unassigned
            Reporter: Nathan Grand (nathanwilliamgrand@gmail.com)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: