Description
Dataset.apply calls dataset.deserializer (to provide an early error), which ends up running the full Analyzer on the deserializer expression. This can take tens of milliseconds, depending on how big the plan is.
Since Dataset.apply is invoked by many Dataset operations, such as Dataset.where, this can be a significant overhead for short queries.
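The pattern at issue can be sketched as follows. This is a simplified stand-in, not Spark's actual implementation: the class, the deserializer body, and the cost model are all illustrative.

```scala
// Simplified sketch of the eager-validation pattern described above.
// Not Spark's actual code: Dataset and deserializer here are stand-ins.
final class Dataset(val plan: Seq[String]) {
  // Stand-in for resolving the deserializer through the full Analyzer:
  // the cost grows with the size of the plan/schema.
  def deserializer: Int = plan.map(_.hashCode).sum
}

object Dataset {
  def apply(plan: Seq[String]): Dataset = {
    val ds = new Dataset(plan)
    // Eager check: surfaces encoder/schema mismatches early, but its cost
    // is paid on every operation that constructs a new Dataset.
    ds.deserializer
    ds
  }
}
```

Because each short operation (a `where`, a `select`, …) builds a new Dataset through `apply`, the eager check runs once per operation, which is where the per-call overhead comes from.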
In the following code, the duration is 17 ms on current Spark vs 1 ms if I remove the dataset.deserializer line.
It seems the resulting deserializer is particularly big in the case of a nested schema, but the same overhead can be observed with a very wide flat schema.
According to a comment in the PR that introduced this check, we can at least remove this check for DataFrames: https://github.com/apache/spark/pull/20402#discussion_r164338267
```scala
val col = "named_struct(" + (0 until 100).map { i => s"'col$i', id" }.mkString(",") + ")"
val df = spark.range(10).selectExpr(col)
val TRUE = lit(true)
val numIter = 1000
val startTime = System.nanoTime()
for (i <- 0 until numIter) {
  df.where(TRUE)
}
val durationMs = (System.nanoTime() - startTime) / numIter / 1000000
println(s"duration $durationMs")
```