Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Invalid
Description
I noticed a case where enabling result spooling makes query execution much slower:
impala-shell -B -q "set spool_query_results=1; select cast(l_shipdate as timestamp) from tpch_parquet.lineitem;" > /dev/null
Fetched 6001215 row(s) in 23.81s
impala-shell -B -q "set spool_query_results=0; select cast(l_shipdate as timestamp) from tpch_parquet.lineitem;" > /dev/null
Fetched 6001215 row(s) in 9.92s
Using the beeswax protocol leads to completely different results:
impala-shell --protocol=beeswax -B -q "set spool_query_results=1; select cast(l_shipdate as timestamp) from tpch_parquet.lineitem;" > /dev/null
Fetched 6001215 row(s) in 10.32s
impala-shell --protocol=beeswax -B -q "set spool_query_results=0; select cast(l_shipdate as timestamp) from tpch_parquet.lineitem;" > /dev/null
Fetched 6001215 row(s) in 11.87s
This anomaly seems to occur when both the client and the coordinator need significant time to process the returned rows.
Note that slow result generation from timestamps (and dates) is a known performance issue in the coordinator: most of the time is spent converting dates/timestamps to strings. On the other hand, I don't understand how enabling result spooling can slow down a query.
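To make the comparison easier to repeat, the four measurements above could be driven from a small wrapper script. This is only a sketch: the `build_cmd` helper and the `RUN` switch are hypothetical names introduced here, and the flags and query are copied from the commands in this report. By default it only prints the command lines; with `RUN=1` it executes them, which assumes `impala-shell` is on the PATH and the `tpch_parquet` data is loaded.

```shell
#!/bin/sh
# Hypothetical helper: prints (or, with RUN=1, executes) the four runs from this report.
QUERY='select cast(l_shipdate as timestamp) from tpch_parquet.lineitem;'

# Build one impala-shell command line.
# $1: extra protocol flag ("" keeps the shell's default protocol)
# $2: value for spool_query_results (1 or 0)
build_cmd() {
  printf 'impala-shell %s-B -q "set spool_query_results=%s; %s" > /dev/null\n' \
    "${1:+$1 }" "$2" "$QUERY"
}

# Same order as in the report: default protocol first, then beeswax;
# spooling enabled first, then disabled.
for proto in "" "--protocol=beeswax"; do
  for spool in 1 0; do
    cmd=$(build_cmd "$proto" "$spool")
    if [ "${RUN:-0}" = "1" ]; then
      # Rows go to /dev/null; the "Fetched ... row(s)" timing line stays visible.
      eval "$cmd"
    else
      echo "$cmd"
    fi
  done
done
```

Running each combination several times and comparing the "Fetched ... row(s) in ..." lines would help rule out run-to-run noise as the cause of the difference.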