Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-41322

Optimized query plan cost/statistics overview

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • 3.3.0
    • None
    • GraphX, SQL
    • None

    Description

      Motivation

      Spark SQL supports running the `EXPLAIN COST` statement on a query to show the optimized logical plan and its data costs per stage (i.e. statistics) https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html. However, it can currently be difficult to determine what the total data read cost will be for a complex query with many stages. Other query engines such as Trino/Presto attempt to provide a general estimate of resource costs of a query when running the `EXPLAIN` statement, which includes CPU, memory, row count, and data size https://trino.io/docs/current/optimizer/cost-in-explain.html.

      Proposal

      We suggested adding an overview/estimation of the total resources that will be used within the optimized logical plan of a Spark query, or maybe as an alternative, provide this overview/estimation when the `EXPLAIN COST` statement is called on a query. As a first version, it would already be beneficial if this general cost estimation would include anything that is available within the statistics of the optimized query plan, such as:

      • The amount of data the will be read in bytes
      • The total amount of rows 
      • etc.

      Given that the optimized logical plan is divided in stages it would already be sufficient to show these parameters per stage so they can be aggregated for the entire job later on if needed.

      Attachments

        Activity

          People

            Unassigned Unassigned
            gerbenvdhuizen Gerben van der Huizen
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated: