SPARK-25643: Performance issues querying wide rows


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

Description

Querying a small subset of rows from a wide table (e.g., a table with 6000 columns) can be quite slow under the following conditions:

    • the table has many rows, most of which will be filtered out
    • the projection includes every column of the wide table (i.e., select *)
    • predicate pushdown does not help: either matching rows are sprinkled fairly evenly throughout the table, or predicate pushdown is switched off

Even if the filter involves only a single column and the returned result includes just a few rows, the query can take much longer than an equivalent query against a similar table with fewer columns.
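For illustration, here is a minimal reproduction sketch (the path, sizes, and column names are hypothetical; the extra columns hold constants purely to make the rows wide):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, lit}

    val spark = SparkSession.builder().appName("wide-row-repro").getOrCreate()
    import spark.implicits._

    // Build a 6000-column table. Only c0 varies; the other columns exist
    // just to widen the row.
    val numCols = 6000
    val extraCols = (1 until numCols).map(i => lit(i).as(s"c$i"))
    val wide = spark.range(0, 1000000).toDF("c0").select(col("c0") +: extraCols: _*)
    wide.write.mode("overwrite").parquet("/tmp/wide_table")

    // select * with a highly selective single-column filter: the scan still
    // materializes all 6000 columns for every row it reads.
    spark.read.parquet("/tmp/wide_table").filter($"c0" === 42).show()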

According to initial profiling, it appears that most time is spent realizing the entire row in the scan, just so the filter can look at a tiny subset of columns and almost certainly throw the row away. The profile shows 74% of the time spent in FileSourceScanExec, spread across numerous writeFields_0_xxx method calls.
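The generated code containing those writeFields_0_xxx methods can be inspected with Spark's codegen debugging helper (continuing from the hypothetical table in the sketch above):

    import org.apache.spark.sql.execution.debug._

    // Prints the Java code generated for each whole-stage-codegen subtree;
    // the writeFields_0_xxx methods seen in the profile appear in this output.
    val df = spark.read.parquet("/tmp/wide_table").filter($"c0" === 42)
    df.debugCodegen()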

If Spark must realize the entire row just to check a tiny subset of columns, this all sounds reasonable. However, I wonder if there is an optimization here where we can avoid realizing the entire row until after the filter has selected the row.
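To make the suggested reordering concrete, here is a self-contained sketch that uses plain Scala arrays as stand-ins for the columnar batch and the row writer (an illustration of the idea only, not Spark's actual generated code):

    // Column 0 is the filter column; rows match when it equals 42.
    val numRows = 1000000
    val numCols = 6000
    val filterCol = Array.tabulate(numRows)(i => i % 100000)

    // Stands in for the writeFields_0_xxx calls: building the full row
    // touches every one of the 6000 columns.
    def realizeRow(rowId: Int): Array[Int] =
      Array.tabulate(numCols)(c => if (c == 0) filterCol(rowId) else rowId + c)

    // Current shape: realize every row, then filter. This pays the
    // full-row cost even for rows that are immediately discarded.
    //   (0 until numRows).map(realizeRow).filter(_(0) == 42)

    // Suggested shape: filter on the one referenced column first, then
    // realize only the survivors (10 rows here instead of 1,000,000).
    val selected = (0 until numRows).filter(i => filterCol(i) == 42).map(realizeRow)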

Attachments

Issue Links

Activity

People

    Assignee: Unassigned
    Reporter: Bruce Robbins (bersprockets)
    Votes: 2
    Watchers: 7

Dates

    Created:
    Updated: