Spark / SPARK-31338

Spark SQL JDBC Data Source partitioned read: Spark SQL does not honor a NOT NULL table definition of the partition key.


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      Our Use-case Details:

      While reading from a JDBC source using Spark SQL, we use the following read API:

      jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties).

      Table definition:
      postgres=> \d lineitem_sf1000
            Table "public.lineitem_sf1000"
           Column      |         Type          | Modifiers
      -----------------+-----------------------+-----------
       l_orderkey      | bigint                | not null
       l_partkey       | bigint                | not null
       l_suppkey       | bigint                | not null
       l_linenumber    | bigint                | not null
       l_quantity      | numeric(10,2)         | not null
       l_extendedprice | numeric(10,2)         | not null
       l_discount      | numeric(10,2)         | not null
       l_tax           | numeric(10,2)         | not null
       l_returnflag    | character varying(1)  | not null
       l_linestatus    | character varying(1)  | not null
       l_shipdate      | character varying(29) | not null
       l_commitdate    | character varying(29) | not null
       l_receiptdate   | character varying(29) | not null
       l_shipinstruct  | character varying(25) | not null
       l_shipmode      | character varying(10) | not null
       l_comment       | character varying(44) | not null
      Indexes:
          "l_order_sf1000_idx" btree (l_orderkey)

       

      Partition column: l_orderkey

      numPartitions: 16

      Problem details:

      A typical partition query generated by Spark looks like this:

       

      SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 1500000001 AND l_orderkey < 1875000001 

      Fifteen of the sixteen queries are generated with range predicates like the one above. The last query looks like this:

      SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 375000001 or l_orderkey is null 
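      For context, these WHERE clauses come from Spark's partition-boundary logic (`JDBCRelation.columnPartition`): the partition that has no lower bound is also made responsible for NULL keys, regardless of any NOT NULL constraint in the database. A minimal re-implementation sketch of that clause generation (not Spark's actual code; the bounds below assume lowerBound = 1 and upperBound = 6000000000, which reproduce the observed stride of 375,000,000):

      ```scala
      // Sketch of how Spark 2.4's JDBC partitioner builds per-partition WHERE
      // clauses. Illustrative only; names and bounds are assumptions.
      object PartitionClauses {
        def clauses(column: String, lowerBound: Long, upperBound: Long,
                    numPartitions: Int): Seq[String] = {
          // Same stride formula Spark uses: avoids overflow on large bounds.
          val stride = upperBound / numPartitions - lowerBound / numPartitions
          var current = lowerBound
          (0 until numPartitions).map { i =>
            val lBound = if (i != 0) Some(s"$column >= $current") else None
            current += stride
            val uBound =
              if (i != numPartitions - 1) Some(s"$column < $current") else None
            (lBound, uBound) match {
              case (Some(l), Some(u)) => s"$l AND $u"
              // The unbounded-below partition also absorbs NULL keys --
              // this is the clause the report is about.
              case (None, Some(u))    => s"$u or $column is null"
              case (Some(l), None)    => l
              case (None, None)       => ""
            }
          }
        }
      }
      ```

      Under those assumed bounds, `PartitionClauses.clauses("l_orderkey", 1L, 6000000000L, 16).head` produces exactly the reported clause `l_orderkey < 375000001 or l_orderkey is null`.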

      In the last query, Spark fetches the remaining records along with any rows in the table whose partition-key value is NULL.

      This hurts performance badly. While the first fifteen SQLs took approximately 10 minutes to execute, the last SQL with the NULL check takes 45 minutes, because the database has to evaluate a second scan of the table (for the OR branch) looking for NULL values of the partition key.

      Note that I have declared the partition key of the table NOT NULL in the database. The SQL for that partition therefore does not need the NULL check; Spark SQL should be able to omit it, and this JIRA is intended to fix that behavior.
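      Until Spark honors the constraint, one workaround is the `jdbc` overload that accepts explicit predicates, `jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties)`: the caller supplies one WHERE fragment per partition and can simply leave out the `is null` branch. A sketch of building such predicates (column name and bounds here are illustrative assumptions, not values Spark computes):

      ```scala
      // Build explicit, NULL-free range predicates for the predicates-based
      // jdbc() overload. Illustrative only; correctness relies on the column
      // really being NOT NULL, otherwise NULL-keyed rows would be silently dropped.
      object NotNullPredicates {
        def predicates(column: String, lowerBound: Long, upperBound: Long,
                       numPartitions: Int): Array[String] = {
          val stride = (upperBound - lowerBound) / numPartitions
          (0 until numPartitions).map { i =>
            val lo = lowerBound + i * stride
            if (i == numPartitions - 1)
              s"$column >= $lo"                              // last range stays open-ended
            else
              s"$column >= $lo AND $column < ${lo + stride}"
          }.toArray
        }
      }
      ```

      The result would then be passed as, e.g., `spark.read.jdbc(url, "public.lineitem_sf1000", NotNullPredicates.predicates("l_orderkey", 1L, 6000000001L, 16), connectionProperties)` (a hypothetical invocation, assuming those bounds).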

       

       

       

      People

        Assignee: Unassigned
        Reporter: Mohit Dave (minfa)
        Votes: 0
        Watchers: 3