Spark / SPARK-20259

Support push down join optimizations in DataFrameReader when loading from JDBC


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 1.6.2, 2.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Given two DataFrames loaded from the same JDBC connection:

      UnoptimizedJDBCJoin.scala
      val ordersDF = spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql:dbserver")
        .option("dbtable", "northwind.orders")
        .option("user", "username")
        .option("password", "password")
        .load()

      val productDF = spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql:dbserver")
        .option("dbtable", "northwind.product")
        .option("user", "username")
        .option("password", "password")
        .load()

      ordersDF.createOrReplaceTempView("orders")
      productDF.createOrReplaceTempView("product")

      // Followed by a join between them:
      val ordersByProduct = spark.sql(
        "SELECT p.name, SUM(o.qty) AS qty FROM orders AS o INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name")
      

      Catalyst should push the join down to the JDBC source, issuing a single query such as:
      SELECT northwind.product.name, SUM(northwind.orders.qty) AS qty
      FROM northwind.orders
      INNER JOIN northwind.product
        ON northwind.orders.product_id = northwind.product.product_id
      GROUP BY northwind.product.name
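
      Until such an optimization exists, the join can be pushed down manually by passing the joined query as a parenthesized subquery in the dbtable option, so the database performs the join and Spark only fetches the aggregated result. This is a sketch under the same hypothetical connection details and northwind schema as above; the subquery alias orders_by_product is an illustrative name, not from the original report.

      ```scala
      // Build the join as a subquery; the JDBC source treats a parenthesized,
      // aliased SELECT in "dbtable" as the table to read from, so the database
      // executes the join and aggregation itself.
      val pushdownQuery =
        """(SELECT p.name, SUM(o.qty) AS qty
          |   FROM northwind.orders AS o
          |   INNER JOIN northwind.product AS p ON o.product_id = p.product_id
          |   GROUP BY p.name) AS orders_by_product""".stripMargin

      // Only the joined, aggregated rows cross the JDBC connection.
      val ordersByProduct = spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql:dbserver")
        .option("dbtable", pushdownQuery)
        .option("user", "username")
        .option("password", "password")
        .load()
      ```

      The trade-off is that the subquery is opaque to Catalyst, so further Spark-side optimizations (e.g. additional filter pushdown into the subquery) do not apply.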

            People

              Assignee: Unassigned
              Reporter: John Muller (blue666man)
