Kudu / KUDU-2235

Spark SQL insert command is actually an upsert


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: spark
    • Labels: None
    • Environment: CDH 5.13

    Description

      The Spark SQL 'INSERT' command is actually doing an upsert when used on a Kudu table.

      Example:
      1) Create a table in Impala like this:

      create table test13 (k1 string, c2 string, c3 string, primary key(k1))
      partition by hash partitions 2 stored as kudu
      

      2) Try an Impala INSERT to demonstrate correct insert behavior

      > insert into test13 values ('x','x','x'),('y','y','y');
      Modified 2 row(s), 0 row error(s) in 3.70s
      
      > select * from test13;
      +----+----+-------------+
      | k1 | c2 | c3          |
      +----+----+-------------+
      | x  | x  | x           |
      | y  | y  | y           |
      +----+----+-------------+
      
      > insert into test13 values ('x','x','test insert'),('z','z','z');
      WARNINGS: Key already present in Kudu table 'impala::default.test13'.
      Modified 1 row(s), 1 row error(s) in 0.11s
      
      > select * from test13;
      +----+----+-------------+
      | k1 | c2 | c3          |
      +----+----+-------------+
      | x  | x  | x           |
      | y  | y  | y           |
      | z  | z  | z           |
      +----+----+-------------+
      

      3) Try the same sequence of operations in Spark (Scala)

      scala> val test13 = spark.read.format("org.apache.kudu.spark.kudu").option("kudu.master",kuduMaster).option("kudu.table","impala::default.test13").load
      
      scala> test13.createTempView("test13")
      
      scala> spark.sql("insert into test13 values ('a','a','a'),('c','c','c')")
      
      scala> test13.show
      +---+---+---+
      | k1| c2| c3|
      +---+---+---+
      |  a|  a|  a|
      |  c|  c|  c|
      +---+---+---+
      
      
      scala> spark.sql("insert into test13 values ('a','a','test update'),('d','d','d')")
      
      scala> test13.show
      +---+---+-----------+
      | k1| c2|         c3|
      +---+---+-----------+
      |  a|  a|test update|
      |  c|  c|          c|
      |  d|  d|          d|
      +---+---+-----------+
      

      Note that in Spark, but not in Impala, the row matching the existing key was silently changed (updated), and the row with the new key was added with no warning or row error. An insert should never modify an existing row.
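
      For what it's worth, the same behavior seems to apply when writing through the DataFrame API rather than Spark SQL. The sketch below is an assumption about the 1.5 data source defaults (the option names and the claim that "append" mode upserts are my reading of kudu-spark, not something demonstrated above):

      // Hypothetical repro via the DataFrame write path: to my understanding,
      // "append" mode goes through the same data source, which defaults to
      // upsert, so the duplicate key 'a' would again be silently updated.
      import spark.implicits._

      Seq(("a", "a", "test update"), ("d", "d", "d"))
        .toDF("k1", "c2", "c3")
        .write
        .format("org.apache.kudu.spark.kudu")
        .option("kudu.master", kuduMaster)
        .option("kudu.table", "impala::default.test13")
        .mode("append")
        .save()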

      'UPSERT' isn't actually a valid Spark SQL command, so this upsert behavior can't even be requested explicitly; it only shows up, unannounced, behind INSERT.

      This important difference between Impala SQL and Spark SQL with respect to Kudu is confusing.
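
      As a workaround, the kudu-spark KuduContext API does make the operation explicit: insertRows gives true insert semantics, while upsertRows is a knowing upsert. A minimal sketch, assuming the 1.5-era KuduContext API (the two-argument constructor and the exact duplicate-key error reporting are my understanding, not shown above):

      import org.apache.kudu.spark.kudu.KuduContext
      import spark.implicits._

      // KuduContext exposes the per-operation control that the SQL path hides.
      val kuduContext = new KuduContext(kuduMaster, spark.sparkContext)

      val newRows = Seq(("a", "a", "test update"), ("d", "d", "d"))
        .toDF("k1", "c2", "c3")

      // True insert: the existing key 'a' should surface as a row error
      // instead of silently updating the row.
      kuduContext.insertRows(newRows, "impala::default.test13")

      // Explicit upsert: the same behavior Spark SQL's INSERT exhibits above,
      // but at least the name says what it does.
      kuduContext.upsertRows(newRows, "impala::default.test13")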

People

    Assignee: Unassigned
    Reporter: Diana Carroll (dcarroll@cloudera.com)

Dates

    Created:
    Updated:
