Kudu / KUDU-2235

Spark SQL insert command is actually an upsert


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: spark
    • Labels: None
    • Environment: CDH 5.13

    Description

      The Spark SQL 'INSERT' command is actually doing an upsert when used on a Kudu table.

      Example:
      1) Create a table in Impala like this:

      create table test13 (k1 string, c2 string, c3 string, primary key(k1))
      partition by hash partitions 2 stored as kudu
      

      2) Try an Impala INSERT to demonstrate correct insert behavior

      > insert into test13 values ('x','x','x'),('y','y','y');
      Modified 2 row(s), 0 row error(s) in 3.70s
      
      > select * from test13;
      +----+----+-------------+
      | k1 | c2 | c3          |
      +----+----+-------------+
      | x  | x  | x           |
      | y  | y  | y           |
      +----+----+-------------+
      
      > insert into test13 values ('x','x','test insert'),('z','z','z');
      WARNINGS: Key already present in Kudu table 'impala::default.test13'.
      Modified 1 row(s), 1 row error(s) in 0.11s
      
      > select * from test13;
      +----+----+-------------+
      | k1 | c2 | c3          |
      +----+----+-------------+
      | x  | x  | x           |
      | y  | y  | y           |
      | z  | z  | z           |
      +----+----+-------------+
      

      3) Try the same sequence of operations in Spark (Scala)

      scala> val test13 = spark.read.format("org.apache.kudu.spark.kudu").option("kudu.master",kuduMaster).option("kudu.table","impala::default.test13").load
      
      scala> test13.createTempView("test13")
      
      scala> spark.sql("insert into test13 values ('a','a','a'),('c','c','c')")
      
      scala> test13.show
      +---+---+---+
      | k1| c2| c3|
      +---+---+---+
      |  a|  a|  a|
      |  c|  c|  c|
      +---+---+---+
      
      
      scala> spark.sql("insert into test13 values ('a','a','test update'),('d','d','d')")
      
      scala> test13.show
      +---+---+-----------+
      | k1| c2|         c3|
      +---+---+-----------+
      |  a|  a|test update|
      |  c|  c|          c|
      |  d|  d|          d|
      +---+---+-----------+
      

      Note that in Spark, but not in Impala, the row matching the existing key was silently changed (updated), and the row with the new key was added with no warning or row error. An insert should never modify an existing row.
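
      For what it's worth, the same behavior seems to apply when writing through the DataFrame API rather than Spark SQL. The sketch below is an assumption about the 1.5 data source defaults (the option names and the claim that "append" mode upserts are my reading of kudu-spark, not something demonstrated above):

      // Hypothetical repro via the DataFrame write path: to my understanding,
      // "append" mode goes through the same data source, which defaults to
      // upsert, so the duplicate key 'a' would again be silently updated.
      import spark.implicits._

      Seq(("a", "a", "test update"), ("d", "d", "d"))
        .toDF("k1", "c2", "c3")
        .write
        .format("org.apache.kudu.spark.kudu")
        .option("kudu.master", kuduMaster)
        .option("kudu.table", "impala::default.test13")
        .mode("append")
        .save()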

      'UPSERT' isn't actually a valid Spark SQL command, so this upsert behavior can't even be requested explicitly; it only shows up, unannounced, behind INSERT.

      This important difference between Impala SQL and Spark SQL with respect to Kudu is confusing.
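
      As a workaround, the kudu-spark KuduContext API does make the operation explicit: insertRows gives true insert semantics, while upsertRows is a knowing upsert. A minimal sketch, assuming the 1.5-era KuduContext API (the two-argument constructor and the exact duplicate-key error reporting are my understanding, not shown above):

      import org.apache.kudu.spark.kudu.KuduContext
      import spark.implicits._

      // KuduContext exposes the per-operation control that the SQL path hides.
      val kuduContext = new KuduContext(kuduMaster, spark.sparkContext)

      val newRows = Seq(("a", "a", "test update"), ("d", "d", "d"))
        .toDF("k1", "c2", "c3")

      // True insert: the existing key 'a' should surface as a row error
      // instead of silently updating the row.
      kuduContext.insertRows(newRows, "impala::default.test13")

      // Explicit upsert: the same behavior Spark SQL's INSERT exhibits above,
      // but at least the name says what it does.
      kuduContext.upsertRows(newRows, "impala::default.test13")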

People

    Assignee: Unassigned
    Reporter: Diana Carroll (dcarroll@cloudera.com)

Dates

    Created:
    Updated:
