SPARK-42694

Data duplication and loss occur after executing 'insert overwrite...' in Spark 3.1.1


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.1.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Environment: Spark 3.1.1, Hadoop 3.2.1, Hive 3.1.2

    Description

      We are currently running Spark 3.1.1 in our production environment. We have noticed that, after executing 'insert overwrite ... select', the resulting data is occasionally inconsistent: some rows are duplicated and others are lost. The issue is intermittent and seems more prevalent on large tables with tens of millions of records.
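
      For illustration, the statements follow the pattern sketched below (table and column names are hypothetical, not our real schema), run from spark-shell against Spark 3.1.1:

        // Hypothetical reproduction of the statement pattern that intermittently misbehaves.
        spark.sql("""
          INSERT OVERWRITE TABLE db.target_table PARTITION (dt = '2023-03-07')
          SELECT id, name, amount
          FROM db.source_table
          WHERE dt = '2023-03-07'
        """)

        // On a bad run the row counts diverge even though the job reports success.
        val src = spark.table("db.source_table").where("dt = '2023-03-07'").count()
        val dst = spark.table("db.target_table").where("dt = '2023-03-07'").count()
        assert(src == dst, s"row count mismatch: source=$src, target=$dst")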

      We compared the execution plans of two runs of the same SQL and found them identical. In the run that produced correct results, the amount of data written during the shuffle stage matched the amount read. In the run where the problem occurred, the shuffle write and read amounts differed. Please see the attached screenshots of the shuffle write/read metrics.
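
      A stage-completion listener along the following lines can log the same shuffle numbers per stage, so that good and bad runs can be compared without screenshots (a sketch only, assuming a spark-shell session):

        import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

        // Log aggregated shuffle write/read bytes as each stage completes.
        spark.sparkContext.addSparkListener(new SparkListener {
          override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
            // taskMetrics can be null for stages that never ran a task.
            Option(event.stageInfo.taskMetrics).foreach { m =>
              println(s"stage=${event.stageInfo.stageId} " +
                s"shuffleWriteBytes=${m.shuffleWriteMetrics.bytesWritten} " +
                s"shuffleReadBytes=${m.shuffleReadMetrics.totalBytesRead}")
            }
          }
        })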
       
      Normal SQL: (see attached screenshot)

      SQL with issues: (see attached screenshot)

      Is this problem caused by a bug in version 3.1.1, specifically SPARK-34534 ('New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness'), or by something else? What could be the root cause?
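
      If SPARK-34534 is indeed the cause, our reading of that ticket (an assumption on our part, please correct us if wrong) suggests either upgrading to a release where it is marked fixed, or falling back to the pre-3.0 shuffle fetch protocol so that the new FetchShuffleBlocks path is not exercised, e.g.:

        import org.apache.spark.sql.SparkSession

        // Assumed mitigation, not a confirmed fix: revert to the old shuffle
        // fetch protocol. spark.shuffle.useOldFetchProtocol must be set at
        // application submit/startup time; it cannot be changed at runtime.
        val spark = SparkSession.builder()
          .appName("insert-overwrite-job")
          .config("spark.shuffle.useOldFetchProtocol", "true")
          .enableHiveSupport()
          .getOrCreate()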

      Attachments

        1. image-2023-03-07-15-59-27-665.png (207 kB, FengZhou)
        2. image-2023-03-07-15-59-08-818.png (228 kB, FengZhou)


          People

            Assignee: Unassigned
            Reporter: FengZhou (bigdata_feng)
            Votes: 0
            Watchers: 8
