We have been running a Spark Structured Streaming job in production for more than a week now. Put simply, it reads data from source Kafka topics (with 4 partitions) and writes to another Kafka topic. Everything had been running fine until the job started failing with the following error:
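For context, here is a simplified sketch of the job; the broker address, topic names, and checkpoint path are placeholders, not the real production values:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object KafkaToKafkaJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-kafka").getOrCreate()

    // Source: Kafka topic with 4 partitions
    val source = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "source-topic")              // placeholder
      .load()

    // Sink: another Kafka topic; the Kafka sink expects key/value columns
    val query = source
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")                    // placeholder
      .option("topic", "sink-topic")                                       // placeholder
      .option("checkpointLocation", "hdfs:///checkpoints/kafka-to-kafka")  // placeholder
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

    query.awaitTermination()
  }
}
```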
Looking at the Spark Structured Streaming query progress logs, it seems the endOffsets computed for the next batch were actually smaller than the starting offsets:
Micro-batch trigger 1:
Next micro-batch trigger:
Notice that for partition 3 of the Kafka topic, the endOffsets are actually smaller than the starting offsets!
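For reference, the offset numbers above come from the per-source query progress events; a small listener along these lines surfaces them on every trigger (the listener itself is a sketch I am including for context, not the exact production code):

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Prints each micro-batch's per-source start/end offsets.
// startOffset/endOffset are JSON strings mapping topic -> partition -> offset.
class OffsetLogger extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    event.progress.sources.foreach { s =>
      println(s"batch=${event.progress.batchId} start=${s.startOffset} end=${s.endOffset}")
    }
  }
}

// Registered once at startup:
// spark.streams.addListener(new OffsetLogger)
```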
I checked the HDFS checkpoint directory, and the checkpointed offsets look fine; they point to the last committed offsets.
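For completeness, this is roughly how I inspected them; Spark keeps one JSON offsets file per batch, named by batch id, under offsets/ in the checkpoint directory (the path below is the placeholder from the sketch above):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Find and print the most recent offsets file in the checkpoint directory
val spark = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val offsetsDir = new Path("hdfs:///checkpoints/kafka-to-kafka/offsets") // placeholder path
val latest = fs.listStatus(offsetsDir).map(_.getPath).maxBy(_.getName.toLong)
println(scala.io.Source.fromInputStream(fs.open(latest)).mkString)
```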
Why is the end offset for a partition being computed to a value smaller than the starting offset?