Beam / BEAM-4803

Beam Spark runner not working properly with Kafka

Details

    • Type: Bug
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Affects Versions: 2.4.0, 2.5.0
    • Fix Versions: None
    • Components: io-java-kafka, runner-spark
    • Labels: None

    Description

      We are running a Beam stream-processing job on the Spark runner, which reads from a Kafka topic using Kerberos authentication. We are using io-java-kafka v2.4.0 to read from the Kafka topic in the pipeline. The issue is that the KafkaIO client continuously creates a new Kafka consumer with the specified config, performing a Kerberos login every time. In addition, Spark streaming jobs are spawned for the unbounded source roughly every second, even when there is no data in the Kafka topic. The log shows these jobs:

      INFO SparkContext: Starting job: DStream@SparkUnboundedSource.java:172

      We can also see in the logs:

      INFO MicrobatchSource: No cached reader found for split: [org.apache.beam.sdk.io.kafka.KafkaUnboundedSource@2919a728]. Creating new reader at checkpoint mark...

      A new consumer is then created with a fresh Kerberos login each time, which is causing problems.

      We are unsure what the correct behavior should be here, and why so many Spark streaming jobs are being created. We tried the same Beam code with the Flink runner and did not see this issue there. Can someone point to the correct settings for using an unbounded Kafka source with the Spark runner in Beam?
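
      For reference, a minimal sketch of the read pipeline described above. This is a hypothetical reconstruction, not the reporter's actual code: the broker address, topic name, and Kerberos property values are placeholders, and the surrounding class is invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaSparkPipeline {
  public static void main(String[] args) {
    // Run with the Spark runner, e.g. --runner=SparkRunner on the command line.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Kerberos/SASL consumer settings; values here are placeholders.
    Map<String, Object> consumerConfig = new HashMap<>();
    consumerConfig.put("security.protocol", "SASL_PLAINTEXT");
    consumerConfig.put("sasl.kerberos.service.name", "kafka");

    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092")   // placeholder broker
        .withTopic("my-topic")                 // placeholder topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(consumerConfig)
        .withoutMetadata());

    p.run().waitUntilFinish();
  }
}
```

      Note that the frequency of the per-microbatch jobs is governed by the Spark runner's batch interval (`SparkPipelineOptions#setBatchIntervalMillis`); whether tuning it avoids the repeated consumer creation and Kerberos logins is exactly what this issue asks.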


    People

      Assignee: Unassigned
      Reporter: Vivek Agarwal (vivek_17)
      Votes: 1
      Watchers: 6
