  1. Spark
  2. SPARK-6714

Add a KafkaUtils.createDirectStream overload that accepts a messageHandler without requiring the offsets to be specified

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: DStreams

    Description

      Currently, in the Scala API, KafkaUtils.createDirectStream has two overloads: an "easy mode" where only the Kafka parameters and topics are specified, and a "hard mode" where we specify the starting offsets and a messageHandler for manipulating the MessageAndMetadata values obtained from Kafka. I think an intermediate method that handles the offsets automatically but still lets you specify the messageHandler would be very useful. For example, the triple (topic, partition, offset) uniquely identifies each message, so it could serve as a primary key for idempotent insertions into an external database. Also, in an implementation of a lambda architecture, offsets could be used to trace which part of a Kafka topic has been covered by the batch view and which part by the real-time / live view. In both cases, I think that having access to the meta information through a messageHandler, while keeping the offsets managed, would be valuable.
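
To illustrate the idempotent-insertion use case, here is a minimal sketch in plain Scala. It is not Spark code: `Record` is a hypothetical stand-in for Kafka's MessageAndMetadata, and `IdempotentStore` is a toy in-memory substitute for an external database, both introduced only for this example. It shows that keying writes by (topic, partition, offset) makes a replayed batch harmless.

```scala
object IdempotentInsertSketch {
  // Hypothetical stand-in for kafka.message.MessageAndMetadata, reduced to the
  // fields this sketch needs so that it runs without Kafka on the classpath.
  final case class Record(topic: String, partition: Int, offset: Long, value: String)

  // (topic, partition, offset) uniquely identifies a message.
  type MessageId = (String, Int, Long)

  // The kind of function one would pass as a messageHandler:
  // keep the metadata alongside the payload.
  def messageHandler(r: Record): (MessageId, String) =
    ((r.topic, r.partition, r.offset), r.value)

  // Toy "external database": an upsert keyed by MessageId (assumption for illustration).
  final class IdempotentStore {
    private val rows = scala.collection.mutable.Map.empty[MessageId, String]
    def upsert(id: MessageId, value: String): Unit = rows.update(id, value)
    def get(id: MessageId): Option[String] = rows.get(id)
    def size: Int = rows.size
  }

  def main(args: Array[String]): Unit = {
    val batch = Seq(
      Record("events", 0, 41L, "a"),
      Record("events", 0, 42L, "b"),
      Record("events", 1, 41L, "c"))
    val store = new IdempotentStore

    // Process the same batch twice, as would happen if a failed batch were replayed.
    for (_ <- 1 to 2; r <- batch) {
      val (id, value) = messageHandler(r)
      store.upsert(id, value)
    }

    println(store.size) // prints 3: the replay did not create duplicates
  }
}
```

Note that offset 41 appears in both partition 0 and partition 1, which is why the partition has to be part of the key.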

      A proposed implementation is available at https://github.com/juanrh/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala. It adds a new overload of KafkaUtils.createDirectStream for the Scala API and another for the Java API.
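
For comparison, this is roughly how the existing "hard mode" overload is called in Spark 1.3.0 when the metadata is wanted. It is a sketch only: the broker address, the topic name "events", the single partition, and the starting offset 0 are placeholder assumptions, and running it requires the spark-streaming-kafka artifact and a reachable Kafka broker. The fromOffsets map is the argument that the proposed overload would no longer require.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamWithMetadata {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamWithMetadata").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder broker address (assumption for this sketch).
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

    // With the current API, starting offsets must be listed by hand for every
    // partition. Topic name, partition and offset here are placeholders.
    val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)

    // Keep (topic, partition, offset) next to the decoded message value.
    val handler = (mmd: MessageAndMetadata[String, String]) =>
      (mmd.topic, mmd.partition, mmd.offset, mmd.message())

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, Int, Long, String)](
      ssc, kafkaParams, fromOffsets, handler)

    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```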

          People

            Assignee: Unassigned
            Reporter: Juan Rodríguez Hortalá (juanrh)
            Votes: 0
            Watchers: 2
