Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-47793

Implement SimpleDataSourceStreamReader for python streaming data source

    XMLWordPrintableJSON

Details

    Description

       SimpleDataSourceStreamReader is a simplified version of the DataStreamReader interface.

      1. It doesn’t require developers to reason about data partitioning.
      2. It doesn’t require getting the latest offset before reading data.

      There are 3 functions that needs to be defined 

      1. Read data and return the end offset.

      def read(self, start: Offset) -> (Iterator[Tuple], Offset)

      2. Read data between start and end offset, this is required for exactly once read.

      def read2(self, start: Offset, end: Offset) -> Iterator[Tuple]

      3. initial start offset of the streaming query.

      def initialOffset() -> dict

      Implementation: Wrap the SimpleDataSourceStreamReader instance in a DataSourceStreamReader internally and make the prefetching and caching transparent to the data source developer. The record prefetched in python process will be sent to JVM as arrow record batches.

      Attachments

        Issue Links

          Activity

            People

              Chaoqin Chaoqin Li
              Chaoqin Chaoqin Li
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: