  1. Spark
  2. SPARK-28190

Data Source - State

Details

    • Type: Umbrella
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels: None

    Description

      "State" is becoming one of the most important pieces of data in most streaming frameworks, since it is what lets a query produce continuous results. In other words, a query may no longer be valid once its state is corrupted or lost.

      Ideally we could rerun the query from the beginning of the data to construct brand-new state for the current query, but in reality this may not be possible for many reasons: the input data source may have a retention limit, rerunning from the start may waste a lot of resources, etc.

      There are other cases in which end users want to work with state directly, such as creating initial state from existing data via a batch query (given that a batch query could be far more efficient and faster).

      I'd like to propose a new data source that handles "state" in batch queries, enabling reads and writes on state.

      Allowing state reads brings a couple of benefits:

      • You can analyze the state from "outside" of your streaming query
      • It can be useful when something can be derived from the existing state of an existing query. Note that state is not designed to be shared among multiple queries.
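
      As a rough illustration of what analyzing state from "outside" a query could look like, the sketch below models a state store as per-partition key/value maps and flattens them into rows that ordinary batch code can aggregate. This is a toy model only: `StateRow`, `read_state`, and the sample data are hypothetical names made up for this example and are not the API of the proposed data source.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class StateRow:
    """One state entry exposed as a plain row (hypothetical shape)."""
    partition_id: int
    key: str
    value: int


def read_state(partitions: List[Dict[str, int]]) -> Iterator[StateRow]:
    """Flatten per-partition key/value state into rows for batch analysis."""
    for partition_id, store in enumerate(partitions):
        for key, value in sorted(store.items()):
            yield StateRow(partition_id, key, value)


# Toy state of a running count, split over two partitions (assumed data).
state = [{"a": 3, "c": 1}, {"b": 7}]
rows = list(read_state(state))

# Batch-style analysis: how many state rows does each partition hold?
rows_per_partition: Dict[int, int] = {}
for row in rows:
    rows_per_partition[row.partition_id] = rows_per_partition.get(row.partition_id, 0) + 1

# Which key currently carries the largest value?
largest = max(rows, key=lambda r: r.value)
```

      Exposing state as rows like this is what would let skew or unexpectedly large keys be inspected without touching the running query.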

      Allowing state (re)writes brings several major benefits:

      • State can be repartitioned physically
      • The schema of the state can be changed, which means you don't need to rerun the query from the start when the query has to change
      • You can remove state rows if you want, e.g. to reduce state size or to remove corrupt rows
      • You can bootstrap state for a new query from existing data efficiently, without running the streaming query from the starting point
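
      The rewrite cases above can be sketched with the same kind of toy model: state is a list of per-partition key/value maps, and a rewrite reassigns every surviving row to the partition its key hashes to under the new partition count. The names `partition_for`, `rewrite_state`, and `bootstrap_counts` are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash partitioning the engine actually uses; none of this is the proposed data source's API.

```python
import zlib
from collections import Counter
from typing import Callable, Dict, Iterable, List

State = List[Dict[str, int]]


def partition_for(key: str, num_partitions: int) -> int:
    """Assign a key to a partition (crc32 is an assumed stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rewrite_state(
    old: State,
    num_partitions: int,
    keep: Callable[[str, int], bool] = lambda key, value: True,
) -> State:
    """Repartition state and drop rows rejected by `keep`."""
    new: State = [{} for _ in range(num_partitions)]
    for store in old:
        for key, value in store.items():
            if keep(key, value):
                new[partition_for(key, num_partitions)][key] = value
    return new


def bootstrap_counts(events: Iterable[str], num_partitions: int) -> State:
    """Build initial count state from historical events in one batch pass."""
    return rewrite_state([dict(Counter(events))], num_partitions)


# Repartition 2 -> 4 partitions and remove a corrupt (negative) count.
old_state = [{"a": 3, "c": -1}, {"b": 7}]
new_state = rewrite_state(old_state, 4, keep=lambda key, value: value >= 0)

# Bootstrap state for a new query from existing data.
initial_state = bootstrap_counts(["x", "y", "x"], 2)
```

      In this model a schema change would simply be another transformation applied to each value during the rewrite pass.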

      By the way, I'm planning to contribute my own work (https://github.com/HeartSaVioR/spark-state-tools), so many of the sub-issues should not require much effort to submit patches for. I'll try to apply the new DSv2, which could be a major effort while preparing to donate the code.

          People

            Assignee: Unassigned
            Reporter: Jungtaek Lim (kabhwan)