[FLINK-34982] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: Runtime / Checkpointing, Runtime / State Backends
Labels:
None

Description

This is a sub-FLIP for the disaggregated state management and its related work, please read the FLIP-423 first to know the whole story.

As outlined in FLIP-423 [1] and FLIP-427 [2], we proposed to disaggregate StateManagement and introduced a disaggregated state storage named ForSt, which evolves from RocksDB. Within the new framework, where the primary storage is placed on the remote file system, several challenges emerge when attempting to reuse the existing fault-tolerance mechanisms of local RocksDB:

Because most remote file system don't support hard-link, ForSt can't utilize hard-link to capture a consistent snapshot during checkpoint synchronous phase as rocksdb currently does.
The existing file transfer mechanism within RocksDB is inefficient during checkpoints; it involves first downloading the remote working state data to local memory and then uploading it to the checkpoint directory. Likewise, both restore and rescale face the similar problems due to superfluous data transmission.

In order to solve the above problems and improve checkpoint/restore/rescaling performance of disaggregated storage, this FLIP proposes:

A new checkpoint strategy for disaggregated state storage: leverage RocksDB's low-level api to retain a consistent snapshot during the checkpoint synchronous phase; and then transfer the snapshot files to checkpoint directory during asynchronous phase;
Accelerating checkpoint/restore/rescaling by leverage fast-duplication of remote file system, which can bypass the local TaskManager when transferring data between remote working directory and checkpoint directory.

Attachments

Issue Links

is a child of

FLINK-34984 FLIP-423: Disaggregated State Storage and Management (Umbrella FLIP)

Open

mentioned in: Page Loading...

Activity

People

Assignee:: Unassigned

Reporter:: Jinzhong Li

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 01/Apr/24 08:03

Updated:: 01/Apr/24 11:52