RATIS-2430. Write snapshot to temporary path until finish#1372
Open
spacemonkd wants to merge 2 commits into
Contributor
Author
@szetszwo could you take a look at this change?
szetszwo reviewed Jun 11, 2026
Contributor
@spacemonkd, thanks for working on this! Please see the inline comments and also https://issues.apache.org/jira/secure/attachment/13082753/1372_review.patch
(Sorry that I missed this PR.)
void finalizeSnapshot(InstallSnapshotRequestProto request) throws IOException {
  final StateMachine sm = server.getStateMachine();
  sm.pause(); // pause the SM right before publishing the snapshot atomically
  // TODO: if there is a failure here, we need to rollback the snapshot installation.
Contributor
What do you mean by "rollback the snapshot installation"?
Comment on lines +217 to 230
final int expectedChunkIndex = nextChunkIndex.get();
if (expectedChunkIndex != snapshotChunkRequest.getRequestIndex()) {
  throw new IOException("Unexpected request chunk index: " + snapshotChunkRequest.getRequestIndex()
      + " (the expected index is " + expectedChunkIndex + ")");
}
// Append chunks to a temporary location first. Publish only when done=true.
state.appendSnapshot(request);
nextChunkIndex.incrementAndGet();
// update the committed index
// re-load the state machine if this is the last chunk
if (snapshotChunkRequest.getDone()) {
  state.finalizeSnapshot(request);
  state.reloadStateMachine(lastIncluded);
  chunk0CallId.set(-1);
Contributor
Let's move snapshotManager to SnapshotInstallationHandler. It is only used here.
final int expectedChunkIndex = nextChunkIndex.get();
if (expectedChunkIndex != snapshotChunkRequest.getRequestIndex()) {
  throw new IOException("Unexpected request chunk index: " + snapshotChunkRequest.getRequestIndex()
      + " (the expected index is " + expectedChunkIndex + ")");
}
// Append chunks to a temporary location first. Publish only when done=true.
final StateMachine stateMachine = server.getStateMachine();
snapshotManager.appendSnapshot(request, stateMachine);
nextChunkIndex.incrementAndGet();
// update the committed index
// re-load the state machine if this is the last chunk
if (snapshotChunkRequest.getDone()) {
  stateMachine.pause(); // pause the SM right before publishing the snapshot atomically
  snapshotManager.finalizeSnapshot(request);
  state.reloadStateMachine(lastIncluded);
  chunk0CallId.set(-1);
}
Comment on lines +175 to 178
if (!snapshotChunkRequest.getDone()) {
  throw new IOException("Cannot finalize incomplete snapshot request: "
      + ServerStringUtils.toInstallSnapshotRequestString(request));
}
Contributor
Let's use Preconditions:
Preconditions.assertTrue(snapshotChunkRequest.getDone());
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

public class SnapshotManagerTest {
Contributor
To be consistent with the other tests,
- rename it to "TestSnapshotManager"
- move it to ratis-test.
import static org.mockito.Mockito.when;

public class SnapshotManagerTest {
  private static final class TestRaftStorageDirectory implements RaftStorageDirectory {
Contributor
Let's use RaftStorageDirectoryImpl.
What changes were proposed in this pull request?
Today, when SnapshotInstallationHandler#checkAndInstallSnapshot calls state.installSnapshot(request), it pauses the state machine via ServerState.installSnapshot(). However, if a later check or an IO operation fails, there is no clear rollback option, and followers can be left in a partially installed state.
One way to mitigate this is to have appendChunk write to a temporary file without pausing the StateMachine. Once all chunks are written, we can atomically apply the snapshot and reload the state machine.
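The temp-file-then-publish idea described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in this patch: the class `TempSnapshotWriter`, its method names, and the `.tmp` suffix are all hypothetical, and it uses only `java.nio.file` rather than any Ratis API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper (not a Ratis class): stages chunks in a temp file, publishes on completion. */
final class TempSnapshotWriter {
  private final Path tmpFile;
  private final Path finalFile;

  TempSnapshotWriter(Path dir, String name) {
    this.finalFile = dir.resolve(name);
    // Assumption: the temp file lives in the same directory so the final move stays on one filesystem.
    this.tmpFile = dir.resolve(name + ".tmp");
  }

  /** Append one chunk to the temp file; the final path is not touched. */
  void appendChunk(byte[] data) throws IOException {
    Files.write(tmpFile, data, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  /** Publish the staged file with a single atomic rename. */
  void publish() throws IOException {
    Files.move(tmpFile, finalFile, StandardCopyOption.ATOMIC_MOVE);
  }

  /** Discard a partially written snapshot, leaving the final path as it was. */
  void abort() throws IOException {
    Files.deleteIfExists(tmpFile);
  }

  Path getFinalFile() {
    return finalFile;
  }
}
```

With this shape, a failure before `publish()` only requires deleting the temp file, so the pause of the state machine can be deferred until just before the rename.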
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/RATIS-2430
How was this patch tested?
Patch was tested using unit tests.