Skip to content

RATIS-2430. Write snapshot to temporary path until finish#1372

Open
spacemonkd wants to merge 3 commits into
apache:masterfrom
spacemonkd:RATIS-2430
Open

RATIS-2430. Write snapshot to temporary path until finish#1372
spacemonkd wants to merge 3 commits into
apache:masterfrom
spacemonkd:RATIS-2430

Conversation

@spacemonkd

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Today when SnapshotInstallatioHandler#checkAndInstallSnapshot calls state.installSnapshot(request) - it pauses the state machine via ServerState.installSnapshot().

However this means that in case later checks fail or IO fails or any such scenario occurs, then there is no clear rollback option. Followers in this scenario can be left in a partial installation state.

One way to mitigate this is in appendChunk we can write to a temp file without pausing StateMachine. When this is done we can atomically apply the snapshot and reload the statemachine log.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-2430

How was this patch tested?

Patch was tested using unit tests.

@spacemonkd spacemonkd marked this pull request as ready for review March 10, 2026 16:47
@spacemonkd

Copy link
Copy Markdown
Contributor Author

@szetszwo could you take a look at this change?

@szetszwo szetszwo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spacemonkd , thanks for working on this! Please see the comments inlined and also https://issues.apache.org/jira/secure/attachment/13082753/1372_review.patch

(Sorry that I missed this PR.)

void finalizeSnapshot(InstallSnapshotRequestProto request) throws IOException {
final StateMachine sm = server.getStateMachine();
sm.pause(); // pause the SM right before publishing the snapshot atomically
// TODO: if there is a failure here, we need to rollback the snapshot installation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "rollback the snapshot installation" ?

Comment on lines +217 to 230
final int expectedChunkIndex = nextChunkIndex.get();
if (expectedChunkIndex != snapshotChunkRequest.getRequestIndex()) {
throw new IOException("Unexpected request chunk index: " + snapshotChunkRequest.getRequestIndex()
+ " (the expected index is " + expectedChunkIndex + ")");
}
// Append chunks to a temporary location first. Publish only when done=true.
state.appendSnapshot(request);
nextChunkIndex.incrementAndGet();
// update the committed index
// re-load the state machine if this is the last chunk
if (snapshotChunkRequest.getDone()) {
state.finalizeSnapshot(request);
state.reloadStateMachine(lastIncluded);
chunk0CallId.set(-1);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move snapshotManager to SnapshotInstallationHandler. It is only used here.

        final int expectedChunkIndex = nextChunkIndex.get();
        if (expectedChunkIndex != snapshotChunkRequest.getRequestIndex()) {
          throw new IOException("Unexpected request chunk index: " + snapshotChunkRequest.getRequestIndex()
              + " (the expected index is " + expectedChunkIndex + ")");
        }
        // Append chunks to a temporary location first. Publish only when done=true.
        final StateMachine stateMachine = server.getStateMachine();
        snapshotManager.appendSnapshot(request, stateMachine);
        nextChunkIndex.incrementAndGet();
        // update the committed index
        // re-load the state machine if this is the last chunk
        if (snapshotChunkRequest.getDone()) {
          stateMachine.pause(); // pause the SM right before publishing the snapshot atomically
          snapshotManager.finalizeSnapshot(request);

          state.reloadStateMachine(lastIncluded);
          chunk0CallId.set(-1);
        }

Comment on lines 175 to 178
if (!snapshotChunkRequest.getDone()) {
throw new IOException("Cannot finalize incomplete snapshot request: "
+ ServerStringUtils.toInstallSnapshotRequestString(request));
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use Preconditions:

    Preconditions.assertTrue(snapshotChunkRequest.getDone());

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

public class SnapshotManagerTest {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with the other tests,

  • rename it to "TestSnapshotManager"
  • move it to ratis-test.

import static org.mockito.Mockito.when;

public class SnapshotManagerTest {
private static final class TestRaftStorageDirectory implements RaftStorageDirectory {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use RaftStorageDirectoryImpl.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants