Two-Phase Commit for Cross-Partition Writes

Why

Currently, mutations that write to tables on different partitions are rejected:

Write to table 'projects' rejected: owned by partition-1, not this node (partition-0).

Most mutations touch 1-3 related tables on the same partition. But some operations span partitions — e.g., creating a project and logging an audit entry, where projects is on partition 1 and audit_log is on partition 0.

How Others Solve This

Vitess (our model)

VTGate detects multi-shard transactions on commit. Sends "prepare" to each shard. Each shard saves a redo log and keeps the transaction open. First shard becomes "Metadata Manager" (MM) storing the commit decision. If all prepares succeed, MM commits. If any fails, MM rolls back. A watcher service recovers stuck transactions.

Source: Vitess Distributed Transactions

TiDB (optimization we already use)

When a transaction touches only one region, skip 2PC entirely — just commit directly. This is the 1PC optimization. We already do this: single-partition commits go through the normal fast path.

Source: TiDB 1PC

CockroachDB (more complex than needed)

Parallel Commits protocol with write intents as provisional values + transaction records. Can respond to client before all intents are resolved. We don't need this complexity because our partitions are coarser (table-level, not range-level).

Source: CockroachDB Parallel Commits

Architecture

Client mutation writes to tables on partitions 0 and 1:

   CommitterClient._commit()
         │
         │  Detects cross-partition writes
         ▼
   TwoPhaseCoordinator
         │
    ┌────┴────┐
    ▼         ▼
 Prepare    Prepare
 Node A     Node B (via gRPC)
    │         │
    │  OCC    │  OCC
    │  check  │  check
    │         │
    ▼         ▼
  Both OK? ──Yes──► Record COMMITTED in NATS KV
    │                     │
    │              ┌──────┴──────┐
    │              ▼             ▼
    │         CommitPrepared  CommitPrepared
    │         Node A         Node B
    │              │             │
    No             ▼             ▼
    │         publish_commit  publish_commit
    ▼         (visible)      (visible)
  Rollback
  all participants

Design

Transaction Classification (Fast Path)

In CommitterClient._commit(), before sending to the Committer, inspect the transaction's write set:

Single partition: Normal commit path. No protocol change. This is TiDB's 1PC optimization.
Cross-partition: Enter 2PC path via TwoPhaseCoordinator.

Prepare Phase

New CommitterMessage::Prepare variant. The coordinator splits writes by partition and sends each partition's subset to its Committer.

Each Committer's handle_prepare():

OCC conflict check against local write log
Assign prepare timestamp via next_commit_ts() (TSO)
Compute writes (index updates, document writes)
Push to PendingWrites (blocks conflicting transactions)
Persist redo log entry for crash recovery
Return PrepareResult with the prepare timestamp

The prepare does NOT call publish_commit — writes are staged but not visible.

Commit Decision

After all partitions prepare successfully:

Coordinator records COMMITTED in NATS KV (atomic, the point of no return)
Sends CommitPrepared to all participants
Each participant writes to persistence, calls publish_commit, deletes redo log

If any partition fails to prepare:

Coordinator records ROLLED_BACK in NATS KV
Sends RollbackPrepared to participants that did prepare
Each participant removes from PendingWrites, deletes redo log

Recovery (Transaction Watcher)

Background task on every node, scanning NATS KV for unresolved 2PC transactions:

COMMITTED but not fully resolved: Coordinator crashed after decision. Watcher sends CommitPrepared to participants with redo logs.
No decision after timeout: Coordinator crashed before deciding. Watcher records ROLLED_BACK and sends rollback.
ROLLED_BACK but not fully resolved: Watcher sends RollbackPrepared to remaining participants.

Node-to-Node Communication

Extend the existing gRPC service with 2PC RPCs:

service TwoPhaseCommit {
    rpc Prepare(PrepareRequest) returns (PrepareResponse);
    rpc CommitPrepared(CommitPreparedRequest) returns (CommitPreparedResponse);
    rpc RollbackPrepared(RollbackPreparedRequest) returns (RollbackPreparedResponse);
}

Nodes need a config mapping: partition_id -> gRPC address.

Implementation Order

Core types: TwoPhaseTransactionId, TwoPhaseState, PrepareResult, TwoPhaseRedoEntry
CommitterMessage variants: Prepare, CommitPrepared, RollbackPrepared
Committer handlers: handle_prepare(), handle_commit_prepared(), handle_rollback_prepared()
CommitterClient extensions: prepare(), commit_prepared(), rollback_prepared()
gRPC service: Extend replication.proto, implement server + client
Coordinator: TwoPhaseCoordinator called from CommitterClient._commit()
Transaction watcher: Background recovery task
Config: NODE_ADDRESSES env var (partition_id=host:port mapping)
Tests: Cross-partition mutation test, coordinator failure recovery test

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Two-Phase Commit for Cross-Partition Writes

Why

How Others Solve This

Vitess (our model)

TiDB (optimization we already use)

CockroachDB (more complex than needed)

Architecture

Design

Transaction Classification (Fast Path)

Prepare Phase

Commit Decision

Recovery (Transaction Watcher)

Node-to-Node Communication

Implementation Order

Sources

FilesExpand file tree

two-phase-commit.md

Latest commit

History

two-phase-commit.md

File metadata and controls

Two-Phase Commit for Cross-Partition Writes

Why

How Others Solve This

Vitess (our model)

TiDB (optimization we already use)

CockroachDB (more complex than needed)

Architecture

Design

Transaction Classification (Fast Path)

Prepare Phase

Commit Decision

Recovery (Transaction Watcher)

Node-to-Node Communication

Implementation Order

Sources