Currently, mutations that write to tables on different partitions are rejected:
Write to table 'projects' rejected: owned by partition-1, not this node (partition-0).
Most mutations touch 1-3 related tables on the same partition. But some operations span partitions — e.g., creating a project and logging an audit entry, where projects is on partition 1 and audit_log is on partition 0.
VTGate detects a multi-shard transaction at commit time and sends "prepare" to each shard. Each shard saves a redo log and keeps the transaction open. The first shard becomes the "Metadata Manager" (MM), which stores the commit decision. If all prepares succeed, the MM commits; if any fails, the MM rolls back. A watcher service recovers stuck transactions.
Source: Vitess Distributed Transactions
When a transaction touches only one region, skip 2PC entirely — just commit directly. This is the 1PC optimization. We already do this: single-partition commits go through the normal fast path.
Source: TiDB 1PC
The Parallel Commits protocol uses write intents as provisional values, plus transaction records, and can respond to the client before all intents are resolved. We don't need this complexity because our partitions are coarser (table-level, not range-level).
Source: CockroachDB Parallel Commits
Client mutation writes to tables on partitions 0 and 1:
CommitterClient._commit()
        │
        │  Detects cross-partition writes
        ▼
TwoPhaseCoordinator
        │
   ┌────┴────┐
   ▼         ▼
Prepare   Prepare
Node A    Node B   (via gRPC)
   │         │
   │ OCC     │ OCC
   │ check   │ check
   │         │
   ▼         ▼
Both OK? ──Yes──► Record COMMITTED in NATS KV
   │                       │
   │                ┌──────┴──────┐
   │                ▼             ▼
   │         CommitPrepared  CommitPrepared
   │             Node A          Node B
   │                │             │
   No               ▼             ▼
   │         publish_commit  publish_commit
   ▼           (visible)       (visible)
Rollback
all participants
In CommitterClient._commit(), before sending to the Committer, inspect the transaction's write set:
- Single partition: Normal commit path. No protocol change. This is TiDB's 1PC optimization.
- Cross-partition: Enter 2PC path via TwoPhaseCoordinator.
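The routing decision above can be sketched as a small classification step over the write set. This is a minimal illustration, not the actual `CommitterClient` code: `CommitPath`, `classify_write_set`, and the table-to-partition map are hypothetical names, and partition ids are assumed to be `u32`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Which commit path a transaction should take (hypothetical type).
#[derive(Debug, PartialEq)]
enum CommitPath {
    /// Every write lands on one partition: normal fast path, no 2PC.
    SinglePartition(u32),
    /// Writes span several partitions: hand off to the 2PC coordinator.
    CrossPartition(BTreeSet<u32>),
}

/// Classify a write set by looking up the owning partition of each written table.
/// Returns an error for an empty write set or a table with no known owner.
fn classify_write_set(
    written_tables: &[&str],
    table_owner: &BTreeMap<&str, u32>,
) -> Result<CommitPath, String> {
    let mut partitions = BTreeSet::new();
    for &table in written_tables {
        let owner = table_owner
            .get(table)
            .ok_or_else(|| format!("no owning partition known for table '{table}'"))?;
        partitions.insert(*owner);
    }
    match partitions.len() {
        0 => Err("empty write set".to_string()),
        1 => Ok(CommitPath::SinglePartition(*partitions.iter().next().unwrap())),
        _ => Ok(CommitPath::CrossPartition(partitions)),
    }
}
```

With `projects` on partition 1 and `audit_log` on partition 0, a write set touching both classifies as `CrossPartition`, while a write set touching only `projects` stays on the single-partition path.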
New CommitterMessage::Prepare variant. The coordinator splits writes by partition and sends each partition's subset to its Committer.
Each Committer's handle_prepare():
- OCC conflict check against local write log
- Assign prepare timestamp via next_commit_ts() (TSO)
- Compute writes (index updates, document writes)
- Push to PendingWrites (blocks conflicting transactions)
- Persist redo log entry for crash recovery
- Return PrepareResult with the prepare timestamp
The prepare does NOT call publish_commit — writes are staged but not visible.
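As a rough sketch of the prepare steps listed above, the following in-memory model stages writes without publishing them. It makes several simplifying assumptions that are not in the design: keys and values are plain strings, the TSO is a local counter, the redo log is a `HashMap` rather than durable storage, and a conflicting prepare is rejected instead of blocked. All field names are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum PrepareError {
    /// The key was committed by another transaction after this one started.
    Conflict(String),
    /// The key is already staged by another prepared transaction.
    Blocked(String),
}

#[derive(Debug, PartialEq)]
struct PrepareResult {
    prepare_ts: u64,
}

#[derive(Default)]
struct Committer {
    next_ts: u64,                                   // stand-in for the TSO
    write_log: HashMap<String, u64>,                // key -> last committed timestamp
    pending_writes: HashMap<String, u64>,           // key -> id of the txn staging it
    redo_log: HashMap<u64, Vec<(String, String)>>,  // txn id -> staged writes
    visible: HashMap<String, String>,               // what readers can see
}

impl Committer {
    fn next_commit_ts(&mut self) -> u64 {
        self.next_ts += 1;
        self.next_ts
    }

    fn handle_prepare(
        &mut self,
        txn_id: u64,
        start_ts: u64,
        writes: Vec<(String, String)>,
    ) -> Result<PrepareResult, PrepareError> {
        // OCC check against the local write log and the staged writes.
        for (key, _) in &writes {
            if self.write_log.get(key).map_or(false, |&ts| ts > start_ts) {
                return Err(PrepareError::Conflict(key.clone()));
            }
            if self.pending_writes.contains_key(key) {
                return Err(PrepareError::Blocked(key.clone()));
            }
        }
        // Assign the prepare timestamp.
        let prepare_ts = self.next_commit_ts();
        // Stage the keys so that conflicting transactions cannot prepare.
        for (key, _) in &writes {
            self.pending_writes.insert(key.clone(), txn_id);
        }
        // Record the redo entry (durable in a real system, in memory here).
        self.redo_log.insert(txn_id, writes);
        // Note: `visible` is untouched, so the writes are staged but not readable.
        Ok(PrepareResult { prepare_ts })
    }
}
```

After a successful `handle_prepare`, the transaction has a redo entry and holds its keys in `pending_writes`, but `visible` is unchanged until a later commit step publishes the writes.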
After all partitions prepare successfully:
- Coordinator records COMMITTED in NATS KV (atomic, the point of no return)
- Sends CommitPrepared to all participants
- Each participant writes to persistence, calls publish_commit, deletes redo log
If any partition fails to prepare:
- Coordinator records ROLLED_BACK in NATS KV
- Sends RollbackPrepared to participants that did prepare
- Each participant removes from PendingWrites, deletes redo log
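The commit and rollback branches can be sketched together as one coordinator function. This is an illustrative model under stated assumptions: the `Participant` trait stands in for the gRPC calls, a `HashMap` stands in for NATS KV, prepares run sequentially rather than in parallel, and `InMemoryParticipant` is a test double that only records which calls it received. None of these names come from the actual codebase.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Decision {
    Committed,
    RolledBack,
}

/// Hypothetical participant interface; in the design these are gRPC calls.
trait Participant {
    fn prepare(&mut self, txn_id: u64) -> Result<u64, String>;
    fn commit_prepared(&mut self, txn_id: u64);
    fn rollback_prepared(&mut self, txn_id: u64);
}

fn run_two_phase_commit<P: Participant>(
    txn_id: u64,
    participants: &mut [P],
    decisions: &mut HashMap<u64, Decision>, // stand-in for NATS KV
) -> Decision {
    // Phase 1: prepare, remembering which participants succeeded.
    let mut prepared = Vec::new();
    let mut all_ok = true;
    for (idx, p) in participants.iter_mut().enumerate() {
        match p.prepare(txn_id) {
            Ok(_) => prepared.push(idx),
            Err(_) => {
                all_ok = false;
                break;
            }
        }
    }
    // Record the decision before telling anyone: this is the point of no return.
    let decision = if all_ok { Decision::Committed } else { Decision::RolledBack };
    decisions.insert(txn_id, decision);
    // Phase 2: commit everywhere, or roll back only those that did prepare.
    match decision {
        Decision::Committed => {
            for p in participants.iter_mut() {
                p.commit_prepared(txn_id);
            }
        }
        Decision::RolledBack => {
            for idx in prepared {
                participants[idx].rollback_prepared(txn_id);
            }
        }
    }
    decision
}

/// Test double that records the calls it receives.
struct InMemoryParticipant {
    fail_prepare: bool,
    calls: Vec<&'static str>,
}

impl Participant for InMemoryParticipant {
    fn prepare(&mut self, _txn_id: u64) -> Result<u64, String> {
        self.calls.push("prepare");
        if self.fail_prepare { Err("OCC conflict".to_string()) } else { Ok(1) }
    }
    fn commit_prepared(&mut self, _txn_id: u64) {
        self.calls.push("commit_prepared");
    }
    fn rollback_prepared(&mut self, _txn_id: u64) {
        self.calls.push("rollback_prepared");
    }
}
```

The ordering is the important part of the sketch: the decision is written before any `commit_prepared` or `rollback_prepared` call goes out, so a recovery process can always finish what the coordinator started.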
Background task on every node, scanning NATS KV for unresolved 2PC transactions:
- COMMITTED but not fully resolved: Coordinator crashed after decision. Watcher sends CommitPrepared to participants with redo logs.
- No decision after timeout: Coordinator crashed before deciding. Watcher records ROLLED_BACK and sends rollback.
- ROLLED_BACK but not fully resolved: Watcher sends RollbackPrepared to remaining participants.
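The three recovery cases above reduce to a pure decision function, which is convenient to test in isolation. A minimal sketch, assuming the watcher can read the recorded decision (if any), whether every participant has resolved, and the age of the transaction; `WatcherAction` and `watcher_action` are hypothetical names.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Decision {
    Committed,
    RolledBack,
}

#[derive(Debug, PartialEq)]
enum WatcherAction {
    /// No decision yet and the timeout has not elapsed: leave it to the coordinator.
    Wait,
    /// COMMITTED was recorded but some participants still hold redo logs.
    ResendCommitPrepared,
    /// No decision after the timeout: record ROLLED_BACK, then send rollback.
    RecordRolledBackThenRollback,
    /// ROLLED_BACK was recorded but some participants have not rolled back.
    ResendRollbackPrepared,
    /// Decision recorded and every participant resolved: nothing to do.
    Done,
}

fn watcher_action(
    decision: Option<Decision>,
    fully_resolved: bool,
    age: Duration,
    timeout: Duration,
) -> WatcherAction {
    match (decision, fully_resolved) {
        (Some(_), true) => WatcherAction::Done,
        (Some(Decision::Committed), false) => WatcherAction::ResendCommitPrepared,
        (Some(Decision::RolledBack), false) => WatcherAction::ResendRollbackPrepared,
        (None, _) if age >= timeout => WatcherAction::RecordRolledBackThenRollback,
        (None, _) => WatcherAction::Wait,
    }
}
```

One caveat for the timeout case: because the watcher runs on every node, recording ROLLED_BACK would need to be a create-if-absent style write, so that a slow coordinator and a watcher cannot both record a decision for the same transaction. The sketch does not model that.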
Extend the existing gRPC service with 2PC RPCs:
service TwoPhaseCommit {
  rpc Prepare(PrepareRequest) returns (PrepareResponse);
  rpc CommitPrepared(CommitPreparedRequest) returns (CommitPreparedResponse);
  rpc RollbackPrepared(RollbackPreparedRequest) returns (RollbackPreparedResponse);
}
Nodes need a config mapping: partition_id -> gRPC address.
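One way to load that mapping is to parse it from a single string. The sketch below assumes entries of the form `partition_id=host:port` separated by commas; the comma separator and the `parse_node_addresses` helper are assumptions for illustration, not something the design specifies.

```rust
use std::collections::BTreeMap;

/// Parse "0=host-a:50051,1=host-b:50051" into partition_id -> address.
/// The comma-separated format is an assumption for this sketch.
fn parse_node_addresses(raw: &str) -> Result<BTreeMap<u32, String>, String> {
    let mut map = BTreeMap::new();
    for entry in raw.split(',').map(str::trim).filter(|e| !e.is_empty()) {
        let (id_str, addr) = entry
            .split_once('=')
            .ok_or_else(|| format!("expected partition_id=host:port, got '{entry}'"))?;
        let id = id_str
            .trim()
            .parse::<u32>()
            .map_err(|e| format!("bad partition id '{id_str}': {e}"))?;
        let addr = addr.trim();
        if addr.is_empty() {
            return Err(format!("missing address for partition {id}"));
        }
        // Reject duplicates rather than silently keeping the last entry.
        if map.insert(id, addr.to_string()).is_some() {
            return Err(format!("partition {id} listed more than once"));
        }
    }
    Ok(map)
}
```

A node could call this at startup on the value returned by `std::env::var`, and fail fast if a partition it needs to reach is absent from the map.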
- Core types: TwoPhaseTransactionId, TwoPhaseState, PrepareResult, TwoPhaseRedoEntry
- CommitterMessage variants: Prepare, CommitPrepared, RollbackPrepared
- Committer handlers: handle_prepare(), handle_commit_prepared(), handle_rollback_prepared()
- CommitterClient extensions: prepare(), commit_prepared(), rollback_prepared()
- gRPC service: Extend replication.proto, implement server + client
- Coordinator: TwoPhaseCoordinator called from CommitterClient._commit()
- Transaction watcher: Background recovery task
- Config: NODE_ADDRESSES env var (partition_id=host:port mapping)
- Tests: Cross-partition mutation test, coordinator failure recovery test