Horizontal Scaling for Convex

The first horizontal scaling implementation for the Convex open-source backend — reads, writes, and automatic failover.

Convex is a reactive database: real-time subscriptions, in-memory snapshots, OCC with automatic retry, TypeScript function execution. No distributed database — CockroachDB, TiDB, Vitess, YugabyteDB, or Spanner — has all of these. We made it scale horizontally without losing any of them.

The Problem

Convex is a stateful system where the backend process holds the live database in memory. Two instances don't share state — they diverge. The 6 architectural bottlenecks that prevent scaling: single Committer, in-memory SnapshotManager, in-memory WriteLog, single-process subscriptions, single-node OCC, and no distributed consensus. Convex's docs explicitly state self-hosted is single-node by design.

The Solution

We took the best engineering from five distributed databases and combined them:

Problem	Pattern	Source
Global timestamp ordering	Batch TSO via NATS KV — zero network calls in hot path	TiDB PD
Table ownership	VSchema-style partition map — each node owns specific tables	Vitess
Single writer per partition	Committer apply loop — one thread, one channel, serial processing	etcd, TiKV, Kafka
Cross-partition writes	2PC coordinator with prepare/commit/rollback	Vitess
Table number conflicts	Descriptor ID reassignment on receiving node	CockroachDB
Async commit ordering	All writes through single FuturesOrdered pipeline	CockroachDB Raft, TiKV apply worker
Replica timestamp isolation	No TSO or system clock on apply — monotonic counters only	CockroachDB closed timestamp, TiDB resolved-ts
Delta replication	NATS JetStream with durable consumers and self-delta skip	All five systems
System table classification	Classify by what data describes, not which table stores it	CockroachDB system ranges
Automatic leader failover	tikv/raft-rs consensus per partition — sub-second leader election	TiKV, etcd
Raft transport	gRPC with batched messages and exponential backoff retry	TiKV RaftClient
Leadership lifecycle	Committer starts on election, stops on demotion via SoftState	TiKV, CockroachDB
Deployment state replication	GLOBAL table locality — `_modules`, `_udf_config`, `_source_packages` replicate to all nodes	CockroachDB GLOBAL tables, YugabyteDB system catalog
Persistent Raft log	raft-engine — append-only WAL with ~1x write amplification, crash-safe atomic batches	TiKV raft-engine (replaced RocksDB in v6.1)

The combination — real-time subscriptions + in-memory OCC + partitioned multi-writer + delta replication + 2PC — doesn't exist in any of those systems. CockroachDB doesn't have subscriptions. TiDB doesn't have in-memory snapshots. Vitess doesn't have OCC. We kept Convex's unique architecture and grafted distributed database patterns onto it.

Full details: docs/what-we-built.md

Issue journal for regressions, fixes, and validation history: docs/issue-journal.md

Results

Raft Failover Tests

ALL 10 TESTS PASSED

Test 1: All 3 Nodes Healthy
  PASS  Node A (port 3210) healthy
  PASS  Node B (port 3220) healthy
  PASS  Node C (port 3230) healthy

Test 2: Write to Leader
  PASS  Write to Node A succeeded

Test 3: Read from All Nodes
  PASS  Node A sees data: 1 messages
  PASS  All 3 nodes agree: 1 messages

Test 4: Kill Leader, Verify Failover
  PASS  Failover: writes accepted on http://127.0.0.1:3220 after leader kill
  PASS  Data written after failover: B=2 C=2 (pre-kill=1)

Test 5: Restart Killed Node, Verify Rejoin
  PASS  Node A recovered: sees 2 messages (>=1)
  PASS  All nodes converged after rejoin: 2 messages

Based on CockroachDB roachtest failover/non-system/crash, TiKV fail-rs chaos testing, and YugabyteDB Jepsen nightly resilience benchmarks.

Write Scaling Tests

ALL 77 TESTS PASSED — 3,823 messages | 3,069 tasks | 1,390 sustained writes/node

 1. Cross-partition data verification     (Vitess VDiff)         — PASS
 2. Bank invariant — single table         (CockroachDB Jepsen)   — PASS
 3. Bank invariant — multi-table          (TiDB bank-multitable) — PASS
 4. Partition enforcement (5 subtests)    (Vitess Single mode)   — PASS
 5. Concurrent write scaling              (CockroachDB KV)       — PASS 176 writes/sec
 6. Monotonic reads                       (TiDB monotonic)       — PASS
 7. Node restart recovery                 (TiDB kill -9)         — PASS
 8. Idempotent re-run                     (CockroachDB workload) — PASS
 9. Two-phase commit cross-partition      (Vitess 2PC)           — PASS
10. Rapid-fire writes 50/node             (Jepsen stress)        — PASS
11. Write-then-immediate-read             (stale read detection) — PASS
12. Double node restart                   (CockroachDB nemesis)  — PASS
13. Post-chaos invariant check            (workload check)       — PASS
14. Sequential ordering                   (Jepsen sequential)    — PASS
15. Set completeness (100 elements)       (Jepsen set)           — PASS
16. Concurrent counter                    (Jepsen counter)       — PASS
17. Write-then-cross-node-read            (cross-node stale)     — PASS
18. Interleaved cross-partition reads     (read skew detection)  — PASS
19. Large batch write (50 docs)           (atomicity)            — PASS
20. Full cluster restart                  (CockroachDB nemesis)  — PASS
21. Sustained writes 30 seconds           (endurance)            — PASS
22. Duplicate insert idempotency          (correctness)          — PASS
23. Mid-suite exhaustive invariant check  (workload check)       — PASS
24. Single-key register                   (CockroachDB register) — PASS
25. Disjoint record ordering              (CockroachDB comments) — PASS
26. NATS partition simulation             (Chaos Mesh)           — PASS
27. Write during deploy                   (deploy safety)        — PASS
28. Empty table cross-node query          (boundary)             — PASS
29. Max batch size 200 docs               (boundary)             — PASS
30. Null and empty field values           (boundary)             — PASS
31. Concurrent writes from both nodes     (race condition)       — PASS
32. Rapid deploy cycle 3x                 (deploy stability)     — PASS
33. Read during active replication        (consistency)          — PASS
34. TSO monotonicity after restart        (TiDB TSO)             — PASS
35. Single document read-modify-write     (register)             — PASS
36. Write skew detection                  (G2 anomaly)           — PASS
37. Ultimate final invariant check        (workload check)       — PASS

Every test pattern comes from a real bug found by Jepsen in a production database.

Architecture

Raft Consensus (Automatic Failover)

Partition 0 — 3-node Raft group (tikv/raft-rs):

  Node A (leader)  ────Raft────▶ Node B (follower)
        │          ────Raft────▶ Node C (follower)
        │
        ▼
   Committer active         Committers dormant
   Accepts writes           Reject writes (redirect)
        │
        ├── NATS delta ──▶ Node B applies replica delta
        └── NATS delta ──▶ Node C applies replica delta

  Node A dies:
    Node B elected leader (~1s) → Committer activates → accepts writes
    Node C remains follower → applies deltas from Node B
    Node A restarts → rejoins as follower → converges via NATS

Each partition is a 3-node Raft group. The leader runs the Committer. If the leader dies, followers elect a new leader within ~1 second and the Committer activates automatically. Zero manual intervention, zero data loss. Deployment state (_modules, _udf_config, _source_packages) replicates to all nodes via the CockroachDB GLOBAL table locality pattern so every node can serve queries.

Write Scaling (Partitioned Multi-Writer)

                    ┌──────────────────────────────┐
                    │   Global Timestamp Oracle     │
                    │   (NATS KV, TiDB PD pattern)  │
                    └──────────┬───────────────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
        ┌─────┴──────┐        │        ┌───────┴─────┐
        │  Node A    │        │        │   Node B    │
        │  partition 0│        │        │  partition 1│
        │  messages  │        │        │  projects   │
        │  users     │        │        │  tasks      │
        └─────┬──────┘        │        └───────┬─────┘
              │        ┌──────┴──────┐         │
              └───────▶│    NATS     │◀────────┘
                       │  JetStream  │
                       └──────┬──────┘
                              │
              ┌───────────────┼───────────────┐
              ▼                               ▼
         Node A sees                     Node B sees
         all data                        all data

Each node owns specific tables and is the single Committer for those tables. Both nodes consume all NATS deltas for a complete read view. Writes to non-owned tables are rejected with partition routing info. Cross-partition writes go through 2PC.

Read Scaling (Primary-Replica)

Client ──mutation──▶ Primary ──persist──▶ PostgreSQL
                        │
                        ├──publish delta──▶ NATS JetStream
                        │                        │
Client ──query──▶ Replica ◀──consume delta───────┘

One Primary handles writes, multiple Replicas serve reads. Replicas remap TabletIds from the Primary's namespace to their own using the tablet_id_to_table_name mapping in each CommitDelta.

Quick Start

One Docker Compose file, two profiles — same as CockroachDB, etcd, and YugabyteDB. Raft consensus is always on (single-node is a Raft group of 1).

Run

cd self-hosted/docker

# 1 node (dev/test)
docker compose --profile single up

# 6 nodes — 2 partitions × 3 Raft nodes (read + write scaling + HA)
docker compose --profile cluster up

Images are published to ghcr.io/martinkalema/convex-horizontal-scaling — no local build needed.

Test

cd self-hosted/docker

./test.sh              # All tests (87 tests — scaling + failover)
./test.sh scaling      # Write scaling only (77 tests)
./test.sh failover     # Raft failover only (10 tests)

Formal Jepsen Harness

This repository contains the backend fork, Docker cluster topology, and the Jepsen-inspired shell and integration tests under self-hosted/docker/.

The dedicated formal Jepsen harness is under development and lives in a separate repository:

GitHub: https://github.com/MartinKalema/convex-jepsen-tests
Local working copy: /Users/martin/Desktop/convex-jepsen-tests

That separate repository is where the Clojure Jepsen suite, reusable workloads, and fault injectors will evolve. This repository remains the home for the Convex backend implementation itself and its integration/chaos test scripts.

Deploy Functions

docker compose --profile cluster exec node-p0a ./generate_admin_key.sh
npx convex deploy --url http://127.0.0.1:3210 --admin-key <KEY>

Resource Requirements

Per-node requirements, following the format used by CockroachDB, YugabyteDB, and etcd.

	Dev/Test	Production
CPU	2 vCPUs	4+ vCPUs
RAM	2 GB	8+ GB
Storage	10 GB (HDD ok)	50+ GB SSD
Network	100 Mbps	1 Gbps

Total cluster resources:

Profile	Nodes	Min CPU	Min RAM	Min Storage
`single`	1 backend + postgres + NATS	4 vCPUs	4 GB	20 GB
`cluster`	6 backends + postgres + NATS	16 vCPUs	16 GB	80 GB

Observed idle memory usage per backend node: ~20 MB. Under load with in-memory snapshots: scales with dataset size (similar to CockroachDB's range cache).

Ports

Service	Single	Cluster
Partition 0 node A API	3210	3210
Partition 0 node B API	—	3220
Partition 0 node C API	—	3230
Partition 1 node A API	—	3310
Partition 1 node B API	—	3320
Partition 1 node C API	—	3330
gRPC (Raft + 2PC)	50051	50051–50063
Dashboard	6791	6791
PostgreSQL	5433	5433
NATS	4222	4222

Key Components

Component	File	What it does
RaftNode	`raft_node.rs`	Raft loop: tick, receive, propose, process Ready, advance. Leadership callbacks via SoftState
RaftPartitionManager	`raft_partition.rs`	Wraps RaftNode, activates/deactivates Committer on leader election/demotion
RaftTransport	`raft_transport.rs`	gRPC transport with batched messages, exponential backoff retry (TiKV RaftClient pattern)
RaftStorage	`raft_storage.rs`	Persistent Raft log backed by TiKV raft-engine — entries + hard state survive restarts
RaftStateMachine	`raft_state_machine.rs`	Serialization format for Raft log entries, bridges committed entries to Committer
CommitDelta	`commit_delta.rs`	Captures everything changed in a transaction — documents, indexes, table mappings
NatsDistributedLog	`nats_distributed_log.rs`	Publishes/subscribes deltas via NATS JetStream with per-partition subjects and self-delta skip
apply_replica_delta	`committer.rs`	Classifies updates as GLOBAL or node-local, creates tables with reassigned numbers, applies through Raft-pattern pipeline
BatchTimestampOracle	`timestamp_oracle.rs`	Reserves timestamp ranges from NATS KV via atomic CAS. Zero network calls in hot path
PartitionMap	`partition.rs`	Table-to-partition assignment. System tables always on partition 0
TwoPhaseCoordinator	`two_phase_coordinator.rs`	Detects cross-partition writes, orchestrates prepare/commit/rollback
TwoPhaseCommitService	`two_phase_service.rs`	gRPC Prepare/CommitPrepared/RollbackPrepared for remote partitions
Transaction Watcher	`two_phase_watcher.rs`	Scans NATS KV for stuck 2PC transactions, resolves via commit or rollback

Tests

346 unit tests + 87 integration tests across 42 categories.

Raft Failover (10 tests, 5 categories)

Leader election, write to leader, read from all nodes, kill leader + verify failover, restart killed node + verify rejoin. Based on CockroachDB roachtest failover/non-system/crash, TiKV fail-rs, and YugabyteDB Jepsen resilience benchmarks.

Write Scaling (77 tests, 37 categories)

Every test pattern from CockroachDB's 7 nightly Jepsen workloads (bank, register, sequential, set, monotonic, G2, comments), TiDB's Jepsen suite (bank-multitable, monotonic, stale read), YugabyteDB's Jepsen tests (counter, linearizable set), Vitess (VDiff, partition enforcement, 2PC), CockroachDB roachtest (KV scaling, nemesis, workload check), Chaos Mesh (NATS partition), Elle anomaly classes (read skew, write skew), and boundary testing (empty tables, null fields, 200-doc batch).

Full test details: docs/write-scaling-tests.md

Documentation

Document	Contents
Production Deployment	Kubernetes (GKE, EKS), bare VMs (EC2), load balancing, persistent storage, peer discovery
Raft Integration	tikv/raft-rs integration design — Raft loop, storage, transport, state machine, leader lifecycle
What We Built	What we took from each distributed database and what's new
Write Scaling Research	Vitess, TiDB, CockroachDB comparison
Two-Phase Commit Design	2PC architecture with Vitess/TiDB/CockroachDB patterns
TSO Timestamp Fix	Three timestamp ordering fixes with distributed database research
Write Scaling Tests	All 37 test categories with 77 assertions
Engineering Changes	Every file changed, every architectural decision
Architecture Analysis	The 6 bottlenecks in the original codebase
Convex Internals	How the Committer, SnapshotManager, WriteLog, Subscriptions, and OCC work
Implementation Plan	Primary-Replica architecture design
Full Scaling Proposal	Complete partitioned-write architecture proposal
Scalability Research	Community research with 25+ source URLs

Configuration

Variable	Description	Example
`RAFT_NODE_ID`	This node's Raft ID (1-based)	`1`
`RAFT_PEERS`	All Raft peers with gRPC addresses	`1=http://node-a:50051,2=http://node-b:50051,3=http://node-c:50051`
`PARTITION_ID`	This node's partition number	`0`
`PARTITION_MAP`	Table-to-partition assignment	`messages=0,users=0,projects=1,tasks=1`
`NUM_PARTITIONS`	Total partitions in cluster	`2`
`NATS_URL`	NATS JetStream connection	`nats://nats:4222`
`NODE_ADDRESSES`	gRPC addresses for 2PC	`0=node-a:50051,1=node-b:50051`
`INSTANCE_NAME`	Unique node identifier	`convex-node-a`
`REPLICATION_MODE`	Node role	`primary`

License

The original Convex backend is licensed under FSL-1.1-Apache-2.0. Our modifications follow the same license.

Name		Name	Last commit message	Last commit date
Latest commit History 141 Commits
.cargo		.cargo
.config		.config
crates		crates
demo		demo
docs		docs
npm-packages		npm-packages
scripts		scripts
self-hosted		self-hosted
.dockerignore		.dockerignore
.gitignore		.gitignore
.nvmrc		.nvmrc
.nvmrc.22		.nvmrc.22
.prettierignore		.prettierignore
.prettierrc.js		.prettierrc.js
BUILD.md		BUILD.md
CONTRIBUTING.md		CONTRIBUTING.md
CONVEX-README.md		CONVEX-README.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Justfile		Justfile
LICENSE.md		LICENSE.md
README.md		README.md
dprint.json		dprint.json
rust-toolchain		rust-toolchain
rustfmt.toml		rustfmt.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Horizontal Scaling for Convex

The Problem

The Solution

Results

Raft Failover Tests

Write Scaling Tests

Architecture

Raft Consensus (Automatic Failover)

Write Scaling (Partitioned Multi-Writer)

Read Scaling (Primary-Replica)

Quick Start

Run

Test

Formal Jepsen Harness

Deploy Functions

Resource Requirements

Ports

Key Components

Tests

Raft Failover (10 tests, 5 categories)

Write Scaling (77 tests, 37 categories)

Documentation

Configuration

License

About

Uh oh!

Releases 2

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Horizontal Scaling for Convex

The Problem

The Solution

Results

Raft Failover Tests

Write Scaling Tests

Architecture

Raft Consensus (Automatic Failover)

Write Scaling (Partitioned Multi-Writer)

Read Scaling (Primary-Replica)

Quick Start

Run

Test

Formal Jepsen Harness

Deploy Functions

Resource Requirements

Ports

Key Components

Tests

Raft Failover (10 tests, 5 categories)

Write Scaling (77 tests, 37 categories)

Documentation

Configuration

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 2

Contributors

Uh oh!

Languages