diff --git a/docs/database-schema/database-schema-overview.md b/docs/database-schema/database-schema-overview.md index 7dfefca..9282070 100644 --- a/docs/database-schema/database-schema-overview.md +++ b/docs/database-schema/database-schema-overview.md @@ -9,10 +9,12 @@ ## Revision History -| Date | Sections | Driver | Summary | -| ---------- | -------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| 2026-06-11 | §3.2 §3.0, Schema source-of-truth refs | [Task 0060](../../lore/1-tasks/active/0060_FEATURE_prices-clickhouse-crate-combined-backfill-sizing/README.md) | **Schema implemented as the `packages/prices-clickhouse` crate** (`schema/init.sql` = 12 tables, source of truth; `rollups.sql` = refreshable-MV chain; `preroll.sql` = full-range re-aggregate). Built + applied on a local ClickHouse 25.6 and validated by a combined SDEX + soroban (oracle) backfill. **Sizing finding:** measured ~3.6 KB/ledger over a 10k-ledger sample (≈48× the prior 74 B/ledger task-0046 estimate), driven by ~4,343-asset pair diversity (317k 1m candles) and short-window rollups that don't yet amortize. 
`assets` implemented with `String` (not `FixedString`) columns to match the writer contract. See task 0060 `notes/G-measurement-results.md`. | -| 2026-05-20 | All sections + Appendices A & B | [ADR 0007](../../lore/2-adrs/0007_live-data-sink-on-shared-hetzner-clickhouse.md) (accepted) · [Task 0049](../../lore/1-tasks/active/0049_DOCS_overview-rewrite-for-adr-0007.md) | **Live data sink flipped from Prices-owned RDS PostgreSQL 16 to BE's shared Hetzner ClickHouse cluster** (separate `prices` database, isolated via CH multi-tenant primitives). Schema rewritten to per-source `ReplacingMergeTree(version)` rows on per-granularity tables (`price_ohlcv_1m`, `_15m`, …, `_1M`) feeding a materialised-view rollup chain that eliminates the OHLCV Rollup Lambda. Cleanup becomes `ALTER TABLE … DROP PARTITION`. All 14 mermaid blocks (including Appendices A and B) updated to ClickHouse types, engines, sort keys, MV chain, and the mTLS edge. RDS sizing/scaling ladder removed; Hetzner cost-share added (~$1-2/env/mo per task 0046). 
| +| Date | Sections | Driver | Summary | +| ---------- | -------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| 2026-06-22 | §3.2 (`close_usd` col + views), §13 | [Task 0061](../../lore/1-tasks/archive/0061_FEATURE_historical-usd-close-price-series/README.md) | **Documented the historical USD close surface.** Added the `close_usd Decimal(38,14) DEFAULT 0` column (`= oracle_usd × close`, baked in at enrichment time) to the `price_ohlcv_*` DDL, and a new §3.2 subsection covering the BE-facing read-surface VIEWs — `prices.price_usd_series` / `_1h` (volume-weighted `close_usd` per natural identity + bucket), `prices.usd_reference` / `_1h` (per-bucket XLM/USDC "reference is up at T" signal), and `prices.identity_by_contract` (SAC read-seam resolver) — with the read-time `ok` / `no_asset_price` / `no_reference` status discriminator, caller-owned grain selection, and the load-bearing USDC-issuer literal. Source of truth: `packages/prices-clickhouse/schema/views.sql`. 
| +| 2026-06-19 | §1.2, §8.3, §8.5 | [Task 0063](../../lore/1-tasks/active/0063_FEATURE_provision-prices-db-on-hetzner-ch-self-served/README.md) | **Sizing + cost-share corrected from measurement.** Fresh 64k-ledger backfill (62016000-62079999) measured **114 MiB / ~1,872 B/ledger**; combined with task 0060's 10k+100k runs the real footprint is **~1.9-3.7 KB/ledger / ~3.5-6 GB/yr** (activity-dependent), superseding the 0046 ~74 B/ledger / ~0.45 GB/yr estimate. Cost-share raised ~$1-2 → **~$8-11/env/mo** (~10-15% pro-rata). Added a shared-vs-dedicated-container cost table; dedicated container ~2× cost **and** breaks BE's in-cluster `price_usd_series` JOIN — shared stays correct. See task 0063 `notes/G-64k-sizing-remeasure.md`. | +| 2026-06-11 | §3.2 §3.0, Schema source-of-truth refs | [Task 0060](../../lore/1-tasks/active/0060_FEATURE_prices-clickhouse-crate-combined-backfill-sizing/README.md) | **Schema implemented as the `packages/prices-clickhouse` crate** (`schema/init.sql` = 12 tables, source of truth; `rollups.sql` = refreshable-MV chain; `preroll.sql` = full-range re-aggregate). Built + applied on a local ClickHouse 25.6 and validated by a combined SDEX + soroban (oracle) backfill. **Sizing finding:** measured ~3.6 KB/ledger over a 10k-ledger sample (≈48× the prior 74 B/ledger task-0046 estimate), driven by ~4,343-asset pair diversity (317k 1m candles) and short-window rollups that don't yet amortize. `assets` implemented with `String` (not `FixedString`) columns to match the writer contract. See task 0060 `notes/G-measurement-results.md`. | +| 2026-05-20 | All sections + Appendices A & B | [ADR 0007](../../lore/2-adrs/0007_live-data-sink-on-shared-hetzner-clickhouse.md) (accepted) · [Task 0049](../../lore/1-tasks/active/0049_DOCS_overview-rewrite-for-adr-0007.md) | **Live data sink flipped from Prices-owned RDS PostgreSQL 16 to BE's shared Hetzner ClickHouse cluster** (separate `prices` database, isolated via CH multi-tenant primitives). 
Schema rewritten to per-source `ReplacingMergeTree(version)` rows on per-granularity tables (`price_ohlcv_1m`, `_15m`, …, `_1M`) feeding a materialised-view rollup chain that eliminates the OHLCV Rollup Lambda. Cleanup becomes `ALTER TABLE … DROP PARTITION`. All 14 mermaid blocks (including Appendices A and B) updated to ClickHouse types, engines, sort keys, MV chain, and the mTLS edge. RDS sizing/scaling ladder removed; Hetzner cost-share added (~$1-2/env/mo per task 0046). | --- @@ -153,9 +155,11 @@ are pushed to the Hetzner cluster via separate post-backfill tools. **Why ClickHouse on a BE-shared cluster (ADR 0007):** - Eliminates one production DB the prices-api would otherwise own (RDS). -- Cost-share at empirical scale (~0.45 GB/yr, ~74 bytes/ledger, 14.8× compression - per task 0046) is ~1-2% pro-rata, i.e. ~$1–2/env/mo vs. $12+/mo for the - smallest RDS instance and substantially more at any scale-up tier. +- Cost-share at **measured** scale (~3.5-6 GB/yr; ~1.9-3.7 KB/ledger across three + backfill windows — tasks 0060 + 0063, **superseding** the 0046 ~74 B/ledger + estimate) is ~10-15% pro-rata, i.e. ~$8-11/env/mo — still below the $12+/mo of the + smallest RDS instance (and RDS costs substantially more at any scale-up tier); the + storage itself is trivial for a 1 TB Hetzner box. - Columnar storage + `LowCardinality(String)` for the `source` column drives down per-row footprint for the per-source OHLCV shape (ADR 0004). - Materialised-view rollup chain replaces a scheduled Lambda — one fewer moving @@ -393,6 +397,12 @@ CREATE TABLE prices.price_ohlcv_1m ( -- multiplied into volume_quote_usd by the -- enrichment Lambda (task 0026) volume_quote_usd Decimal(38, 14) DEFAULT 0, -- USD-denominated; filled by task 0026 + close_usd Decimal(38, 14) DEFAULT 0, -- historical USD close (task 0061); + -- close_usd = oracle_usd × close, baked + -- in at enrichment time (DEFAULT 0 until + -- the enrichment pass fills it, mirroring + -- volume_quote_usd). 
Surfaced to BE via + -- the prices.price_usd_series* views below vwap Decimal(38, 14), -- single-source, single-minute VWAP -- (volume_quote / volume_base); -- see §5.5 of the main overview @@ -563,6 +573,89 @@ target table (`_1d`, `_1h`, …) — they coexist with the rollup MVs, whose bounded refresh window (see §3.2) only re-aggregates _recent_ buckets, leaving historical backfilled partitions untouched. +#### Read-surface views — historical USD close series (task 0061) + +The `close_usd` column above is the per-candle historical USD price BE +requested (BE task 0199 / our task 0061 — see +[R-historical-usd-close-design](../../lore/1-tasks/archive/0061_FEATURE_historical-usd-close-price-series/notes/R-historical-usd-close-design.md)). +BE does **not** read the OHLCV tables directly; the contract is a set of +**plain `VIEW`s** (no special CH version needed, unlike the refreshable rollup +MVs) defined in `packages/prices-clickhouse/schema/views.sql`, applied after +`init.sql`. They resolve the internal `asset_id` surrogate to the **portable +natural Stellar identity** (`asset_kind ∈ ('native','credit','contract')`, +`asset_code`, `issuer_address`, `contract_address`) so the surface survives +asset-id reassignment. 
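
A reader-side sketch of that identity mapping, assuming the caller already holds the three natural-key columns of a `prices.assets` row. It mirrors the `multiIf` / `if` projection used in the `price_usd_series` definition in this section. `NaturalIdentity` and `natural_identity` are hypothetical names chosen for illustration, not part of the crate, and the contract address in the usage lines is a placeholder rather than a real contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaturalIdentity:
    """Portable Stellar identity tuple exposed by the read-surface views."""
    asset_kind: str          # 'native' | 'credit' | 'contract'
    asset_code: str
    issuer_address: str
    contract_address: str


def natural_identity(asset_code: str, issuer_address: str,
                     contract_address: str) -> NaturalIdentity:
    """Classify an assets row the same way the view's multiIf does."""
    if contract_address != "":
        # Contract assets are keyed by contract address alone;
        # the view blanks asset_code and issuer_address for them.
        return NaturalIdentity("contract", "", "", contract_address)
    if asset_code == "XLM" and issuer_address == "":
        return NaturalIdentity("native", asset_code, issuer_address, "")
    return NaturalIdentity("credit", asset_code, issuer_address, "")


xlm = natural_identity("XLM", "", "")
usdc = natural_identity(
    "USDC", "GA5ZSEJYB37JRC5AVCIA5MOP4RHTM335X2KGX3IHOJAPP5RE34K4KZVN", "")
# Placeholder contract address, for illustration only.
sac = natural_identity("ANY", "GISSUER", "CPLACEHOLDER")
print(xlm.asset_kind, usdc.asset_kind, sac.asset_kind)
```

Because the tuple is built only from natural keys, two rows that describe the same asset under different `asset_id` values map to the same `NaturalIdentity`, which is the property the views rely on.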
+ +| View | Grain | Returns | Purpose | +| ----------------------------- | ------ | ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- | +| `prices.price_usd_series` | daily | `close_usd` per (natural identity, day bucket) | long-range USD charts | +| `prices.price_usd_series_1h` | hourly | `close_usd` per (natural identity, hour bucket) | read-time TVL keyed to a ledger's `closed_at` | +| `prices.usd_reference` | daily | `xlm_usd` per day bucket | per-bucket "USD reference is up at T" signal | +| `prices.usd_reference_1h` | hourly | `xlm_usd` per hour bucket | hourly companion to the above | +| `prices.identity_by_contract` | — | contract → natural identity | SAC read-seam resolver (§12.4): map a Soroban-DEX pool leg's contract address to the natural identity to look up | + +```sql +-- One volume-weighted USD close per (natural identity, day bucket). The +-- cross-source/cross-quote collapse: volume-weighted close_usd over every candle +-- of the asset in the bucket (ADR 0004 per-source rows merge at read time). Only +-- priced rows (close_usd > 0). _1h is identical but reads price_ohlcv_1h. 
+CREATE VIEW IF NOT EXISTS prices.price_usd_series AS +SELECT + multiIf( + a.contract_address != '', 'contract', + a.asset_code = 'XLM' AND a.issuer_address = '', 'native', + 'credit') AS asset_kind, + if(a.contract_address != '', '', a.asset_code) AS asset_code, + if(a.contract_address != '', '', a.issuer_address) AS issuer_address, + a.contract_address AS contract_address, + p.timestamp AS bucket, + CAST(sum(toFloat64(p.close_usd) * toFloat64(p.volume_base)) + / nullIf(sum(toFloat64(p.volume_base)), 0) AS Decimal(38, 14)) AS close_usd +FROM prices.price_ohlcv_1d AS p FINAL +INNER JOIN prices.assets AS a FINAL ON a.asset_id = p.asset_id +WHERE p.close_usd > 0 +GROUP BY asset_kind, asset_code, issuer_address, contract_address, bucket; + +-- The XLM/USDC volume-weighted close (XLM's USD price under the USDC≡$1 peg) per +-- bucket. A bucket's PRESENCE is the durable "USD reference is up at T" signal. +-- Reads `close` (always present from the backfill), independent of enrichment. +CREATE VIEW IF NOT EXISTS prices.usd_reference AS +SELECT + p.timestamp AS bucket, + CAST(sum(toFloat64(p.close) * toFloat64(p.volume_base)) + / nullIf(sum(toFloat64(p.volume_base)), 0) AS Decimal(38, 14)) AS xlm_usd +FROM prices.price_ohlcv_1d AS p FINAL +INNER JOIN prices.assets AS base FINAL ON base.asset_id = p.asset_id +INNER JOIN prices.assets AS quote FINAL ON quote.asset_id = p.quote_asset_id +WHERE base.asset_code = 'XLM' AND base.issuer_address = '' AND base.contract_address = '' + AND quote.asset_code = 'USDC' + AND quote.issuer_address = 'GA5ZSEJYB37JRC5AVCIA5MOP4RHTM335X2KGX3IHOJAPP5RE34K4KZVN' + AND p.close > 0 +GROUP BY p.timestamp; +``` + +**Read-time status discriminator (computed by the reader, not stored).** A view +cannot enumerate (asset × bucket) combinations that never traded, so a miss is a +missing row (NULL after the consumer's `LEFT JOIN`), never an error and never a +dropped row. 
For a lookup of (identity I, bucket T), the consumer LEFT JOINs +`price_usd_series` against `usd_reference` at the matching grain: + +- `ok` — row present in `price_usd_series` for (I, T). +- `no_asset_price` — (I, T) absent **but** `usd_reference` has bucket T (the USD + reference is up; partial TVL is valid). +- `no_reference` — (I, T) absent **and** `usd_reference` has no bucket T + (systemic blackout — every XLM-pivot asset is NULL). + +**Grain ownership.** Grain selection is the **caller's** — the consumer JOINs +whichever grain (`_1h` vs daily) its query needs; the views stay a dumb, fast, +retention-agnostic surface. The finest-retained-for-T routing lives one layer up +in the point-lookup HTTP endpoint (`price_usd_at`, task 0040), not in the views. + +> **USDC issuer literal is load-bearing.** The issuer address in `usd_reference` +> is a hand-synced copy of `prices_clickhouse::USDC_ISSUER` (SQL cannot reference +> a Rust const). If the canonical address ever changes, update it in the views +> **and** that const together, or the views and the writer diverge. + ### 3.3 `prices.current_prices` — Materialised / cached current state One row per asset. Written by the **Current Price Updater** Lambda @@ -1349,18 +1442,26 @@ instance on a single Hetzner box behind Caddy:443). Prices-api joins as a second tenant via its own `prices` database, isolated by ClickHouse's native multi-tenant primitives (database, user, quota, profile). 
-| Metric | Value | Source | -| ---------------------------------- | ----------------------------------------------------------- | --------------------------------------- | -| Prices-api storage footprint | ~0.45 GB/year flat-growth | Task 0046 empirical | -| Average per-ledger storage | ~74 bytes/ledger | Task 0046 empirical | -| Compression ratio (LZ4 + sort-key) | ~14.8× | Task 0046 empirical on `soroban_events` | -| Write rate | ~1 INSERT per ledger (~12k/day per env at mainnet cadence) | §6.1 | -| Read rate | API-Gateway-throttled ≤100 req/s per key, cached at gateway | §8.2 | +| Metric | Value | Source | +| ---------------------------- | --------------------------------------------------------------------------- | -------------------------- | +| Prices-api storage footprint | **~3.5-6 GB/year** (realistic, retention-amortised) | Tasks 0060 + 0063 measured | +| Average per-ledger storage | **~1.9-3.7 KB/ledger** (activity-dependent, ~2× spread) | Tasks 0060 + 0063 measured | +| Strongest size lever | Retention-cap `_1h`/`_4h` → bounds DB at ~9 GB @ 10yr (vs ~43 GB unbounded) | Task 0060 measured | +| Write rate | ~1 INSERT per ledger (~12k/day per env at mainnet cadence) | §6.1 | +| Read rate | API-Gateway-throttled ≤100 req/s per key, cached at gateway | §8.2 | + +> **Sizing superseded (2026-06-19).** The original ~74 B/ledger / ~0.45 GB/yr +> figure was the task-0046 _per-event estimate_. Three ground-truth backfill +> measurements (0060: 10k @ 62966000+ and 100k @ 62882700+; 0063: 64k @ +> 62016000+) put the real footprint at **~1.9-3.7 KB/ledger** — ~25-50× higher, +> driven by trading-pair diversity (thousands of low-volume tokens, unfiltered) +> rather than ledger count. Still small in absolute terms. See task 0063 +> `notes/G-64k-sizing-remeasure.md` and task 0060 `notes/G-measurement-results.md`. Hardware sizing, OS-level tuning, and any vertical/horizontal scaling -decisions are owned by BE. 
Prices-api's contribution to the box is -empirically light; the tier choice is driven by BE's `default.*` footprint, -not by `prices.*`. +decisions are owned by BE. Prices-api's contribution to the box is now +~10-15% of the data-plane storage (still well within a single Hetzner box); +the tier choice is driven by BE's `default.*` footprint, not by `prices.*`. ### 8.4 Capacity contention — fallback to sidecar CH @@ -1383,15 +1484,15 @@ of the prices-api budget at any traffic level. Monthly running cost (low traffic, post-backfill): -| Service | Estimated Cost | Notes | -| ------------------------------------- | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| Hetzner CH cost-share for `prices` DB | ~$1–$2/env/mo | Opening proposal ~1-2% pro-rata per task 0046; flat fee acceptable up to ~$5/env per the brief without changing the recommendation. D12 commercial follow-up | +| Service | Estimated Cost | Notes | +| ------------------------------------- | ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Hetzner CH cost-share for `prices` DB | **~$8–$11/env/mo** | ~10-15% pro-rata on the **measured** ~3.5-6 GB/yr (tasks 0060 + 0063), superseding the ~$1-2/0046 figure. A dedicated prices CH container (ADR 0007 Alt-3) would run ~$16-25/env/mo (same disk + a reserved CH process) **and** break BE's in-cluster `price_usd_series` JOIN — so shared stays correct. 
D12 commercial follow-up | Backfill period additional costs (one-time, during 13-week project): -| Item | Configuration | One-time Cost | -| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------- | --------------- | -| Cloud DB during backfill | No RDS upgrade required (ADR 0007); the bursty pushes hit Hetzner CH instead. Empirically <1 GB extra (task 0046) — no marginal cost-share change | **$0 marginal** | +| Item | Configuration | One-time Cost | +| ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------- | +| Cloud DB during backfill | No RDS upgrade required (ADR 0007); the bursty pushes hit Hetzner CH instead. Measured ~114 MiB per 64k ledgers (task 0063) — a full recent-history backfill is single-digit GB, absorbed by BE's box with no marginal cost-share change | **$0 marginal** | Scaled-up at high traffic (DB-relevant): @@ -1567,14 +1668,15 @@ criteria from the delivery plan, restated against the canonical ## 13. 
Quick Reference — Tables at a Glance -| Table | Engine | Partitioning | Sort key | Written by | Read by | -| ---------------------------------------------------------------- | -------------------------------- | --------------------- | ------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | -| `prices.assets` | `ReplacingMergeTree(updated_at)` | none | `(asset_code, issuer_address, contract_address)` | Asset Discovery Lambda; Prices Ledger Processor (inline) | All asset/price endpoints | -| `prices.price_ohlcv_1m` | `ReplacingMergeTree(version)` | `toYYYYMM(timestamp)` | `(asset_id, quote_asset_id, source, timestamp)` | Prices Ledger Processor; backfill streams (sdex-cloud-push, soroban-amm completion push); Cleanup Worker (DROP PARTITION) | `GET /ohlcv` (1m timeframe), Current Price Updater, MV chain feeding rolled granularities | -| `prices.price_ohlcv_15m` / `_1h` / `_4h` / `_1d` / `_1w` / `_1M` | `ReplacingMergeTree(version)` | `toYYYYMM(timestamp)` | `(asset_id, quote_asset_id, source, timestamp)` | MV chain on `_1m`; backfill streams (for pre-rolled ranges) | `GET /ohlcv` (rolled granularities) | -| `prices.current_prices` | `ReplacingMergeTree(updated_at)` | none | `(asset_id)` | Current Price Updater Lambda | `GET /assets`, `GET /price`, `POST /prices/batch` | -| `prices.oracle_prices` | `ReplacingMergeTree` | `toYYYYMM(timestamp)` | `(asset_id, oracle_name, timestamp)` | Oracle Fetcher Lambda; Cleanup Worker (DROP PARTITION) | `GET /oracles/{asset}` | -| `prices.backfill_progress` | `ReplacingMergeTree(updated_at)` | none | `(task_name)` | Backfill cloud-push step — one row per stream | `GET /backfill/status` | +| Table | Engine | Partitioning | Sort key | Written by | Read by | +| 
---------------------------------------------------------------------------------- | -------------------------------- | --------------------- | ------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | +| `prices.assets` | `ReplacingMergeTree(updated_at)` | none | `(asset_code, issuer_address, contract_address)` | Asset Discovery Lambda; Prices Ledger Processor (inline) | All asset/price endpoints | +| `prices.price_ohlcv_1m` | `ReplacingMergeTree(version)` | `toYYYYMM(timestamp)` | `(asset_id, quote_asset_id, source, timestamp)` | Prices Ledger Processor; backfill streams (sdex-cloud-push, soroban-amm completion push); Cleanup Worker (DROP PARTITION) | `GET /ohlcv` (1m timeframe), Current Price Updater, MV chain feeding rolled granularities | +| `prices.price_ohlcv_15m` / `_1h` / `_4h` / `_1d` / `_1w` / `_1M` | `ReplacingMergeTree(version)` | `toYYYYMM(timestamp)` | `(asset_id, quote_asset_id, source, timestamp)` | MV chain on `_1m`; backfill streams (for pre-rolled ranges) | `GET /ohlcv` (rolled granularities) | +| `prices.current_prices` | `ReplacingMergeTree(updated_at)` | none | `(asset_id)` | Current Price Updater Lambda | `GET /assets`, `GET /price`, `POST /prices/batch` | +| `prices.oracle_prices` | `ReplacingMergeTree` | `toYYYYMM(timestamp)` | `(asset_id, oracle_name, timestamp)` | Oracle Fetcher Lambda; Cleanup Worker (DROP PARTITION) | `GET /oracles/{asset}` | +| `prices.backfill_progress` | `ReplacingMergeTree(updated_at)` | none | `(task_name)` | Backfill cloud-push step — one row per stream | `GET /backfill/status` | +| `prices.price_usd_series` / `_1h`, `usd_reference` / `_1h`, `identity_by_contract` | `VIEW` (plain, derived) | none (read-through) | n/a (defined over `price_ohlcv_1d` / `_1h`) | n/a — derived at read time from `close_usd` / 
`close` on the OHLCV tables (task 0061) | BE historical USD close series (BE task 0199); `price_usd_at` endpoint (task 0040) | --- diff --git a/lore/1-tasks/active/0063_FEATURE_provision-prices-db-on-hetzner-ch-self-served/notes/G-128k-parse-and-sizing-test.md b/lore/1-tasks/active/0063_FEATURE_provision-prices-db-on-hetzner-ch-self-served/notes/G-128k-parse-and-sizing-test.md new file mode 100644 index 0000000..cdb86b3 --- /dev/null +++ b/lore/1-tasks/active/0063_FEATURE_provision-prices-db-on-hetzner-ch-self-served/notes/G-128k-parse-and-sizing-test.md @@ -0,0 +1,208 @@ +--- +id: "G-128k-parse-and-sizing-test" +title: "G: 128k-ledger local parse-correctness + ClickHouse sizing test (SDEX + AMM + oracle)" +type: G +task: "0063" +status: mature +spawned_from: ["G-64k-sizing-remeasure"] +spawns: [] +tags: [clickhouse, sizing, measurement, parsing, sdex, amm, soroban, oracle, hetzner, local-only] +links: + - "../../../../docs/database-schema/database-schema-overview.md" + - "../../../archive/0060_FEATURE_prices-clickhouse-crate-combined-backfill-sizing/notes/G-measurement-results.md" + - "../../../../lore/2-adrs/0007_live-data-sink-on-shared-hetzner-clickhouse.md" +history: + - date: 2026-06-19 + status: mature + who: claude + note: > + Ran a fresh 128,000-ledger backfill (62848000-62975999, locally cached + partitions, --keep-partitions, effectively offline) through the real + prices-clickhouse production schema to verify SDEX + Soroban-AMM + oracle + parse correctness and measure ground-truth per-table sizing. Fourth + ground-truth data point alongside 0060's 10k+100k and 0063's prior 64k. + Key findings: SDEX fully working (16.5M trades); AMM path works but + historical coverage is in-window-pools-only (864 Aquarius ticks, 0 + Phoenix/Soroswap); full-schema 4.13 KiB/ledger at this high-activity range + (~22 GiB/yr). LOCAL ONLY — nothing pushed/deployed. 
+--- + +# 128k-ledger local parse + ClickHouse sizing test (SDEX + AMM + oracle) + +> **Status:** LOCAL ONLY — not pushed, not deployed to Hetzner. Production-ready +> schema applied to a local ClickHouse mirror for parse-correctness verification +> and ground-truth sizing. + +- **Date:** 2026-06-19 +- **Branch:** `feat/0063_provision-prices-db-on-hetzner-ch-self-served` +- **ClickHouse:** `clickhouse/clickhouse-server:25.6` (local docker, `prices` db) +- **Tool:** `target/release/sdex-backfill` (unified single-pass SDEX + Soroban-AMM + oracle extractor) +- **Source data:** already-downloaded `.xdr.zst` ledgers in `.temp/sdex-backfill/` + (preserved via `--keep-partitions`) +- **Raw artifacts:** `.temp/0063-test/` (`run.log`, `time.log`, `start.txt`, `end.txt`, `SUMMARY.md`) + +## 1. Scope of the run + +| Item | Value | +|------|-------| +| Ledger range | **62,848,000 – 62,975,999** (contiguous) | +| Ledgers indexed | **128,000** (2 partitions: FC4103FF + FC4009FF) | +| Mainnet time covered | 2026-06-02 12:46 UTC → 2026-06-11 03:36 UTC (**≈ 8.62 days**) | +| Network | Effectively offline — FC4103FF locally complete (64,000 files); FC4009FF needed a 32-file top-up `aws s3 sync` (read-only public archive). `time -v`: 25.4 GiB filesystem **input** vs 17 MiB output → disk-read-bound, not download-bound. | + +Range chosen deliberately: largest contiguous block of already-downloaded ledgers +**and** overlapping the only window where prior runs (task 0060) saw live AMM +activity — so it exercises both the SDEX and AMM paths. + +## 2. Parse correctness — does extraction work? 
✅ (with one AMM caveat) + +A single parse pass over the window produced, per stream: + +| Stream | Trade events | 1m candles (deduped, `FINAL`) | +|--------|-------------:|------------------------------:| +| **SDEX** (classic ClaimAtom) | **16,542,876** | 5,146,672 | +| **AMM — Aquarius** (Soroban) | **864** | 616 | +| **AMM — Phoenix** (Soroban) | **0** | 0 | +| **AMM — Soroswap** (Soroban) | **0** | 0 | +| **Oracle — RedStone** | 15,527 rows | 1 asset | +| **Oracle — Reflector** | 7,377 rows | 3 assets | + +- **SDEX: fully working.** 16.5M trades aggregated into clean per-minute OHLCV + across **13,979** base assets / **3,728** quote assets; sampled candles have + valid `open/high/low/close/volume`. 61,428 raw claims correctly filtered as + "zero amount" (the only WARN in the log). +- **AMM: path works end-to-end, coverage window-limited.** The 864 Aquarius ticks + prove the full Soroban chain: `LedgerCloseMeta` → event extraction → + `dispatch()` → `aquarius-extractor` → `price_ohlcv_1m (source='aquarius')`. + Phoenix & Soroswap = **0 — not a parser bug.** The backfill builds its AMM pool + registry from **factory events seen inside the indexed window** (`new_pair` / + `add_pool` / `create`). Phoenix/Soroswap pools were created before ledger + 62,848,000, so they are never registered and their swaps are silently skipped. + Aquarius appears only because some Aquarius pools were `add_pool`-created + in-window. (Documented limitation, task 0060.) +- **Oracle: working** — both RedStone and Reflector captured. + +> **Dedup near-perfect:** raw `price_ohlcv_1m` = 5,147,375 → after `OPTIMIZE … +> FINAL` = 5,147,288 (only 87 duplicate versions collapsed). + +### AMM follow-up (to get full AMM coverage) +Historical Phoenix/Soroswap (and complete Aquarius) coverage needs the pool registry +**seeded ahead of the window** — a from-genesis factory-event replay, or seeding +from BE's `soroban_events`. Until then, historical backfill AMM = "pools born +in-window only". 
Live/tip ingestion is unaffected (it sees factory events as they +happen). + +## 3. Timing / performance + +Measured with `/usr/bin/time -v` plus the tool's own `elapsed` counter. + +| Metric | Value | +|--------|-------| +| Wall clock (backfill `elapsed`) | **2,397 s** (39 min 57 s) | +| **Per ledger** | **18.73 ms/ledger** | +| **Throughput** | **53.4 ledgers/s** ≈ **6,900 trade events/s** | +| User CPU | 1,773.8 s | +| Sys CPU | 22.1 s | +| CPU utilisation | 74% (largely single-core; parse is serial) | +| Peak RSS | **98,344 KiB ≈ 96 MiB** | +| FS read | ≈ 25.4 GiB | +| FS write | ≈ 17 MiB (CH writes go over HTTP, not counted here) | + +Single-core + disk-read bound — not memory or network bound. + +## 4. Database size (production-ready schema) + +Schema applied exactly as production: `init.sql` (tables) + `views.sql` +(read-surface views) + `seed.sql` (progress bootstrap). Coarse granularities +populated with `preroll.sql`, then `OPTIMIZE … FINAL` on every table. + +> Production's live tip uses the **refreshable** MV chain in `rollups.sql`, +> intentionally **not** applied here: a refreshable MV *replaces* its target with +> only a `now() − 2h` window, which would wipe historical backfilled partitions. +> `preroll.sql` produces identical coarse-table contents for historical data. 
+ +### Per-table (active parts, post-`OPTIMIZE FINAL`) + +| Table | Rows | Compressed | Uncompressed | Ratio | +|-------|-----:|-----------:|-------------:|------:| +| `price_ohlcv_1m` | 5,147,288 | **274.91 MiB** | 672.53 MiB | 2.45× | +| `price_ohlcv_15m` | 2,054,096 | 120.21 MiB | 268.38 MiB | 2.23× | +| `price_ohlcv_1h` | 1,128,247 | 68.84 MiB | 147.41 MiB | 2.14× | +| `price_ohlcv_4h` | 530,625 | 33.40 MiB | 69.33 MiB | 2.08× | +| `price_ohlcv_1d` | 173,835 | 11.35 MiB | 22.71 MiB | 2.00× | +| `price_ohlcv_1w` | 55,354 | 3.75 MiB | 7.23 MiB | 1.93× | +| `price_ohlcv_1M` | 33,870 | 2.15 MiB | 4.43 MiB | 2.06× | +| `assets` | 14,339 | 1.36 MiB | 1.94 MiB | — | +| `backfill_sdex_ledgers` | 128,000 | 502.38 KiB | 500.00 KiB | — | +| `oracle_prices` | 22,904 | 409.51 KiB | 6.19 MiB | 15.5× | +| `backfill_progress` | 2 | 350 B | 125 B | — | +| **TOTAL (active)** | **9,288,560** | **516.87 MiB** | **1.17 GiB** | 2.32× | + +- On-disk `store/` dir (`du`): **1.7 GiB** — higher than 516.87 MiB because it + still holds inactive pre-merge parts pending cleanup + system tables. The + **authoritative compressed footprint is 516.87 MiB** (active parts). + +### Per-ledger sizing (this range) + +| Metric | Value | +|--------|------:| +| Full schema (all tables) | **4.13 KiB/ledger** | +| `price_ohlcv_1m` only | 2.20 KiB/ledger | + +## 5. Production (Hetzner) projection + +Mainnet pace from the window: 128,000 / 8.62 days ≈ **14,850 ledgers/day ≈ 5.42M +ledgers/year**. + +| Projection | Full schema | `_1m` only | +|------------|------------:|-----------:| +| Per day | ≈ 60 MiB | ≈ 32 MiB | +| Per year | **≈ 21.9 GiB** | ≈ 11.6 GiB | + +### ⚠️ Activity-dependence caveat +This is a **trade-dense** range (~129 trades/ledger, ~40 candles/ledger). The +prior 64k measure (range 62,016,000, [[G-64k-sizing-remeasure]]) was only **1.87 +KiB/ledger** full-schema — ~2.2× lower density. 
+**Sizing scales with market
+activity, not ledger count.** Treat annual sizing as a **range, ~10–22 GiB/yr**,
+and size Hetzner storage toward the high end with headroom. Either way it is
+comfortably small for ClickHouse.
+
+## 6. Isolation / hygiene checks
+
+- All writes landed in `prices`; `default` database has **no tables** (no leakage).
+- `backfill_sdex_ledgers` holds exactly 128,000 contiguous sequences
+  (62,848,000–62,975,999) — complete, no gaps.
+- Downloaded ledger partitions **preserved** (`--keep-partitions`).
+- Nothing pushed to git or to Hetzner. Artifacts in `.temp/0063-test/`.
+
+## 7. Verdict
+
+| Area | Result |
+|------|--------|
+| SDEX extraction | ✅ Correct, high-volume, valid OHLCV |
+| AMM extraction (mechanism) | ✅ Works end-to-end (Aquarius proven) |
+| AMM historical coverage | ⚠️ In-window pools only — needs a seeded factory registry for full Phoenix/Soroswap history |
+| Oracle extraction | ✅ RedStone + Reflector captured |
+| Production schema fidelity | ✅ `init.sql`+`views.sql`+`preroll.sql`, `OPTIMIZE FINAL` |
+| Sizing ground truth | ✅ 516.87 MiB / 128k = 4.13 KiB/ledger (this range) |
+| Parse performance | ✅ 18.7 ms/ledger, 96 MiB RSS, single-core bound |
+
+**Schema and parse pipeline are production-ready.** The one substantive gap is
+**historical AMM pool discovery** (Phoenix/Soroswap) — a known, separately
+tracked limitation, not a defect in this run.
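
The derived timing and sizing figures quoted in §3 to §5 and in the verdict table can be cross-checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the repository: the constant and function names are illustrative, and the inputs are only the measured values quoted in this note (128,000 ledgers, 2,397 s, 8.62 days, 516.87 MiB total, 274.91 MiB for `price_ohlcv_1m`). It assumes binary units throughout (1 KiB = 1,024 B) and a 365-day year.

```python
# Cross-check of the derived figures, using only numbers quoted in this note.
# All names here are illustrative; nothing below exists in the repository.
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

LEDGERS = 128_000                # 62,848,000-62,975,999 inclusive
ELAPSED_S = 2_397                # backfill wall clock, seconds
WINDOW_DAYS = 8.62               # mainnet time spanned by the window
TOTAL_COMPRESSED_MIB = 516.87    # all tables, active parts
ONE_MIN_COMPRESSED_MIB = 274.91  # price_ohlcv_1m only


def per_ledger_kib(compressed_mib: float, ledgers: int) -> float:
    """Compressed footprint per ledger, in KiB."""
    return compressed_mib * MIB / ledgers / KIB


def projection(kib_per_ledger: float, ledgers_per_day: float) -> tuple[float, float]:
    """Return (MiB per day, GiB per 365-day year) at a given ledger pace."""
    bytes_per_day = kib_per_ledger * KIB * ledgers_per_day
    return bytes_per_day / MIB, bytes_per_day * 365 / GIB


ms_per_ledger = ELAPSED_S / LEDGERS * 1000
ledgers_per_s = LEDGERS / ELAPSED_S
ledgers_per_day = LEDGERS / WINDOW_DAYS

full_kib = per_ledger_kib(TOTAL_COMPRESSED_MIB, LEDGERS)
one_min_kib = per_ledger_kib(ONE_MIN_COMPRESSED_MIB, LEDGERS)
full_day_mib, full_year_gib = projection(full_kib, ledgers_per_day)
one_min_day_mib, one_min_year_gib = projection(one_min_kib, ledgers_per_day)

print(f"parse:  {ms_per_ledger:.2f} ms/ledger, {ledgers_per_s:.1f} ledgers/s")
print(f"pace:   {ledgers_per_day:,.0f} ledgers/day")
print(f"full:   {full_kib:.2f} KiB/ledger, "
      f"{full_day_mib:.1f} MiB/day, {full_year_gib:.1f} GiB/yr")
print(f"_1m:    {one_min_kib:.2f} KiB/ledger, "
      f"{one_min_day_mib:.1f} MiB/day, {one_min_year_gib:.1f} GiB/yr")
```

Because the sketch divides by 1,024 at every step, its annual figure can differ by a few percent from a projection that mixes decimal and binary prefixes. Swapping in another window's totals (for example the 64k run's 114.23 MiB over 64,000 ledgers) gives the low end of the sizing range.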
+
+## Appendix — reproduce
+
+```bash
+docker exec -i stellar-prices-api-clickhouse-1 clickhouse-client --query "DROP DATABASE IF EXISTS prices"
+for f in init views seed; do
+  docker exec -i stellar-prices-api-clickhouse-1 clickhouse-client --multiquery \
+    < packages/prices-clickhouse/schema/$f.sql
+done
+CLICKHOUSE_URL=http://localhost:8123 /usr/bin/time -v -o .temp/0063-test/time.log \
+  ./target/release/sdex-backfill --start 62848000 --end 62975999 \
+  --temp-dir .temp/sdex-backfill --keep-partitions > .temp/0063-test/run.log 2>&1
+docker exec -i stellar-prices-api-clickhouse-1 clickhouse-client --multiquery \
+  < packages/prices-clickhouse/schema/preroll.sql
+# OPTIMIZE FINAL every prices.* MergeTree, then run schema/measure.sql
+```
diff --git a/lore/1-tasks/active/0063_FEATURE_provision-prices-db-on-hetzner-ch-self-served/notes/G-64k-sizing-remeasure.md b/lore/1-tasks/active/0063_FEATURE_provision-prices-db-on-hetzner-ch-self-served/notes/G-64k-sizing-remeasure.md
new file mode 100644
index 0000000..288e0cf
--- /dev/null
+++ b/lore/1-tasks/active/0063_FEATURE_provision-prices-db-on-hetzner-ch-self-served/notes/G-64k-sizing-remeasure.md
@@ -0,0 +1,154 @@
+---
+id: "G-64k-sizing-remeasure"
+title: "G: prices.* footprint — fresh 64k-ledger ground-truth re-measure (capacity check for provisioning)"
+type: G
+task: "0063"
+status: mature
+spawned_from: ["G-provisioning-plan"]
+spawns: ["G-128k-parse-and-sizing-test"]
+tags: [clickhouse, sizing, capacity, cost, hetzner, measurement, shared-vs-sidecar]
+links:
+  - "../../../../docs/database-schema/database-schema-overview.md"
+  - "../../../archive/0060_FEATURE_prices-clickhouse-crate-combined-backfill-sizing/notes/G-measurement-results.md"
+  - "../../../archive/0046_RESEARCH_empirical-prices-ch-storage-estimate-from-10k-ledgers/notes/G-empirical-storage-estimate.md"
+  - "../../../../lore/2-adrs/0007_live-data-sink-on-shared-hetzner-clickhouse.md"
+history:
+  - date: 2026-06-19
+    status: mature
+    who: claude
+    note: >
+      Ran a fresh 64,000-ledger backfill (62016000-62079999, cached partition,
+      no download) through the real prices-clickhouse pipeline and measured
+      system.parts. Third ground-truth data point alongside task 0060's 10k +
+      100k runs. Confirms the ~KB/ledger reality (NOT the 74 B/ledger 0046
+      estimate) and refreshes the shared-vs-sidecar cost comparison.
+---
+
+# G: prices.* footprint — fresh 64k-ledger ground-truth re-measure
+
+## 0. Why this note exists
+
+A cost/architecture question (shared `prices` DB in BE's CH vs a dedicated
+prices CH container) was answered first from the **task-0046 estimate
+(~74 B/ledger, ~0.45 GB/yr)**. That estimate is a per-event projection and is
+**wrong by ~25-50×** — task 0060 already measured ~3.6 KB/ledger but used
+windows 62966000+ / 62882700+. This note adds an **independent 64k window**
+(62016000-62079999) measured end-to-end, so the provisioning decision rests on
+three real data points, not the superseded estimate.
+
+**Fully local** (the standing prepare-only / local-only constraint): docker
+ClickHouse 25.6 on `localhost:8123`; the partition was already on disk
+(`.temp/sdex-backfill/FC4DB5FF--62016000-62079999`, 64,000 files) so **no S3
+fetch**. No prod infra touched.
+
+## 1. Run parameters
+
+| Aspect | Value |
+|---|---|
+| Window | ledgers `62016000`-`62079999` (64,000; ~3.7 days mainnet) |
+| Pipeline | clean `prices.*` schema → `sdex-backfill` → `preroll.sql` (coarse rollups from `_1m FINAL`) → `OPTIMIZE … FINAL` → `measure.sql` |
+| Backfill wall-clock | **1,126 s (~18.8 min)**, cached partition (~17.6 ms/ledger — matches 0060's ~17 ms/ledger index rate) |
+| SDEX trade ticks | 3,361,790 (~52.5/ledger) |
+| AMM trade ticks | 13 (in-window-registry limitation, same as 0060) |
+| Oracle rows | 8,161 |
+| Distinct assets | 8,983 |
+| `close_usd` enrichment | **not run** (same as 0060) — column present but unpopulated, compresses to ~0; see caveat §5 |
+
+## 2. Measured footprint (compressed on disk, after OPTIMIZE FINAL)
+
+| Table | Rows | Disk | B/ledger |
+|---|---:|---:|---:|
+| price_ohlcv_1m | 1,385,272 | 53.16 MiB | 871 |
+| price_ohlcv_15m | 499,257 | 25.61 MiB | 419 |
+| price_ohlcv_1h | 286,021 | 16.65 MiB | 273 |
+| price_ohlcv_4h | 158,635 | 9.88 MiB | 162 |
+| price_ohlcv_1d | 70,615 | 4.62 MiB | 76 |
+| price_ohlcv_1w | 23,758 | 1.54 MiB | 25 |
+| price_ohlcv_1M | 23,758 | 1.54 MiB | 25 |
+| assets | 8,983 | 875.55 KiB | 14 |
+| backfill_sdex_ledgers | 64,000 | 251.19 KiB | 4 |
+| oracle_prices | 7,971 | 132.07 KiB | 2 |
+| backfill_progress | 2 | 350 B | — |
+| **TOTAL** | **2,528,272** | **114.23 MiB** | **≈1,872** |
+
+## 3. Three-window comparison — per-ledger cost is activity-driven
+
+| Sample | Window | SDEX ticks/ledger | Assets | `_1m` candles/ledger | **B/ledger** |
+|---|---|---:|---:|---:|---:|
+| 0060 calib (10k) | 62966000+ | 122 | 4,343 | 31.7 | 3,597 |
+| 0060 full (100k) | 62882700+ | 116 | 12,770 | 35.9 | 3,677 |
+| **This run (64k)** | 62016000+ | 53 | 8,983 | 21.6 | **1,872** |
+
+The driver is **trading-pair diversity + trade density**, not ledger count.
+This window is an earlier, ~half-as-active period, hence ~1,872 vs ~3,677. Real
+per-ledger cost is **window/time-dependent, ~1.9-3.7 KB/ledger** — a ~2× spread.
+All three are **25-50× the 0046 ~74 B/ledger estimate**, which is now
+superseded for sizing/cost purposes.
+
+## 4. Corrected annual projection (per env)
+
+| Basis | Year 1 | Notes |
+|---|---:|---|
+| Naive per-ledger (this 64k, low-activity) | ~11.8 GB | all grains forever |
+| Naive per-ledger (0060, higher activity) | ~23 GB | upper end |
+| **0060 per-bucket refined** (rollups amortize) | **~5-6 GB** | realistic, higher activity |
+| Scaled to this window's activity | **~3-4 GB** | realistic, this window |
+| With `_1h`/`_4h` retention cap @ 1yr | **~9 GB @ 10yr** (vs ~43 GB unbounded) | strongest size lever (0060) |
+
+Realistic Year-1 ≈ **3.5-6 GB/env** — an order of magnitude above 0046's
+0.45 GB/yr, still trivial for a 1 TB Hetzner box. Levers (from 0060):
+**retention-cap `_1h`/`_4h`** bounds growth; **top-500 asset filter** keeps
+93% of trades and ~halves rollup growth.
+
+## 5. Caveats
+
+- **`close_usd` / USD-series unpopulated.** The BE-consumed historical USD
+  prices ride as the `close_usd` column on these candle rows (task 0061), not a
+  new row class. Enrichment wasn't run, so that column is ~0 and compresses to
+  near-nothing here — when populated it adds only a few % (one Decimal/row across
+  grains). Footprint is dominated by candle/pair diversity, not the USD layer.
+  A follow-up enrich-then-measure would quantify the exact `close_usd` delta.
+- **AMM candles ≈ 0** (13 ticks) — Phoenix/Soroswap/Aquarius pools created
+  before the window are unresolved without a historical factory-replay registry
+  seed. Real production with full AMM coverage is **higher** than measured here.
+- **REDSTONE** stored as raw payload (price decode deferred).
+
+## 6. Cost comparison — shared `prices` DB vs dedicated prices container
+
+Box = BE's AX52 (€69/mo). Prices' *incremental* disk cost on an
+already-sized box is ~$0; the figures are the fair **goodwill pro-rata
+cost-share** (shared) vs **effective resource cost** (separate container). Disk
+bytes are identical both ways; the delta is dedicated RAM/CPU + the contract
+break. Corrected for the measured ~3.5-6 GB/yr (prices ≈ 10-15% of a ~40 GB/yr
+data plane), not the old ~1%.
+
+| Component | **Shared `prices` DB (current / ADR 0007)** | **Dedicated prices container (Alt-3)** |
+|---|---|---|
+| Measured storage | ~3.5-6 GB/yr → ~10-15% of data plane | same bytes |
+| Storage pro-rata of box | ~10-15% × €69 ≈ **$8-11/env/mo** | ~$8-11 (same) |
+| Dedicated CH RAM/CPU | $0 (uses BE headroom) | **+$8-12/env/mo** — reserves ~8-16 GB RAM (mark cache/merges/queries) + merge threads, idle or not |
+| Box-tier upgrade pressure | none | possible **AX52→AX102** (~+€41/mo shared) |
+| **Blended $/env/mo** | **~$8-11** | **~$16-25** |
+| **× 3 envs** | **~$24-33/mo** | **~$48-75/mo** |
+| Ops surface | 1 CH, 1 `users.d`, 1 backup | 2 CH (lockstep upgrades), 2 `users.d`, 2 backups |
+| **BE in-cluster `price_usd_series` JOIN (0199 contract)** | ✅ works | ❌ **breaks** — needs cross-server query / HTTP sync the contract rejected |
+
+**Bottom line:** the corrected footprint raises the honest cost-share with BE
+from ~1%/$1-2 to ~10-15%/**$8-11 per env/mo**, but does **not** change the
+architecture: even ~9 GB at 10 years (with the `_1h`/`_4h` cap) is trivial for
+the shared box, while a dedicated container costs ~2× more **and** breaks the
+agreed in-cluster USD-views JOIN. Sidecar stays the **task-0047-RED fallback
+only**, per ADR 0007.
+
+## 7. Reproduction
+
+```bash
+docker compose up -d clickhouse
+curl -s localhost:8123 --data-binary "DROP DATABASE IF EXISTS prices"
+CLICKHOUSE_URL=http://localhost:8123 cargo run -q -p prices-clickhouse --bin prices-clickhouse-init
+CLICKHOUSE_URL=http://localhost:8123 ./target/release/sdex-backfill --start 62016000 --end 62079999
+CONT=$(docker compose ps -q clickhouse)
+docker exec -i "$CONT" clickhouse-client --multiquery < packages/prices-clickhouse/schema/preroll.sql
+# OPTIMIZE … FINAL each prices.* table, then:
+curl -s localhost:8123 --data-binary "$(cat packages/prices-clickhouse/schema/measure.sql)"
+```
diff --git a/lore/1-tasks/active/0063_FEATURE_provision-prices-db-on-hetzner-ch-self-served/notes/G-provisioning-plan.md b/lore/1-tasks/active/0063_FEATURE_provision-prices-db-on-hetzner-ch-self-served/notes/G-provisioning-plan.md
index f944bce..0fd34d3 100644
--- a/lore/1-tasks/active/0063_FEATURE_provision-prices-db-on-hetzner-ch-self-served/notes/G-provisioning-plan.md
+++ b/lore/1-tasks/active/0063_FEATURE_provision-prices-db-on-hetzner-ch-self-served/notes/G-provisioning-plan.md
@@ -5,7 +5,7 @@ type: G
 task: "0063"
 status: mature
 spawned_from: ["G-be-prices-db-rbac-ask"]
-spawns: []
+spawns: ["G-64k-sizing-remeasure"]
 related_notes:
   - "../../../backlog/0050_FEATURE_be-side-prep-sns-mtls-prices-db-provisioning/notes/G-be-prices-db-rbac-ask.md"
 links: