Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 4 additions & 6 deletions .github/workflows/e2e-mw-dev.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,7 @@
#
# Replaces what kind (tests/k8s/) cannot exercise: real Cilium network
# policies, real Crossplane Duckling provisioning, real cnpg-shard + external
# RDS metadata stores, and the real per-org Lakekeeper operator — the layers
# where this quarter's bugs actually lived (Cilium egress, lakekeeper
# encryption-key drift, cnpg role drift, RBAC delete gaps).
# RDS metadata stores — the layers kind cannot model.
#
# Flow:
# 1. build arm64-only worker + control-plane images, tagged pr-<N>-<sha>,
Expand All @@ -17,7 +15,7 @@
# ClusterIP service (no public DNS / NLB needed). Covers the cnpg-shard
# and external metadata backends.
# 5. always tear the namespace down (and deprovision the ci-pr ducklings so
# no S3 / cnpg role / lakekeeper CR leaks on shared infra).
# no S3 / cnpg role leaks on shared infra).
#
# SECURITY — who can run this:
# Same model as the AWS/OIDC job in ci.yml: the gate is the repo setting
Expand Down Expand Up @@ -201,8 +199,8 @@ jobs:
uses: azure/setup-kubectl@776406bce94f63e41d621b960d78ee25c8b76ede # v4.0.1
- name: Update kubeconfig
run: aws eks update-kubeconfig --name "$CLUSTER_NAME" --region us-east-1 --alias "$KUBE_CONTEXT"
# Deprovision the ci-pr ducklings (so no S3 / cnpg role+db / lakekeeper CR
# leaks on shared infra) then delete the namespace.
# Deprovision the ci-pr ducklings (so no S3 / cnpg role+db leaks on shared
# infra) then delete the namespace.
- name: Teardown
run: bash tests/e2e-mw-dev/run.sh teardown

Expand Down
15 changes: 7 additions & 8 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -126,7 +126,7 @@ exercises it (and update it if the path moved) — a refactor that quietly drops
e2e coverage is a regression in the test suite even if behavior is unchanged. Unit/package tests are necessary but not sufficient: a
change is only "done" once it is exercised against the real mw-dev cluster —
real worker pods, real Crossplane ducklings, real cnpg/RDS metadata, real
Lakekeeper, real S3/Iceberg/STS. "Solid" means a deterministic pass/fail
S3/STS. "Solid" means a deterministic pass/fail
assertion of the actual user-visible behavior (not just "it didn't error"), with
transient/cold-pool conditions handled, on both metadata backends (cnpg + ext)
where it touches metadata. A bugfix gets a regression assertion that would have
Expand All @@ -139,20 +139,19 @@ Three test lanes worth knowing about, in increasing order of blast radius:

- **Unit / package tests** (`go test ./...`): in-process, no external deps. Where most coverage lives. Includes `tests/manifests/` (static-manifest artifact asserts for `k8s/rbac.yaml` + `k8s/networkpolicy.yaml`).
- **`tests/integration/`** (`just test-integration`): spins up the standalone server binary against a real MinIO + Postgres metadata store via docker compose. Covers wire protocol, DuckLake on real S3-compatible storage, transpilation against a live server.
- **`tests/e2e-mw-dev/`** (per-PR GitHub workflow `e2e-mw-dev.yml`): the full multi-tenant activation pipeline against the **real posthog-mw-dev EKS cluster** — real Cilium, real Crossplane ducklings, real cnpg-shard + external-RDS metadata, real per-org Lakekeeper, real AWS S3/Iceberg. A shell harness (`harness.sh`) runs as an in-cluster Job per PR; `run.sh` orchestrates deploy/test/teardown/e2e-cleanup. **Replaces the retired kind suite** (`tests/k8s/`) — that suite's `k8s-integration-tests` CI job and its Go tests are gone; the supporting `k8s/` scripts/manifests + Dockerfiles are kept for now. See `tests/e2e-mw-dev/README.md`.
- **`tests/e2e-mw-dev/`** (per-PR GitHub workflow `e2e-mw-dev.yml`): the full multi-tenant activation pipeline against the **real posthog-mw-dev EKS cluster** — real Cilium, real Crossplane ducklings, real cnpg-shard + external-RDS metadata, real AWS S3. A shell harness (`harness.sh`) runs as an in-cluster Job per PR; `run.sh` orchestrates deploy/test/teardown/e2e-cleanup. **Replaces the retired kind suite** (`tests/k8s/`) — that suite's `k8s-integration-tests` CI job and its Go tests are gone; the supporting `k8s/` scripts/manifests + Dockerfiles are kept for now. See `tests/e2e-mw-dev/README.md`.

### When code changes obligate test changes

`tests/e2e-mw-dev/` is the only place we exercise the full activation pipeline (control plane → STS broker → worker pod → DuckDB → ATTACH against real cloud storage). If your change touches any of the following, treat updating the harness as part of the change, not a follow-up:

- `controlplane/shared_worker_activator.go`, `controlplane/sts_broker.go`, anything in the activation payload shape (`TenantActivationPayload`, `server.DuckLakeConfig`, `server.IcebergConfig`)
- `server/server.go::AttachDeltaCatalog`, `server.AttachIcebergCatalog`, `server.attachDuckLake*`, `server.refresh*Secret`
- `server/iceberg/` (config, dispatcher, backend implementations) — the harness provisions iceberg-enabled ducklings on both cnpg + ext backends and asserts the catalog attaches + reads/writes
- `controlplane/shared_worker_activator.go`, `controlplane/sts_broker.go`, anything in the activation payload shape (`TenantActivationPayload`, `server.DuckLakeConfig`)
- `server/server.go::AttachDeltaCatalog`, `server.attachDuckLake*`, `server.refresh*Secret`
- `controlplane/configstore/models.go` — new columns flow through the provisioning API the harness calls; exercise them via a provision body field
- `duckdbservice/activation.go`, `worker_activation.go` — worker-side activation order
- Any code path that wires AWS credentials through to DuckDB SECRETs

The contract: if the harness no longer exercises a path you changed, **update `harness.sh`**; if your change removes a path it asserts against, **delete the assertion**. The DuckLake round-trip / durability / concurrent-writers / iceberg activation checks in `harness.sh` are the load-bearing ones for catalog wiring — keep them honest.
The contract: if the harness no longer exercises a path you changed, **update `harness.sh`**; if your change removes a path it asserts against, **delete the assertion**. The DuckLake round-trip / durability / concurrent-writers checks in `harness.sh` are the load-bearing ones for catalog wiring — keep them honest.

## Dependencies

Expand Down Expand Up @@ -271,8 +270,8 @@ Invariants for anyone touching this path:
secrets — persistent ones AND non-persistent (plain/TEMPORARY `CREATE
SECRET`) ones, which pass through to the worker and would otherwise leak to
the next user. It preserves only the system-managed allowlist
(`usersecrets.IsReservedName`: `ducklake_s3`/`iceberg_sigv4`/`iceberg_oauth`
+ the `__default_*`/`duckgres_*` prefixes, which activation re-creates). It
(`usersecrets.IsReservedName`: `ducklake_s3` + the
`__default_*`/`duckgres_*` prefixes, which activation re-creates). It
MUST run before replay on every CreateSession in shared-warm mode, and a
wipe failure MUST fail the session.
- **Execute-then-persist ordering.** Persist only statements DuckDB accepted;
Expand Down
4 changes: 1 addition & 3 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -42,9 +42,7 @@ RUN : "${DUCKDB_EXTENSION_VERSION:?must be set}" \
| gunzip > "/build/duckdb-extensions/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/json.duckdb_extension" \
&& curl -fsSL "${POSTGRES_SCANNER_REPOSITORY}/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/postgres_scanner.duckdb_extension.gz" \
| gunzip > "/build/duckdb-extensions/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/postgres_scanner.duckdb_extension" \
&& curl -fsSL "${DUCKDB_EXTENSION_REPOSITORY}/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/iceberg.duckdb_extension.gz" \
| gunzip > "/build/duckdb-extensions/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/iceberg.duckdb_extension" \
&& for f in httpfs ducklake json postgres_scanner iceberg; do \
&& for f in httpfs ducklake json postgres_scanner; do \
[ -s "/build/duckdb-extensions/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/$f.duckdb_extension" ] \
|| { echo "ERROR: $f.duckdb_extension is empty after fetch" >&2; exit 1; }; \
done
Expand Down
4 changes: 1 addition & 3 deletions Dockerfile.worker
Original file line number Diff line number Diff line change
Expand Up @@ -100,9 +100,7 @@ RUN : "${DUCKDB_EXTENSION_VERSION:?must be set}" \
| gunzip > "/build/duckdb-extensions/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/json.duckdb_extension" \
&& curl -fsSL "${POSTGRES_SCANNER_REPOSITORY}/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/postgres_scanner.duckdb_extension.gz" \
| gunzip > "/build/duckdb-extensions/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/postgres_scanner.duckdb_extension" \
&& curl -fsSL "${DUCKDB_EXTENSION_REPOSITORY}/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/iceberg.duckdb_extension.gz" \
| gunzip > "/build/duckdb-extensions/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/iceberg.duckdb_extension" \
&& for f in httpfs ducklake json postgres_scanner iceberg; do \
&& for f in httpfs ducklake json postgres_scanner; do \
[ -s "/build/duckdb-extensions/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/$f.duckdb_extension" ] \
|| { echo "ERROR: $f.duckdb_extension is empty after fetch" >&2; exit 1; }; \
done
Expand Down
11 changes: 5 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -121,7 +121,6 @@ LIMIT 20;
- [Performance Harness](docs/perf-harness-runbook.md): Local smoke and nightly operations for performance testing.
- [Control Plane Rollout](docs/runbooks/control-plane-rollout.md): Zero-downtime deployment process for the control plane itself.
- [Managed Warehouse Deprovision](docs/runbooks/managed-warehouse-deprovision.md): Destructive teardown process for managed warehouse infrastructure and org cleanup.
- [Lakekeeper Iceberg Catalog](docs/runbooks/lakekeeper-iceberg-catalog.md): Per-org Lakekeeper Iceberg REST catalog backend — architecture, the no-vending credential model, activation, and the `ACCESS_DELEGATION_MODE 'none'` gotcha.

## Quick Start

Expand Down Expand Up @@ -324,12 +323,12 @@ teardown), so you can build a provisioning funnel and alert on failures.

| Event | Fires when | Properties |
| --- | --- | --- |
| `warehouse_provision_begin` | Provisioning accepted by the admin API (warehouse not usable yet) | `database_name`, `metadata_store`, `ducklake_enabled`, `iceberg_enabled` |
| `warehouse_provision_success` | Warehouse reaches Ready and is usable (provisioner controller) | `metadata_store`, `ducklake_enabled`, `iceberg_enabled` |
| `warehouse_provision_failed` | Warehouse reaches Failed (provisioner controller) | `metadata_store`, `ducklake_enabled`, `iceberg_enabled`, `reason` (`provisioning_timeout`/`crossplane_sync_failure`) |
| `warehouse_provision_begin` | Provisioning accepted by the admin API (warehouse not usable yet) | `database_name`, `metadata_store`, `ducklake_enabled` |
| `warehouse_provision_success` | Warehouse reaches Ready and is usable (provisioner controller) | `metadata_store`, `ducklake_enabled` |
| `warehouse_provision_failed` | Warehouse reaches Failed (provisioner controller) | `metadata_store`, `ducklake_enabled`, `reason` (`provisioning_timeout`/`crossplane_sync_failure`) |
| `warehouse_deprovision_begin` | Deprovisioning accepted by the admin API (teardown not finished yet) | — |
| `warehouse_deprovision_success` | All underlying resources deleted (provisioner controller) | — |
| `warehouse_deprovision_failed` | A teardown attempt failed (provisioner controller) | `reason` (`duckling_delete_failed`/`lakekeeper_teardown_failed`) |
| `warehouse_deprovision_failed` | A teardown attempt failed (provisioner controller) | `reason` (`duckling_delete_failed`) |
| `warehouse_password_reset` | An org's root password is reset (admin API) | `username` |
| `query_initiated` | A client query begins execution | `user`, `trace_id` |
| `query_failed` | A query errors | `user`, `trace_id`, `error_code` (SQLSTATE), `error_category` (`user`/`system`/`conflict`/`metadata_connection_lost`) |
Expand Down Expand Up @@ -775,7 +774,7 @@ Managed-warehouse contract notes:

- At most one managed-warehouse row exists per team. The row may be absent before first provisioning or after cleanup, but there is never more than one active warehouse contract for a team.
- The admin API exposes that contract at `GET /api/v1/teams/:name/warehouse` and `PUT /api/v1/teams/:name/warehouse`. Team list/get responses also include a nested `warehouse` object when present.
- User rows support an optional `default_catalog` field on `POST /api/v1/users` and `PUT /api/v1/orgs/:id/users/:username`. The default is empty, which preserves the standard DuckLake-first session behavior. Set `default_catalog` to `iceberg` for users whose sessions should resolve schema-qualified names and compatibility metadata through the Iceberg catalog by default; a client-supplied startup `search_path` still takes precedence.
- User rows support an optional `default_catalog` field on `POST /api/v1/users` and `PUT /api/v1/orgs/:id/users/:username`. The default is empty, which preserves the standard DuckLake session behavior. `ducklake` is the only non-empty value accepted.
- The typed sections are `warehouse_database`, `metadata_store`, `s3`, `worker_identity`, and structured secret refs for `warehouse_database_credentials`, `metadata_store_credentials`, `s3_credentials`, and `runtime_config`. In shared worker mode, every non-empty secret ref must store an explicit `namespace`, and it must match `worker_identity.namespace`.
- Secret references only are stored in the config store. Secret material remains outside the database.
- The provisioning fields are stored directly on the warehouse row as overall `state` / `status_message`, per-resource `*_state` / `*_status_message`, plus `ready_at` and `failed_at`.
Expand Down
25 changes: 0 additions & 25 deletions cmd/duckgres-worker/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,6 @@ import (
"github.com/posthog/duckgres/duckdbservice"
"github.com/posthog/duckgres/internal/cliboot"
"github.com/posthog/duckgres/server"
"github.com/posthog/duckgres/server/lakekeeperbroker"
)

// Each duckgres binary owns its own package-main version/commit/date because
Expand Down Expand Up @@ -79,9 +78,6 @@ func main() {
duckLakeDeltaCatalogEnabled := flag.Bool("ducklake-delta-catalog-enabled", true, "Attach a Delta Lake catalog during DuckLake worker boot (default true; use --ducklake-delta-catalog-enabled=false to disable; env: DUCKGRES_DUCKLAKE_DELTA_CATALOG_ENABLED)")
duckLakeDeltaCatalogPath := flag.String("ducklake-delta-catalog-path", "", "Delta Lake catalog/table path to attach (env: DUCKGRES_DUCKLAKE_DELTA_CATALOG_PATH)")
duckLakeDefaultSpecVersion := flag.String("ducklake-default-spec-version", "", "Default DuckLake spec version for migration checks (env: DUCKGRES_DUCKLAKE_DEFAULT_SPEC_VERSION)")
icebergEnabled := flag.Bool("iceberg-enabled", false, "Attach a per-tenant Iceberg catalog (Lakekeeper REST) at session init (env: DUCKGRES_ICEBERG_ENABLED)")
icebergRegion := flag.String("iceberg-region", "", "AWS region for S3 object access by Iceberg (env: DUCKGRES_ICEBERG_REGION)")
icebergNamespace := flag.String("iceberg-namespace", "", "Default Iceberg namespace (env: DUCKGRES_ICEBERG_NAMESPACE)")

// Query log
queryLog := flag.Bool("query-log", true, "Enable/disable DuckLake query log (use --query-log=false to disable; env: DUCKGRES_QUERY_LOG_ENABLED)")
Expand Down Expand Up @@ -206,9 +202,6 @@ func main() {
DuckLakeDeltaCatalogEnabled: *duckLakeDeltaCatalogEnabled,
DuckLakeDeltaCatalogPath: *duckLakeDeltaCatalogPath,
DuckLakeDefaultSpecVersion: *duckLakeDefaultSpecVersion,
IcebergEnabled: *icebergEnabled,
IcebergRegion: *icebergRegion,
IcebergNamespace: *icebergNamespace,
QueryLog: *queryLog,
}, os.Getenv, func(msg string) {
slog.Warn(msg)
Expand Down Expand Up @@ -248,23 +241,6 @@ func main() {
os.Exit(1)
}

// Lakekeeper OIDC SA-token broker. Started when DUCKGRES_LAKEKEEPER_TOKEN_PATH
// is set (typically by the duckling pod spec, mounting a projected SA
// token volume at /var/run/secrets/lakekeeper/token). DuckDB's iceberg
// extension POSTs to OAUTH2_SERVER_URI=http://127.0.0.1:9876/token to
// fetch a bearer; the broker reads the projected file and hands it back.
// When the env var is unset, no broker is started — preserves the
// allowall + NetworkPolicy posture for clusters that haven't enabled
// OIDC on Lakekeeper yet.
if tokenPath := configloader.Env("DUCKGRES_LAKEKEEPER_TOKEN_PATH", ""); tokenPath != "" {
broker := lakekeeperbroker.New(tokenPath)
if err := broker.ListenAndServe(lakekeeperbroker.DefaultAddr); err != nil {
fmt.Fprintf(os.Stderr, "Failed to start lakekeeper broker on %s: %s\n", lakekeeperbroker.DefaultAddr, err)
os.Exit(1)
}
slog.Info("Lakekeeper OIDC broker started.", "addr", lakekeeperbroker.DefaultAddr, "token_path", tokenPath)
}

// No initMetrics() here. In control-plane mode all worker pods would
// fight over :9090; the control plane process owns the metrics
// endpoint. The duckdb-service exposes per-session metrics via its own
Expand All @@ -277,4 +253,3 @@ func main() {
MaxSessions: maxSessions,
})
}

10 changes: 0 additions & 10 deletions configloader/file_config.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,6 @@ type FileConfig struct {
RateLimit RateLimitFileConfig `yaml:"rate_limit"`
Extensions []string `yaml:"extensions"`
DuckLake DuckLakeFileConfig `yaml:"ducklake"`
Iceberg IcebergFileConfig `yaml:"iceberg"`
FilePersistence bool `yaml:"file_persistence"`
ProcessIsolation bool `yaml:"process_isolation"`
IdleTimeout string `yaml:"idle_timeout"`
Expand Down Expand Up @@ -106,15 +105,6 @@ type RateLimitFileConfig struct {
MaxConnections int `yaml:"max_connections"`
}

// IcebergFileConfig is the YAML shape for opting a single-tenant duckgres
// instance into Iceberg catalog attachment. In multi-tenant mode the
// equivalent values come from the per-warehouse configstore row, not YAML.
type IcebergFileConfig struct {
Enabled *bool `yaml:"enabled"`
Region string `yaml:"region"`
Namespace string `yaml:"namespace"`
}

type DuckLakeFileConfig struct {
MetadataStore string `yaml:"metadata_store"`
ObjectStore string `yaml:"object_store"`
Expand Down
6 changes: 0 additions & 6 deletions configresolve/cliflags.go
Original file line number Diff line number Diff line change
Expand Up @@ -43,9 +43,6 @@ func RegisterCLIInputsFlags(fs *flag.FlagSet) func() CLIInputs {
duckLakeDeltaCatalogEnabled := fs.Bool("ducklake-delta-catalog-enabled", true, "Attach a Delta Lake catalog during DuckLake worker boot (default true; use --ducklake-delta-catalog-enabled=false to disable; env: DUCKGRES_DUCKLAKE_DELTA_CATALOG_ENABLED)")
duckLakeDeltaCatalogPath := fs.String("ducklake-delta-catalog-path", "", "Delta Lake catalog/table path to attach, defaults to sibling delta/ prefix at DuckLake object-store root (env: DUCKGRES_DUCKLAKE_DELTA_CATALOG_PATH)")
duckLakeDefaultSpecVersion := fs.String("ducklake-default-spec-version", "", "Default DuckLake spec version for migration checks (env: DUCKGRES_DUCKLAKE_DEFAULT_SPEC_VERSION)")
icebergEnabled := fs.Bool("iceberg-enabled", false, "Attach a per-tenant Iceberg catalog (Lakekeeper REST) at session init (env: DUCKGRES_ICEBERG_ENABLED)")
icebergRegion := fs.String("iceberg-region", "", "AWS region for S3 object access by Iceberg (default: us-east-1) (env: DUCKGRES_ICEBERG_REGION)")
icebergNamespace := fs.String("iceberg-namespace", "", "Default Iceberg namespace (informational; default: main) (env: DUCKGRES_ICEBERG_NAMESPACE)")
processMinWorkers := fs.Int("process-min-workers", 0, "Pre-warm worker count at startup for process workers (control-plane mode) (env: DUCKGRES_PROCESS_MIN_WORKERS)")
processMaxWorkers := fs.Int("process-max-workers", 0, "Max process workers, 0=auto-derived (control-plane mode) (env: DUCKGRES_PROCESS_MAX_WORKERS)")
processRetireOnSessionEnd := fs.Bool("process-retire-on-session-end", false, "Retire a process worker immediately after its last session ends instead of keeping it warm for reuse (control-plane mode) (env: DUCKGRES_PROCESS_RETIRE_ON_SESSION_END)")
Expand Down Expand Up @@ -103,9 +100,6 @@ func RegisterCLIInputsFlags(fs *flag.FlagSet) func() CLIInputs {
cli.DuckLakeDeltaCatalogEnabled = *duckLakeDeltaCatalogEnabled
cli.DuckLakeDeltaCatalogPath = *duckLakeDeltaCatalogPath
cli.DuckLakeDefaultSpecVersion = *duckLakeDefaultSpecVersion
cli.IcebergEnabled = *icebergEnabled
cli.IcebergRegion = *icebergRegion
cli.IcebergNamespace = *icebergNamespace
cli.ProcessMinWorkers = *processMinWorkers
cli.ProcessMaxWorkers = *processMaxWorkers
cli.ProcessRetireOnSessionEnd = *processRetireOnSessionEnd
Expand Down
Loading
Loading