Skip to content

feat(data-warehouse): resolve duckgres data-import schema from team table_suffix#65323

Merged
EDsCODE merged 3 commits into
masterfrom
eric/duckgres-data-imports-suffix
Jun 23, 2026
Merged

feat(data-warehouse): resolve duckgres data-import schema from team table_suffix#65323
EDsCODE merged 3 commits into
masterfrom
eric/duckgres-data-imports-suffix

Conversation

@EDsCODE

@EDsCODE EDsCODE commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Summary

A DuckgresServer is org-scoped and hosts many teams, so the v3 data-import sink writes each team into its own schema — historically posthog_data_imports_team_{team_id}. This threads the team's DuckLakeBackfill.table_suffix (merged #64950 — the same field that names a team's events/persons duckling tables) through the schema name, so one user-chosen identifier governs all of a team's warehouse tables instead of an opaque team id.

suffix unset → posthog_data_imports_team_<team_id>   (today's behavior, unchanged)
suffix set   → posthog_data_imports_<suffix>

Changes

  • New shared resolver posthog.ducklake.common.duckgres_data_imports_schema(team_id) + validate_duckgres_identifier (fail-closed alnum/underscore, mirrors the duckling DAG's _validate_identifier so the suffix is validated identically wherever it hits DuckDB DDL).
  • Wired into the v3 sink processor (_duckgres_schema_name) and the legacy DuckLakeCopyDataImportsWorkflow, so the historical-copy and live-sink paths never diverge on schema for the same team. (The copy workflow already stands down for sink-enabled teams via the flag gate; this keeps non-sink suffix teams consistent too.)

Backward-compatible no-op while all suffixes are NULL — existing teams are unaffected until a suffix is explicitly set.

Follow-ups / call-outs (not in this PR)

  • Suffix change ⇒ re-prime. team_id was immutable; a suffix is not. Changing it moves the schema and orphans the old one. The clean fix is to record the primed schema on DuckgresSinkSchemaState and transition to NEEDS_RESYNC on drift — that depends on the backfill PR (feat(data-warehouse): duckgres sink backfill primer #63144), so it's a tracked follow-up. Until then a change would silently start a new schema.
  • Per-org suffix uniqueness is an assumed invariant (two teams in an org picking the same suffix collide — same as events_<suffix> would). The provision/enable flow (feat(data-warehouse): provision/enable per-team warehouse backfill with named tables #64939) must enforce it.
  • Audit before enabling: any sink-enabled team that already has a non-NULL table_suffix will move schemas on deploy and needs a one-time re-prime — not a silent switch.

Tests

Resolver fallback (no row / NULL / empty) → team-id schema; set → suffixed; unsafe suffix → ValueError; validator accept/reject cases. Processor unit tests pin the resolver via an autouse stub.

🤖 Generated with Claude Code

…able_suffix

A DuckgresServer is org-scoped and hosts many teams, so the v3 sink writes
each team into its own schema — historically posthog_data_imports_team_{id}.
This threads the team's DuckLakeBackfill.table_suffix (merged #64950, the same
field that names its events/persons duckling tables) through the schema name so
one user-chosen identifier governs all of a team's warehouse tables.

- New shared resolver posthog.ducklake.common.duckgres_data_imports_schema():
  NULL/empty suffix -> posthog_data_imports_team_{id} (today's behavior),
  set -> posthog_data_imports_{suffix}, fail-closed validated via
  validate_duckgres_identifier (mirrors the duckling DAG's _validate_identifier).
- Wired into the v3 sink processor (_duckgres_schema_name) and the legacy
  DuckLakeCopyDataImportsWorkflow, so the historical-copy and live-sink paths
  never diverge on schema for the same team.

Backward-compatible no-op while all suffixes are NULL.

Follow-up (depends on backfill #63144): a suffix CHANGE moves the schema and
orphans the old one — detect drift in the DuckgresSinkSchemaState reconciler
and trigger NEEDS_RESYNC rather than silently switching. Tracked separately.

Tests: resolver fallback/set/unsafe-suffix + validator accept/reject; processor
tests pin the resolver via an autouse stub.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Comments Outside Diff (1)

  1. posthog/temporal/ducklake/ducklake_copy_data_imports_workflow.py, line 222-234 (link)

    P2 duckgres_data_imports_schema is called inside the loop with the same inputs.team_id on every iteration, so it fires one DB query per schema_id when one query before the loop would suffice. All schemas in the batch belong to the same team.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "feat(data-warehouse): resolve duckgres d..." | Re-trigger Greptile

Comment thread posthog/ducklake/common.py
@veria-ai

veria-ai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

PR overview

All previously flagged issues have been addressed. No open security concerns remain on this pull request.

Security review

No open security issues remain on this pull request.

Fixed/addressed: 1 · PR risk: 0/10

@EDsCODE

EDsCODE commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator Author

Tracked follow-up for the suffix-change / re-prime hazard noted in the description: #65331 (NEEDS_RESYNC on schema drift). Gated on this PR + the backfill state machine (#63144).

…imports-suffix

# Conflicts:
#	posthog/ducklake/common.py
if not suffix:
return f"posthog_data_imports_team_{team_id}"
validate_duckgres_identifier(suffix)
return f"posthog_data_imports_{suffix}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Cross-team schema collision

DuckgresServer is org-scoped, but this resolver maps a user-controlled DuckLakeBackfill.table_suffix directly to a shared schema name without preserving team_id or checking uniqueness. A team that can set its suffix to another team’s suffix, or to team_<id> for a legacy schema, can make its import/copy jobs create or replace tables in that other team’s data-import schema, breaking tenant isolation inside the org-scoped warehouse.

Prompt To Fix With AI
Keep team isolation in duckgres data-import schemas. Either keep `team_id` in the schema name even when a suffix is set (for example `posthog_data_imports_team_<team_id>_<suffix>`), or enforce and verify uniqueness of `DuckLakeBackfill.table_suffix` within each organization/DuckgresServer before it can ever be used here, including preventing collisions with legacy `posthog_data_imports_team_<id>` names. Update all writers/readers consistently and add tests covering cross-team collision attempts.

Severity: high | Confidence: 96% | React with 👍 if useful or 👎 if not

duckgres_data_imports_schema(inputs.team_id) was called inside the per-schema
loop in prepare_data_imports_ducklake_metadata_activity, issuing one identical
DuckLakeBackfill lookup per schema_id. All schemas in a copy batch belong to the
same team, so the result is loop-invariant — hoist it above the loop into a
single query. Behavior is unchanged (same resolved schema name); only the
redundant queries are removed.

Addresses Greptile review comment on PR #65323.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@EDsCODE EDsCODE merged commit 6396064 into master Jun 23, 2026
224 checks passed
@EDsCODE EDsCODE deleted the eric/duckgres-data-imports-suffix branch June 23, 2026 18:33
@deployment-status-posthog

deployment-status-posthog Bot commented Jun 23, 2026

Copy link
Copy Markdown

Deploy status

Environment Status Deployed At Workflow
dev ✅ Deployed 2026-06-23 20:42 UTC Run
prod-us ✅ Deployed 2026-06-23 20:57 UTC Run
prod-eu ✅ Deployed 2026-06-23 21:09 UTC Run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants