feat(data-warehouse): resolve duckgres data-import schema from team table_suffix#65323
Conversation
…able_suffix
A DuckgresServer is org-scoped and hosts many teams, so the v3 sink writes
each team into its own schema — historically posthog_data_imports_team_{id}.
This threads the team's DuckLakeBackfill.table_suffix (merged #64950, the same
field that names its events/persons duckling tables) through the schema name so
one user-chosen identifier governs all of a team's warehouse tables.
- New shared resolver posthog.ducklake.common.duckgres_data_imports_schema():
NULL/empty suffix -> posthog_data_imports_team_{id} (today's behavior),
set -> posthog_data_imports_{suffix}, fail-closed validated via
validate_duckgres_identifier (mirrors the duckling DAG's _validate_identifier).
- Wired into the v3 sink processor (_duckgres_schema_name) and the legacy
DuckLakeCopyDataImportsWorkflow, so the historical-copy and live-sink paths
never diverge on schema for the same team.
Backward-compatible no-op while all suffixes are NULL.
Follow-up (depends on backfill #63144): a suffix CHANGE moves the schema and
orphans the old one — detect drift in the DuckgresSinkSchemaState reconciler
and trigger NEEDS_RESYNC rather than silently switching. Tracked separately.
Tests: resolver fallback/set/unsafe-suffix + validator accept/reject; processor
tests pin the resolver via an autouse stub.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
PR overviewAll previously flagged issues have been addressed. No open security concerns remain on this pull request. Security reviewNo open security issues remain on this pull request. Fixed/addressed: 1 · PR risk: 0/10 |
…imports-suffix # Conflicts: # posthog/ducklake/common.py
| if not suffix: | ||
| return f"posthog_data_imports_team_{team_id}" | ||
| validate_duckgres_identifier(suffix) | ||
| return f"posthog_data_imports_{suffix}" |
There was a problem hiding this comment.
DuckgresServer is org-scoped, but this resolver maps a user-controlled DuckLakeBackfill.table_suffix directly to a shared schema name without preserving team_id or checking uniqueness. A team that can set its suffix to another team’s suffix, or to team_<id> for a legacy schema, can make its import/copy jobs create or replace tables in that other team’s data-import schema, breaking tenant isolation inside the org-scoped warehouse.
Prompt To Fix With AI
Keep team isolation in duckgres data-import schemas. Either keep `team_id` in the schema name even when a suffix is set (for example `posthog_data_imports_team_<team_id>_<suffix>`), or enforce and verify uniqueness of `DuckLakeBackfill.table_suffix` within each organization/DuckgresServer before it can ever be used here, including preventing collisions with legacy `posthog_data_imports_team_<id>` names. Update all writers/readers consistently and add tests covering cross-team collision attempts.Severity: high | Confidence: 96% | React with 👍 if useful or 👎 if not
duckgres_data_imports_schema(inputs.team_id) was called inside the per-schema loop in prepare_data_imports_ducklake_metadata_activity, issuing one identical DuckLakeBackfill lookup per schema_id. All schemas in a copy batch belong to the same team, so the result is loop-invariant — hoist it above the loop into a single query. Behavior is unchanged (same resolved schema name); only the redundant queries are removed. Addresses Greptile review comment on PR #65323. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
A
DuckgresServeris org-scoped and hosts many teams, so the v3 data-import sink writes each team into its own schema — historicallyposthog_data_imports_team_{team_id}. This threads the team'sDuckLakeBackfill.table_suffix(merged #64950 — the same field that names a team's events/persons duckling tables) through the schema name, so one user-chosen identifier governs all of a team's warehouse tables instead of an opaque team id.Changes
posthog.ducklake.common.duckgres_data_imports_schema(team_id)+validate_duckgres_identifier(fail-closed alnum/underscore, mirrors the duckling DAG's_validate_identifierso the suffix is validated identically wherever it hits DuckDB DDL)._duckgres_schema_name) and the legacyDuckLakeCopyDataImportsWorkflow, so the historical-copy and live-sink paths never diverge on schema for the same team. (The copy workflow already stands down for sink-enabled teams via the flag gate; this keeps non-sink suffix teams consistent too.)Backward-compatible no-op while all suffixes are NULL — existing teams are unaffected until a suffix is explicitly set.
Follow-ups / call-outs (not in this PR)
team_idwas immutable; a suffix is not. Changing it moves the schema and orphans the old one. The clean fix is to record the primed schema onDuckgresSinkSchemaStateand transition toNEEDS_RESYNCon drift — that depends on the backfill PR (feat(data-warehouse): duckgres sink backfill primer #63144), so it's a tracked follow-up. Until then a change would silently start a new schema.events_<suffix>would). The provision/enable flow (feat(data-warehouse): provision/enable per-team warehouse backfill with named tables #64939) must enforce it.table_suffixwill move schemas on deploy and needs a one-time re-prime — not a silent switch.Tests
Resolver fallback (no row / NULL / empty) → team-id schema; set → suffixed; unsafe suffix →
ValueError; validator accept/reject cases. Processor unit tests pin the resolver via an autouse stub.🤖 Generated with Claude Code