Support BigQuery nested STRUCT fields in anomaly tests by tlangton3 · Pull Request #1012 · elementary-data/dbt-data-reliability

tlangton3 · 2026-05-22T10:35:56Z

Allows column_anomalies and dimension_anomalies to reference nested STRUCT leaves on BigQuery (e.g. user.address.city) instead of only top-level columns.

A single column-discovery wrapper segment-quotes nested references (`a`.`b`.`c`) and projects the monitored column with a dot-free CTE alias so the path survives into downstream aggregates. Non-nested columns and non-BigQuery adapters compile byte-identically to today's behaviour. REPEATED ancestors are out of scope (would require UNNEST). test_all_columns_anomalies is unchanged — users opt in by passing column_name=user.address.city explicitly to avoid ballooning the test surface on wide STRUCT schemas.

What changes

- `get_column_obj_and_monitors` flattens BigQuery STRUCT columns via `BigQueryColumn.flatten()`, filtered through an ancestor-aware walker (`bq_safe_leaf_names`) so leaves under REPEATED ancestors are excluded. Each discovered column is wrapped with a dict carrying `.name` (dotted display form), `.quoted` (segment-quoted SQL ref), `.safe_alias` (dot-free identifier) and `.is_nested`. Top-level STRUCTs are kept alongside their leaves so existing `column_name=user` behaviour is preserved.
- `column_monitoring_query` projects nested columns as `<quoted> as <adapter-quoted safe_alias>` and references the quoted alias in metric aggregates. Non-nested columns keep today's projection and `.quoted` references untouched, so identifier quoting (reserved words, case-sensitive names) is never lost.
- `bq_is_nested_identifier` matches only plain dotted identifier paths (`^\w+(\.\w+)+$`) on BigQuery. Since dimensions accepts arbitrary SQL expressions, segment-quoting in `dimension_monitoring_query` and `select_dimensions_columns` is gated on this predicate, so expressions and plain identifiers pass through byte-identically to master.
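The gating described above can be illustrated outside Jinja. The following is a minimal Python sketch, not the macros from this PR: the function names are hypothetical stand-ins for `bq_is_nested_identifier`, `bq_segment_quote` and `bq_safe_alias`, the adapter check is omitted, and the double-underscore alias form is assumed from the `user__address__city` example in the Testing section.

```python
import re

# Pattern quoted in the PR description: plain dotted identifier paths only.
_NESTED_IDENTIFIER = re.compile(r"^\w+(\.\w+)+$")


def is_nested_identifier(expr: str) -> bool:
    """Return True only for dotted paths such as user.address.city."""
    return _NESTED_IDENTIFIER.match(expr) is not None


def segment_quote(expr: str) -> str:
    """Backtick each path segment; pass anything else through unchanged."""
    if not is_nested_identifier(expr):
        return expr
    return ".".join(f"`{segment}`" for segment in expr.split("."))


def safe_alias(expr: str) -> str:
    """Build a dot-free alias for nested paths (assumed '__' separator)."""
    if not is_nested_identifier(expr):
        return expr
    return expr.replace(".", "__")


if __name__ == "__main__":
    for candidate in ("user.address.city", "amount", "coalesce(user.a, user.b)"):
        print(candidate, "->", segment_quote(candidate), "/", safe_alias(candidate))
```

Because the predicate requires at least one dot and only word characters between dots, a plain identifier such as `amount` and an expression such as `coalesce(user.a, user.b)` are both returned untouched, which is the pass-through property the PR relies on for dimensions.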

Why two representations

BigQueryColumn.quoted wraps the whole string in one set of backticks, so a flattened nested column's .quoted is `user.address.city` — which BigQuery treats as a single column literally named user.address.city. Even with correct segment-quoting, projecting select user.address.city from t into a CTE without an alias names the resulting column city, losing the path. The wrapper exposes both .quoted (segment-quoted source ref) and .safe_alias (dot-free CTE alias) so the projection-alias pattern composes cleanly and downstream macros stay nesting-agnostic. The alias is only emitted when .is_nested is true; otherwise safe_alias mirrors .quoted.
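To make the two representations concrete, here is a small Python sketch of a wrapper and the projection built from it. It is an illustration under stated assumptions rather than the PR's code: the real wrapper is a dict produced by a Jinja macro, the real quoting goes through the adapter, and `WrappedColumn`, `wrap_column`, `projection` and `aggregate_ref` are hypothetical names. Backtick quoting and the `__` separator are assumed from the examples in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WrappedColumn:
    name: str        # dotted display form, e.g. user.address.city
    quoted: str      # SQL reference to the source column
    safe_alias: str  # dot-free alias when nested, otherwise mirrors quoted
    is_nested: bool


def wrap_column(name: str) -> WrappedColumn:
    """Derive both representations from a dotted column name."""
    segments = name.split(".")
    quoted = ".".join(f"`{segment}`" for segment in segments)
    nested = len(segments) > 1
    alias = "__".join(segments) if nested else quoted
    return WrappedColumn(name=name, quoted=quoted, safe_alias=alias, is_nested=nested)


def projection(col: WrappedColumn) -> str:
    """Select-list entry for the CTE: alias only when the column is nested."""
    if col.is_nested:
        return f"{col.quoted} as `{col.safe_alias}`"
    return col.quoted


def aggregate_ref(col: WrappedColumn) -> str:
    """Reference used by downstream metric aggregates."""
    return f"`{col.safe_alias}`" if col.is_nested else col.quoted


if __name__ == "__main__":
    nested = wrap_column("user.address.city")
    print(f"select {projection(nested)} from t")
    print(f"count({aggregate_ref(nested)})")
```

Keeping the alias decision inside the wrapper means the downstream query builder only asks whether a column is nested and never has to parse the path itself.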

Testing

Local validation via dbt parse and a run-operation harness against the BigQuery adapter confirmed every SQL fingerprint:

- Segment-quoting: user.address.city → `user`.`address`.`city`
- Projection: select `user`.`address`.`city` as `user__address__city` from t; non-nested projection byte-identical to master
- Downstream aggregate references `user__address__city` when nested, .quoted otherwise (byte-identical to master)
- Expression dimensions (case when amount > 100 then 'high' end) and dotted expressions (coalesce(user.a, user.b)) pass through unchanged
- Stored column_name: user.address.city (dotted display preserved for alerts)
- get_column_data_type BigQuery dispatch works on the wrapped dict via subscript access

End-to-end execution against BigQuery to follow.

Summary by CodeRabbit

Bug Fixes
- Better handling of nested/struct fields in BigQuery so monitors correctly detect and report on dotted/nested column leaf values.
- Safer column and dimension aliasing to avoid invalid identifiers in monitoring outputs.
Refactor
- Reworked monitor selection and dimension concatenation logic for more reliable results with structured data types and complex naming.

coderabbitai · 2026-05-22T10:36:05Z

Walkthrough

Adds BigQuery-safe segment quoting, dot-free aliasing, and struct-wrapping helpers; applies them when selecting column monitors, projecting monitored columns and metric expressions, and when building concatenated dimension expressions.

Changes

BigQuery Nested Field Support via Safe Aliasing

| Layer / File(s) | Summary |
| --- | --- |
| Helper macros for safe BigQuery column handling `macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql` | Adds `bq_segment_quote`, `bq_safe_alias`, `wrap_column_for_struct_support`, plus `bq_safe_leaf_names` and `_bq_walk_collect` for STRUCT leaf discovery; overhauls `select_dimensions_columns` to segment-quote sources and generate dot-free alias suffixes for nested fields. |
| Column monitoring query integration `macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql` | `column_monitoring_query` now projects monitored columns using `column_obj.safe_alias` when nested and uses that alias for metric expressions; `prefixed_dimensions` builds `dimension_*` aliases with `bq_safe_alias()`. |
| Column monitor configuration wrapping `macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sql` | `get_column_obj_and_monitors` and `get_all_column_obj_and_monitors` wrap `column_obj` via `wrap_column_for_struct_support` before deriving data types and selecting monitors; returned column values are the wrapped objects. |
| Dimension monitoring query updates `macros/edr/data_monitoring/monitors_query/dimension_monitoring_query.sql` | Builds concatenated dimension expressions using `bq_segment_quote` per segment and joins them with `"; "` instead of joining raw dimension strings. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through dotted fields with glee,
Quoted each segment so queries run free,
Dots turned to underscores, tidy and bright,
Wrapped structs now yield metrics just right,
A small rabbit cheer for safer SQL tonight!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding support for BigQuery nested STRUCT fields in anomaly tests, which is the primary objective across all three modified files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

github-actions · 2026-05-22T10:36:06Z

👋 @tlangton3
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in the elementary repository.

tlangton3 · 2026-05-22T13:50:48Z

End-to-end validated against a real BigQuery dataset.

- column_anomalies on a three-level nested STRUCT field (<parent>.<intermediate>.<leaf>) compiles with segment-quoted SQL, executes against real data, and writes a row to data_monitoring_metrics with the dotted column_name preserved.
- Discovery layer correctly flattens parent STRUCTs via BigQueryColumn.flatten(); the wrapper exposes .name (dotted display), .quoted (segment-quoted SQL ref), and .safe_alias (dot-free CTE alias) as designed.
- Ran the new nested test alongside 10+ existing non-nested column_anomalies tests in a single dbt test invocation; all 15 PASS with no interference, confirming the projection-alias pattern is backwards-compatible.
- Re-ran with --defer --favor-state against a prod manifest so the non-nested tests had data and history; metrics for nested and non-nested columns land in data_monitoring_metrics and elementary_test_results with identical schema. The dotted column_name is just a longer string in an otherwise unchanged structure.
- elementary.on_run_end upload hook works unchanged with the override, and metric history persists correctly.

Tested against:

dbt-core 1.11.8 / dbt-bigquery 1.11.1
elementary package version 0.23.x (this branch)

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sql`:
- Around line 10-13: The loop currently excludes only leaves whose own leaf.mode
== 'REPEATED', but needs to exclude any leaf that has a REPEATED ancestor so
downstream UNNESTs aren't missed; change the logic around the col.flatten()
iteration to skip a leaf if any ancestor in its flattened path is REPEATED
(e.g., inspect the leaf's ancestry/path metadata returned by col.flatten() or
augment flatten to return ancestor modes), and only do expanded.append(leaf)
when no ancestor mode == 'REPEATED' (retain the existing reference to
col.flatten(), leaf.mode, and expanded.append in your change).

In `@macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql`:
- Around line 402-423: The macro wrap_column_for_struct_support currently always
includes 'fields': column_obj.fields which breaks non-BigQuery adapters because
dbt's base Column lacks a fields attribute; update the macro to only set the
'fields' key when the attribute exists (e.g. when target.type == 'bigquery' and
column_obj.fields is defined) or use a defined-check (column_obj.fields is
defined) and otherwise omit or set fields to null/empty, ensuring all references
inside the returned dict (name, column, quoted, safe_alias, dtype, data_type,
fields) remain valid for non-BigQuery Column objects.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 281a98fa-e3f9-47ef-b12d-ec7d113d1681

📥 Commits

Reviewing files that changed from the base of the PR and between ab1a10b and d45a775.

📒 Files selected for processing (3)

macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sql
macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql
macros/edr/data_monitoring/monitors_query/dimension_monitoring_query.sql

Address CodeRabbit findings:
1. `BigQueryColumn.flatten()` discards ancestor modes, so a NULLABLE leaf under a REPEATED ancestor still satisfied the previous `leaf.mode != 'REPEATED'` filter. Add `bq_safe_leaf_names` + `_bq_walk_collect`, an ancestor-aware walker that returns only leaves with no REPEATED ancestor in their path. Filter `flatten()` output against this set.
2. `wrap_column_for_struct_support` unconditionally read `column_obj.fields`, which raised on non-BigQuery adapters (base `Column` lacks `fields`). Guard with `column_obj.fields is defined` and default to an empty list, so the wrapper is safe on Snowflake, Postgres, Redshift, etc.
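The ancestor-aware filter from the first finding can be sketched in Python. This is a simplified illustration, not the `bq_safe_leaf_names` / `_bq_walk_collect` macros themselves: `SchemaField` is a hypothetical stand-in for a BigQuery schema node with a `mode` and child `fields`, and the sketch assumes that a REPEATED leaf is excluded along with everything beneath a REPEATED node.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SchemaField:
    """Hypothetical stand-in for a BigQuery schema node."""
    name: str
    mode: str = "NULLABLE"
    fields: List["SchemaField"] = field(default_factory=list)


def safe_leaf_names(column: SchemaField) -> Set[str]:
    """Collect dotted paths of leaves that have no REPEATED node on their path."""
    safe: Set[str] = set()

    def walk(node: SchemaField, path: str) -> None:
        if node.mode == "REPEATED":
            # Reading this node or anything below it would need UNNEST.
            return
        if not node.fields:
            safe.add(path)
            return
        for child in node.fields:
            walk(child, f"{path}.{child.name}")

    walk(column, column.name)
    return safe


if __name__ == "__main__":
    user = SchemaField("user", fields=[
        SchemaField("address", fields=[SchemaField("city")]),
        SchemaField("orders", mode="REPEATED", fields=[SchemaField("id")]),
    ])
    flattened = ["user.address.city", "user.orders.id"]  # what a mode-blind flatten might yield
    allowed = safe_leaf_names(user)
    print([leaf for leaf in flattened if leaf in allowed])
```

Stopping the recursion at the first REPEATED node is what distinguishes this from a check on the leaf's own mode: `user.orders.id` is NULLABLE itself but is never reached.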

1. Non-nested columns regained their adapter quoting: the wrapper now carries an is_nested flag, safe_alias falls back to Column.quoted, the CTE projection only emits an alias for nested columns, and metric aggregates reference adapter.quote(safe_alias) when nested or .quoted otherwise. Compiled SQL for non-nested columns is byte-identical to master on every adapter (previously the alias and aggregate references were unquoted, breaking reserved-word / quoted-identifier columns).
2. Dimensions are documented as accepting arbitrary SQL expressions, so unconditional backticking on BigQuery broke expression dimensions (e.g. case when ... end). Add bq_is_nested_identifier, which matches only plain dotted identifier paths via modules.re, and gate bq_segment_quote, select_dimensions_columns and the dimension_ prefixing on it. Plain identifiers and expressions pass through byte-identically to master.
3. Restore the explanatory comments in dimension_monitoring_query.sql that were unintentionally stripped; the file is now master plus only the dimension segment-quoting block.

tlangton3 requested a deployment to elementary_test_env May 22, 2026 10:36 — with GitHub Actions Waiting

tlangton3 marked this pull request as ready for review May 22, 2026 13:51

coderabbitai Bot reviewed May 22, 2026

tlangton3 requested a deployment to elementary_test_env May 22, 2026 14:14 — with GitHub Actions Waiting

tlangton3 requested a deployment to elementary_test_env June 11, 2026 13:42 — with GitHub Actions Waiting

tlangton3 wants to merge 3 commits into elementary-data:master from tlangton3:bigquery-nested-struct-support