Support BigQuery nested STRUCT fields in anomaly tests#1012

Open
tlangton3 wants to merge 3 commits into
elementary-data:master from
tlangton3:bigquery-nested-struct-support

Conversation

@tlangton3

@tlangton3 tlangton3 commented May 22, 2026

Allows column_anomalies and dimension_anomalies to reference nested STRUCT leaves on BigQuery (e.g. user.address.city) instead of only top-level columns.

A single column-discovery wrapper segment-quotes nested references (`a`.`b`.`c`) and projects the monitored column with a dot-free CTE alias so the path survives into downstream aggregates. Non-nested columns and non-BigQuery adapters compile byte-identically to today's behaviour. REPEATED ancestors are out of scope (would require UNNEST). test_all_columns_anomalies is unchanged — users opt in by passing column_name=user.address.city explicitly to avoid ballooning the test surface on wide STRUCT schemas.

What changes

  • get_column_obj_and_monitors flattens BigQuery STRUCT columns via BigQueryColumn.flatten(), filtered through an ancestor-aware walker (bq_safe_leaf_names) so leaves under REPEATED ancestors are excluded. Each discovered column is wrapped in a dict carrying .name (dotted display form), .quoted (segment-quoted SQL ref), .safe_alias (dot-free identifier) and .is_nested. Top-level STRUCTs are kept alongside their leaves so existing column_name=user behaviour is preserved.
  • column_monitoring_query projects nested columns as <quoted> as <adapter-quoted safe_alias> and references the quoted alias in metric aggregates. Non-nested columns keep today's projection and .quoted references untouched, so identifier quoting (reserved words, case-sensitive names) is never lost.
  • bq_is_nested_identifier matches only plain dotted identifier paths (^\w+(\.\w+)+$) on BigQuery. Since dimensions accepts arbitrary SQL expressions, segment-quoting in dimension_monitoring_query and select_dimensions_columns is gated on this predicate — expressions and plain identifiers pass through byte-identically to master.
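The wrapper described in the first bullet can be modelled roughly as follows. This is a minimal Python sketch, not the PR's Jinja macros: the function names are hypothetical stand-ins for bq_segment_quote, bq_safe_alias and wrap_column_for_struct_support, the double-underscore alias is inferred from the `user__address__city` fingerprint listed under Testing, and backtick quoting stands in for the adapter's own quoting of non-nested columns.

```python
# Hypothetical Python model of the column wrapper; the real logic lives in Jinja macros.

def segment_quote(path: str) -> str:
    """Quote each dotted segment separately: a.b.c -> `a`.`b`.`c`."""
    return ".".join(f"`{segment}`" for segment in path.split("."))


def safe_alias(path: str) -> str:
    """Build a dot-free identifier (assumed separator: double underscore)."""
    return path.replace(".", "__")


def wrap_column(name: str) -> dict:
    """Return the dict shape described above: name, quoted, safe_alias, is_nested."""
    is_nested = "." in name
    # Assumption: a non-nested column keeps its ordinary adapter quoting.
    quoted = segment_quote(name) if is_nested else f"`{name}`"
    return {
        "name": name,            # dotted display form, stored for alerts
        "quoted": quoted,        # SQL reference into the source table
        # For non-nested columns the alias simply mirrors .quoted.
        "safe_alias": safe_alias(name) if is_nested else quoted,
        "is_nested": is_nested,
    }


if __name__ == "__main__":
    print(wrap_column("user.address.city"))
    print(wrap_column("amount"))
```

Keeping all four fields on one object means downstream code never has to re-parse the dotted name to decide how to quote it.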

Why two representations

BigQueryColumn.quoted wraps the whole string in one set of backticks, so a flattened nested column's .quoted is `user.address.city` — which BigQuery treats as a single column literally named user.address.city. Even with correct segment-quoting, projecting select user.address.city from t into a CTE without an alias names the resulting column city, losing the path. The wrapper exposes both .quoted (segment-quoted source ref) and .safe_alias (dot-free CTE alias) so the projection-alias pattern composes cleanly and downstream macros stay nesting-agnostic. The alias is only emitted when .is_nested is true; otherwise safe_alias mirrors .quoted.
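To make the two failure modes concrete, here is a small Python sketch of the projection-alias pattern. The helper names, the table name `t`, and the `count(...)` aggregate are illustrative assumptions; the actual macros emit more elaborate SQL.

```python
# Hypothetical wrapped columns, shaped like the dict the PR describes.
NESTED = {
    "name": "user.address.city",
    "quoted": "`user`.`address`.`city`",
    "safe_alias": "user__address__city",
    "is_nested": True,
}
FLAT = {
    "name": "amount",
    "quoted": "`amount`",
    "safe_alias": "`amount`",
    "is_nested": False,
}


def whole_string_quote(name: str) -> str:
    """Quoting the whole string at once, which yields one oddly named column."""
    return f"`{name}`"


def projection(col: dict) -> str:
    """Only nested columns get an explicit alias in the CTE projection."""
    if col["is_nested"]:
        return f"{col['quoted']} as `{col['safe_alias']}`"
    return col["quoted"]


def aggregate_ref(col: dict) -> str:
    """Downstream aggregates use the quoted alias when nested, .quoted otherwise."""
    return f"`{col['safe_alias']}`" if col["is_nested"] else col["quoted"]


def monitoring_sql(col: dict, table: str = "t") -> str:
    """Illustrative query: project in a CTE, then aggregate over the projection."""
    return (
        f"with monitored as (select {projection(col)} from {table}) "
        f"select count({aggregate_ref(col)}) from monitored"
    )


if __name__ == "__main__":
    print(whole_string_quote("user.address.city"))  # the form that does not work
    print(monitoring_sql(NESTED))
    print(monitoring_sql(FLAT))
```

Because the alias is emitted only for nested columns, the non-nested query text contains no alias at all, which is how the byte-identical claim can hold.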

Testing

Local validation via dbt parse and a run-operation harness against the BigQuery adapter confirmed every SQL fingerprint:

  • Segment-quoting: user.address.city → `user`.`address`.`city`
  • Projection: select `user`.`address`.`city` as `user__address__city` from t; non-nested projection byte-identical to master
  • Downstream aggregate references `user__address__city` when nested, .quoted otherwise (byte-identical to master)
  • Expression dimensions (case when amount > 100 then 'high' end) and dotted expressions (coalesce(user.a, user.b)) pass through unchanged
  • Stored column_name: user.address.city (dotted display preserved for alerts)
  • get_column_data_type BigQuery dispatch works on the wrapped dict via subscript access

End-to-end execution against BigQuery to follow.

Summary by CodeRabbit

  • Bug Fixes

    • Better handling of nested/struct fields in BigQuery so monitors correctly detect and report on dotted/nested column leaf values.
    • Safer column and dimension aliasing to avoid invalid identifiers in monitoring outputs.
  • Refactor

    • Reworked monitor selection and dimension concatenation logic for more reliable results with structured data types and complex naming.

Allows column_anomalies and dimension_anomalies to reference nested STRUCT
leaves on BigQuery (e.g. user.address.city) instead of only top-level
columns.

A single column-discovery wrapper segment-quotes nested references
(`a`.`b`.`c`) and projects the monitored column with a dot-free CTE alias
so the path survives into downstream aggregates. Non-nested columns and
non-BigQuery adapters are byte-equivalent to today's behaviour. REPEATED
ancestors are out of scope (would require UNNEST).
test_all_columns_anomalies is unchanged - users opt in by passing
column_name=user.address.city explicitly to avoid ballooning the test
surface on wide STRUCT schemas.
@coderabbitai

coderabbitai Bot commented May 22, 2026

📝 Walkthrough

Adds BigQuery-safe segment quoting, dot-free aliasing, and struct-wrapping helpers; applies them when selecting column monitors, projecting monitored columns and metric expressions, and when building concatenated dimension expressions.

Changes

BigQuery Nested Field Support via Safe Aliasing

| Layer / File(s) | Summary |
| --- | --- |
| **Helper macros for safe BigQuery column handling** `macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql` | Adds bq_segment_quote, bq_safe_alias, wrap_column_for_struct_support, plus bq_safe_leaf_names and _bq_walk_collect for STRUCT leaf discovery; overhauls select_dimensions_columns to segment-quote sources and generate dot-free alias suffixes for nested fields. |
| **Column monitoring query integration** `macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql` | column_monitoring_query now projects monitored columns using column_obj.safe_alias when nested and uses that alias for metric expressions; prefixed_dimensions builds dimension_* aliases with bq_safe_alias(). |
| **Column monitor configuration wrapping** `macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sql` | get_column_obj_and_monitors and get_all_column_obj_and_monitors wrap column_obj via wrap_column_for_struct_support before deriving data types and selecting monitors; returned column values are the wrapped objects. |
| **Dimension monitoring query updates** `macros/edr/data_monitoring/monitors_query/dimension_monitoring_query.sql` | Builds concatenated dimension expressions using bq_segment_quote per segment and joins them with "; " instead of joining raw dimension strings. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through dotted fields with glee,
Quoted each segment so queries run free,
Dots turned to underscores, tidy and bright,
Wrapped structs now yield metrics just right,
A small rabbit cheer for safer SQL tonight!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding support for BigQuery nested STRUCT fields in anomaly tests, which is the primary objective across all three modified files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

@github-actions

Contributor

👋 @tlangton3
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in the elementary repository.

@tlangton3 tlangton3 requested a deployment to elementary_test_env May 22, 2026 10:36 — with GitHub Actions Waiting
@tlangton3

Author

End-to-end validated against a real BigQuery dataset.

  • column_anomalies on a three-level nested STRUCT field (<parent>.<intermediate>.<leaf>) compiles with segment-quoted SQL, executes against real data, and writes a row to data_monitoring_metrics with the dotted column_name preserved.
  • Discovery layer correctly flattens parent STRUCTs via BigQueryColumn.flatten(); the wrapper exposes .name (dotted display), .quoted (segment-quoted SQL ref), and .safe_alias (dot-free CTE alias) as designed.
  • Ran the new nested test alongside 10+ existing non-nested column_anomalies tests in a single dbt test invocation — all 15 PASS with no interference, confirming the projection-alias pattern is backwards-compatible.
  • Re-ran with --defer --favor-state against a prod manifest so the non-nested tests had data and history; metrics for nested and non-nested columns land in data_monitoring_metrics and elementary_test_results with identical schema. The dotted column_name is just a longer string in an otherwise unchanged structure.
  • elementary.on_run_end upload hook works unchanged with the override — metric history persists correctly.

Tested against:

  • dbt-core 1.11.8 / dbt-bigquery 1.11.1
  • elementary package version 0.23.x (this branch)

@tlangton3 tlangton3 marked this pull request as ready for review May 22, 2026 13:51

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sql`:
- Around line 10-13: The loop currently excludes only leaves whose own leaf.mode
== 'REPEATED', but needs to exclude any leaf that has a REPEATED ancestor so
downstream UNNESTs aren't missed; change the logic around the col.flatten()
iteration to skip a leaf if any ancestor in its flattened path is REPEATED
(e.g., inspect the leaf's ancestry/path metadata returned by col.flatten() or
augment flatten to return ancestor modes), and only do expanded.append(leaf)
when no ancestor mode == 'REPEATED' (retain the existing reference to
col.flatten(), leaf.mode, and expanded.append in your change).

In `@macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql`:
- Around line 402-423: The macro wrap_column_for_struct_support currently always
includes 'fields': column_obj.fields which breaks non-BigQuery adapters because
dbt's base Column lacks a fields attribute; update the macro to only set the
'fields' key when the attribute exists (e.g. when target.type == 'bigquery' and
column_obj.fields is defined) or use a defined-check (column_obj.fields is
defined) and otherwise omit or set fields to null/empty, ensuring all references
inside the returned dict (name, column, quoted, safe_alias, dtype, data_type,
fields) remain valid for non-BigQuery Column objects.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 281a98fa-e3f9-47ef-b12d-ec7d113d1681

📥 Commits

Reviewing files that changed from the base of the PR and between ab1a10b and d45a775.

📒 Files selected for processing (3)
  • macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sql
  • macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql
  • macros/edr/data_monitoring/monitors_query/dimension_monitoring_query.sql

Address CodeRabbit findings:

1. `BigQueryColumn.flatten()` discards ancestor modes, so a NULLABLE leaf
   under a REPEATED ancestor still satisfied the previous `leaf.mode !=
   'REPEATED'` filter. Add `bq_safe_leaf_names` + `_bq_walk_collect`, an
   ancestor-aware walker that returns only leaves with no REPEATED
   ancestor in their path. Filter `flatten()` output against this set.

2. `wrap_column_for_struct_support` unconditionally read `column_obj.fields`,
   which raised on non-BigQuery adapters (base `Column` lacks `fields`).
   Guard with `column_obj.fields is defined` and default to an empty list,
   so the wrapper is safe on Snowflake, Postgres, Redshift, etc.
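Both fixes can be illustrated with a short Python sketch. The `Field` dataclass is a hypothetical stand-in for a BigQuery schema field, and `safe_leaf_names` only approximates what bq_safe_leaf_names and _bq_walk_collect are described as doing. It also assumes a leaf that is itself REPEATED stays excluded, in line with the earlier `leaf.mode != 'REPEATED'` filter.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """Hypothetical stand-in for a BigQuery schema field."""
    name: str
    field_type: str = "STRING"
    mode: str = "NULLABLE"
    fields: list = field(default_factory=list)


def child_fields(node) -> list:
    """Finding 2: tolerate objects without a `fields` attribute (non-BigQuery columns)."""
    return getattr(node, "fields", None) or []


def safe_leaf_names(columns) -> list:
    """Finding 1: collect dotted leaf paths that have no REPEATED node on the way down."""
    leaves = []

    def walk(node, prefix):
        if node.mode == "REPEATED":
            # Everything at or below a REPEATED node would need UNNEST, so stop here.
            return
        path = f"{prefix}.{node.name}" if prefix else node.name
        children = child_fields(node)
        if not children:
            leaves.append(path)
            return
        for child in children:
            walk(child, path)

    for column in columns:
        walk(column, "")
    return leaves


if __name__ == "__main__":
    schema = [
        Field("user", "RECORD", fields=[
            Field("address", "RECORD", fields=[Field("city")]),
            # NULLABLE leaf under a REPEATED ancestor: must be skipped.
            Field("orders", "RECORD", "REPEATED", fields=[Field("sku")]),
        ]),
        Field("amount", "NUMERIC"),
    ]
    print(safe_leaf_names(schema))
```

Tracking the ancestor mode during the walk is what a flat list of leaves cannot provide, since each flattened leaf only reports its own mode.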
@tlangton3 tlangton3 requested a deployment to elementary_test_env May 22, 2026 14:14 — with GitHub Actions Waiting
1. Non-nested columns regained their adapter quoting: the wrapper now
   carries an is_nested flag, safe_alias falls back to Column.quoted, the
   CTE projection only emits an alias for nested columns, and metric
   aggregates reference adapter.quote(safe_alias) when nested or .quoted
   otherwise. Compiled SQL for non-nested columns is byte-identical to
   master on every adapter (previously the alias and aggregate references
   were unquoted, breaking reserved-word / quoted-identifier columns).

2. Dimensions are documented as accepting arbitrary SQL expressions, so
   unconditional backticking on BigQuery broke expression dimensions
   (e.g. case when ... end). Add bq_is_nested_identifier, which matches
   only plain dotted identifier paths via modules.re, and gate
   bq_segment_quote, select_dimensions_columns and the dimension_
   prefixing on it. Plain identifiers and expressions pass through
   byte-identically to master.

3. Restore the explanatory comments in dimension_monitoring_query.sql
   that were unintentionally stripped; the file is now master plus only
   the dimension segment-quoting block.
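The gating in item 2 can be sketched in Python as below. The function names and the `adapter_type` parameter are hypothetical; only the regular expression comes from the PR description.

```python
import re

# Plain dotted identifier path, as given in the PR description: ^\w+(\.\w+)+$
NESTED_IDENTIFIER = re.compile(r"\w+(\.\w+)+")


def is_nested_identifier(dimension: str, adapter_type: str = "bigquery") -> bool:
    """True only for dotted identifier paths on BigQuery."""
    return adapter_type == "bigquery" and NESTED_IDENTIFIER.fullmatch(dimension) is not None


def quote_dimension(dimension: str, adapter_type: str = "bigquery") -> str:
    """Segment-quote nested paths; pass everything else through untouched."""
    if is_nested_identifier(dimension, adapter_type):
        return ".".join(f"`{segment}`" for segment in dimension.split("."))
    return dimension


if __name__ == "__main__":
    for dim in [
        "user.address.city",
        "country",
        "case when amount > 100 then 'high' end",
        "coalesce(user.a, user.b)",
    ]:
        print(f"{dim!r} -> {quote_dimension(dim)!r}")
```

Matching the whole string rather than searching inside it is what keeps expressions such as `coalesce(user.a, user.b)` out of the quoting path, even though they contain dotted references.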
@tlangton3 tlangton3 requested a deployment to elementary_test_env June 11, 2026 13:42 — with GitHub Actions Waiting