Support BigQuery nested STRUCT fields in anomaly tests#1012
Conversation
Allows column_anomalies and dimension_anomalies to reference nested STRUCT leaves on BigQuery (e.g. user.address.city) instead of only top-level columns. A single column-discovery wrapper segment-quotes nested references (`a`.`b`.`c`) and projects the monitored column with a dot-free CTE alias so the path survives into downstream aggregates. Non-nested columns and non-BigQuery adapters are byte-equivalent to today's behaviour. REPEATED ancestors are out of scope (would require UNNEST). test_all_columns_anomalies is unchanged - users opt in by passing column_name=user.address.city explicitly to avoid ballooning the test surface on wide STRUCT schemas.
📝 WalkthroughWalkthroughAdds BigQuery-safe segment quoting, dot-free aliasing, and struct-wrapping helpers; applies them when selecting column monitors, projecting monitored columns and metric expressions, and when building concatenated dimension expressions. ChangesBigQuery Nested Field Support via Safe Aliasing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
👋 @tlangton3 |
|
End-to-end validated against a real BigQuery dataset.
Tested against:
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sql`:
- Around line 10-13: The loop currently excludes only leaves whose own leaf.mode
== 'REPEATED', but needs to exclude any leaf that has a REPEATED ancestor so
downstream UNNESTs aren't missed; change the logic around the col.flatten()
iteration to skip a leaf if any ancestor in its flattened path is REPEATED
(e.g., inspect the leaf's ancestry/path metadata returned by col.flatten() or
augment flatten to return ancestor modes), and only do expanded.append(leaf)
when no ancestor mode == 'REPEATED' (retain the existing reference to
col.flatten(), leaf.mode, and expanded.append in your change).
In `@macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql`:
- Around line 402-423: The macro wrap_column_for_struct_support currently always
includes 'fields': column_obj.fields which breaks non-BigQuery adapters because
dbt's base Column lacks a fields attribute; update the macro to only set the
'fields' key when the attribute exists (e.g. when target.type == 'bigquery' and
column_obj.fields is defined) or use a defined-check (column_obj.fields is
defined) and otherwise omit or set fields to null/empty, ensuring all references
inside the returned dict (name, column, quoted, safe_alias, dtype, data_type,
fields) remain valid for non-BigQuery Column objects.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 281a98fa-e3f9-47ef-b12d-ec7d113d1681
📒 Files selected for processing (3)
macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sqlmacros/edr/data_monitoring/monitors_query/column_monitoring_query.sqlmacros/edr/data_monitoring/monitors_query/dimension_monitoring_query.sql
Address CodeRabbit findings: 1. `BigQueryColumn.flatten()` discards ancestor modes, so a NULLABLE leaf under a REPEATED ancestor still satisfied the previous `leaf.mode != 'REPEATED'` filter. Add `bq_safe_leaf_names` + `_bq_walk_collect`, an ancestor-aware walker that returns only leaves with no REPEATED ancestor in their path. Filter `flatten()` output against this set. 2. `wrap_column_for_struct_support` unconditionally read `column_obj.fields`, which raised on non-BigQuery adapters (base `Column` lacks `fields`). Guard with `column_obj.fields is defined` and default to an empty list, so the wrapper is safe on Snowflake, Postgres, Redshift, etc.
1. Non-nested columns regained their adapter quoting: the wrapper now carries an is_nested flag, safe_alias falls back to Column.quoted, the CTE projection only emits an alias for nested columns, and metric aggregates reference adapter.quote(safe_alias) when nested or .quoted otherwise. Compiled SQL for non-nested columns is byte-identical to master on every adapter (previously the alias and aggregate references were unquoted, breaking reserved-word / quoted-identifier columns). 2. Dimensions are documented as accepting arbitrary SQL expressions, so unconditional backticking on BigQuery broke expression dimensions (e.g. case when ... end). Add bq_is_nested_identifier, which matches only plain dotted identifier paths via modules.re, and gate bq_segment_quote, select_dimensions_columns and the dimension_ prefixing on it. Plain identifiers and expressions pass through byte-identically to master. 3. Restore the explanatory comments in dimension_monitoring_query.sql that were unintentionally stripped; the file is now master plus only the dimension segment-quoting block.
Allows
column_anomaliesanddimension_anomaliesto reference nested STRUCT leaves on BigQuery (e.g.user.address.city) instead of only top-level columns.A single column-discovery wrapper segment-quotes nested references (
`a`.`b`.`c`) and projects the monitored column with a dot-free CTE alias so the path survives into downstream aggregates. Non-nested columns and non-BigQuery adapters compile byte-identically to today's behaviour. REPEATED ancestors are out of scope (would requireUNNEST).test_all_columns_anomaliesis unchanged — users opt in by passingcolumn_name=user.address.cityexplicitly to avoid ballooning the test surface on wide STRUCT schemas.What changes
get_column_obj_and_monitorsflattens BigQuery STRUCT columns viaBigQueryColumn.flatten(), filtered through an ancestor-aware walker (bq_safe_leaf_names) so leaves under REPEATED ancestors are excluded. Each discovered column is wrapped with a dict carrying.name(dotted display form),.quoted(segment-quoted SQL ref),.safe_alias(dot-free identifier) and.is_nested. Top-level STRUCTs are kept alongside their leaves so existingcolumn_name=userbehaviour is preserved.column_monitoring_queryprojects nested columns as<quoted> as <adapter-quoted safe_alias>and references the quoted alias in metric aggregates. Non-nested columns keep today's projection and.quotedreferences untouched, so identifier quoting (reserved words, case-sensitive names) is never lost.bq_is_nested_identifiermatches only plain dotted identifier paths (^\w+(\.\w+)+$) on BigQuery. Sincedimensionsaccepts arbitrary SQL expressions, segment-quoting indimension_monitoring_queryandselect_dimensions_columnsis gated on this predicate — expressions and plain identifiers pass through byte-identically to master.Why two representations
BigQueryColumn.quotedwraps the whole string in one set of backticks, so a flattened nested column's.quotedis`user.address.city`— which BigQuery treats as a single column literally nameduser.address.city. Even with correct segment-quoting, projectingselect user.address.city from tinto a CTE without an alias names the resulting columncity, losing the path. The wrapper exposes both.quoted(segment-quoted source ref) and.safe_alias(dot-free CTE alias) so the projection-alias pattern composes cleanly and downstream macros stay nesting-agnostic. The alias is only emitted when.is_nestedis true; otherwisesafe_aliasmirrors.quoted.Testing
Local validation via
dbt parseand arun-operationharness against the BigQuery adapter confirmed every SQL fingerprint:user.address.city→`user`.`address`.`city`select `user`.`address`.`city` as `user__address__city` from t; non-nested projection byte-identical to master`user__address__city`when nested,.quotedotherwise (byte-identical to master)case when amount > 100 then 'high' end) and dotted expressions (coalesce(user.a, user.b)) pass through unchangedcolumn_name:user.address.city(dotted display preserved for alerts)get_column_data_typeBigQuery dispatch works on the wrapped dict via subscript accessEnd-to-end execution against BigQuery to follow.
Summary by CodeRabbit
Bug Fixes
Refactor