Fix Iceberg ARRAY columns with dot-separated names returning empty lists #1826

Open
il9ue wants to merge 1 commit into Altinity:antalya-26.3 from il9ue:fix/iceberg-dotted-array-90731-antalya-26.3

il9ue commented May 22, 2026

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix reading Iceberg tables whose ARRAY column names contain a dot (e.g. `a.b` ARRAY<STRING>), which previously returned empty arrays. Two upstream defects were responsible: ColumnsDescription::getAllRegisteredNames filtered out dotted names, and NestedUtils::getSubcolumnsOfNested misclassified lone dotted Array(T) columns as flattened Nested children.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

When querying an Iceberg table through the `iceberg(...)` table function
or a DataLakeCatalog, a column whose name contains a `.` and whose type
is `Array(T)` (e.g. `` `a.b` ARRAY<STRING> ``) returned empty arrays
instead of the stored values. The same data read by Spark returned the
expected values. Fixes ClickHouse#90731.

The Parquet V3 reader path (`SchemaConverter` + `ColumnMapper` +
`FormatFilterInfo`) is already correct after the dotted-name field-id
work in 0a218cd, 4b733ba and f24c1a4. This change addresses
two residual upstream defects that affect dotted-name `Array(T)`
columns regardless of source:

* `ColumnsDescription::getAllRegisteredNames` explicitly filtered out
  any column whose name contained `.`, under the assumption such names
  were always flattened Nested subcolumns. A column whose stored name
  literally contains a dot (allowed by MergeTree with backticks, and
  produced by Iceberg / Spark) is a first-class registered name and
  must appear in `IHints` misspelling suggestions. The function is only
  consumed by `IHints`-style suggestion paths (and by
  `StorageSystemZooKeeper` for column-name iteration, where no dotted
  names exist), so relaxing it has no effect on parsing, planning,
  storage, or wire protocol.

* `NestedUtils::getSubcolumnsOfNested` treated every `Array(T)` column
  whose name contained `.` as a flattened element of a synthetic
  `Nested` structure named after the prefix. This caused the Arrow,
  ORC and pre-V3 Parquet readers to look for a struct field with the
  prefix name in the data file rather than the literal dotted column,
  returning an empty array. The fix uses a two-pass scan: a synthetic
  `Nested` entry is only emitted when at least two `Array(T)` columns
  share the same dotted prefix. A lone column such as `a.b: Array(T)`
  no longer appears in the synthetic-Nested map. Genuine flattened
  `Nested` with multiple fields is unaffected; the existing
  early-continue on `isNested()` also covers the one-field-Nested
  edge case.

Tests:
* `tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_column`:
  end-to-end repro of ClickHouse#90731 against s3, azure and local storage.
* `test_dotted_array_alongside_real_nested` in the same file:
  mixed-schema regression guard verifying that a lone dotted `Array` column
  coexists with genuine flattened-Nested siblings.
* `tests/queries/0_stateless/04259_dotted_array_not_nested.sql`:
  isolates the `NestedUtils::getSubcolumnsOfNested` defect without Iceberg.
* `tests/queries/0_stateless/04260_dotted_column_in_hints.sh`:
  verifies the `ColumnsDescription::getAllRegisteredNames` fix by checking
  the misspelling hint output.

Changelog category (leave one):
- Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry:
Fix reading Iceberg tables whose `ARRAY` column names contain a dot
(e.g. `` `a.b` ARRAY<STRING> ``), which previously returned empty
arrays. Two upstream defects were responsible:
`ColumnsDescription::getAllRegisteredNames` filtered out dotted names,
and `NestedUtils::getSubcolumnsOfNested` misclassified lone dotted
`Array(T)` columns as flattened `Nested` children.

(cherry picked from commit f8467af)

il9ue commented May 22, 2026

Backport of upstream fix for ClickHouse#90731

Backport of f8467afa849f7ce5aec7a7d372b00fdabf13b4b1 (upstream PR: <UPSTREAM_PR_URL>) onto antalya-26.3. Cherry-pick applied cleanly with no contextual conflicts.

Fixes the customer-reported symptom from ClickHouse/ClickHouse#90731 against the 26.3 release line.

Actual fix

Symptom

When querying an Iceberg table through the `iceberg(...)` table function or a DataLakeCatalog, a column whose name contains a `.` and whose type is `Array(T)` (e.g. `` `a.b` ARRAY<STRING> ``) returned empty arrays instead of the stored values. The same data read by Spark returned the expected values.

-- Spark
CREATE TABLE table7 (`a.b` ARRAY<STRING>);
INSERT INTO table7 VALUES (ARRAY('a','b','c'));

-- ClickHouse (before fix)
SELECT `a.b` FROM iceberg('...');
-- got:      []
-- expected: ['a','b','c']

Root cause

The Parquet V3 reader path (`SchemaConverter` + `ColumnMapper` + `FormatFilterInfo`) is already correct after the dotted-name field-id work in 0a218cd4e8b, 4b733bae561 and f24c1a46063 (all merged into upstream master, also present in antalya-26.3). The remaining symptom is caused by two upstream defects, independent of Iceberg but exposed by it:

  1. `ColumnsDescription::getAllRegisteredNames` explicitly filtered out any column whose name contained `.`, under the assumption that such names were always flattened `Nested` subcolumns. A column whose stored name literally contains a dot (allowed by MergeTree with backticks, and produced by Iceberg / Spark) is a first-class registered name and must appear in `IHints` misspelling suggestions. The function is only consumed by `IHints`-style suggestion paths (and by `StorageSystemZooKeeper` for column-name iteration, where no dotted names exist), so relaxing it has no effect on parsing, planning, storage, or wire protocol.

  2. `NestedUtils::getSubcolumnsOfNested` treated every `Array(T)` column whose name contained `.` as a flattened element of a synthetic `Nested` structure named after the prefix. This caused the Arrow, ORC and pre-V3 Parquet readers to look for a struct field with the prefix name in the data file rather than the literal dotted column, returning an empty array.
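
The effect of the first defect on hint suggestions can be modelled in a few lines. This is only an illustrative Python sketch, not the ClickHouse implementation: `registered_names_before`, `registered_names_after` and `suggest` are hypothetical names, and `difflib.get_close_matches` stands in for whatever matching `IHints` actually performs.

```python
import difflib


def registered_names_before(columns):
    # Model of the pre-fix behaviour described above:
    # any name containing a dot is dropped from the candidate list.
    return [name for name in columns if "." not in name]


def registered_names_after(columns):
    # Model of the post-fix behaviour: every registered name is returned.
    return list(columns)


def suggest(misspelled, candidates):
    # Stand-in for a misspelling hint (an assumption of this sketch):
    # return the closest candidate, if any is similar enough.
    return difflib.get_close_matches(misspelled, candidates, n=1, cutoff=0.6)


columns = ["id", "a.b"]
print(suggest("a.c", registered_names_before(columns)))  # []
print(suggest("a.c", registered_names_after(columns)))   # ['a.b']
```

With the filter in place the dotted column can never be offered as a suggestion, because it is removed from the candidate list before any matching happens.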

Fix

  • `ColumnsDescription::getAllRegisteredNames`: drop the dot filter; return every registered column name.
  • `NestedUtils::getSubcolumnsOfNested`: use a two-pass scan, so that a synthetic `Nested` entry is only emitted when at least two `Array(T)` columns share the same dotted prefix. A lone column such as `a.b: Array(T)` no longer appears in the synthetic-Nested map. Genuine flattened `Nested` with multiple fields is unaffected; the existing early-continue on `isNested()` covers the one-field-Nested edge case.
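
The two-pass rule can be sketched as follows. This is a simplified Python model for illustration, not the C++ code in `NestedUtils`. The function names are hypothetical, types are plain strings, the prefix is assumed to be the text before the first dot, and the `isNested()` early-continue is left out.

```python
from collections import defaultdict


def dotted_array_prefix(name, type_name):
    """Return the prefix of a dotted Array(T) column, or None otherwise.

    Assumption of this sketch: the prefix is the text before the first dot.
    """
    if "." in name and type_name.startswith("Array("):
        return name.split(".", 1)[0]
    return None


def synthetic_nested_before(columns):
    # Pre-fix model: every dotted Array(T) column is grouped under its prefix,
    # even when it is the only column with that prefix.
    groups = defaultdict(list)
    for name, type_name in columns:
        prefix = dotted_array_prefix(name, type_name)
        if prefix is not None:
            groups[prefix].append(name)
    return dict(groups)


def synthetic_nested_after(columns):
    # Pass 1: count dotted Array(T) columns per prefix.
    counts = defaultdict(int)
    for name, type_name in columns:
        prefix = dotted_array_prefix(name, type_name)
        if prefix is not None:
            counts[prefix] += 1
    # Pass 2: emit a group only when at least two columns share the prefix.
    groups = defaultdict(list)
    for name, type_name in columns:
        prefix = dotted_array_prefix(name, type_name)
        if prefix is not None and counts[prefix] >= 2:
            groups[prefix].append(name)
    return dict(groups)


columns = [
    ("id", "UInt64"),
    ("a.b", "Array(String)"),   # lone dotted array, e.g. from Iceberg / Spark
    ("n.x", "Array(UInt32)"),   # flattened Nested siblings
    ("n.y", "Array(String)"),
]
print(synthetic_nested_before(columns))  # {'a': ['a.b'], 'n': ['n.x', 'n.y']}
print(synthetic_nested_after(columns))   # {'n': ['n.x', 'n.y']}
```

In this model the lone `a.b` column drops out of the synthetic map after the change, while the `n.x` / `n.y` pair is still grouped under `n`.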

Tests

  • `tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_column`: end-to-end repro of ClickHouse/ClickHouse#90731 ("Iceberg: Selecting from ARRAY column with dot-separated name returns empty lists") against s3, azure and local storage.
  • `tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_alongside_real_nested`: mixed-schema regression guard verifying that a lone dotted `Array` column coexists with a genuine flattened-Nested sibling group sharing a different prefix.
  • `tests/queries/0_stateless/04259_dotted_array_not_nested.sql`: isolates the `NestedUtils` fix without Iceberg, using the Memory engine.
  • `tests/queries/0_stateless/04260_dotted_column_in_hints.sh`: verifies the `ColumnsDescription` fix by checking the misspelling-hint output.

Risk

Low. The change is a five-line removal in `ColumnsDescription` (hint suggestions only) and a localised two-pass refactor in `NestedUtils::getSubcolumnsOfNested`, guarded by the existing `isNested()` early-continue. No header changes, no new settings, no public API surface change.

Scope

  • No new settings.
  • No header changes (`ColumnsDescription.h`, `NestedUtils.h` untouched).
  • Parquet V3 path (`SchemaConverter`, `Reader`, `FormatFilterInfo`, `ArrowColumnToCHColumn`, `Storages/ObjectStorage/DataLakes/Iceberg/*`) is not modified; it was already fixed by the three commits cited above.
  • CHANGELOG.md is not edited directly; entry goes in via the Changelog section above.
