Fix Iceberg ARRAY columns with dot-separated names returning empty lists by il9ue · Pull Request #1826 · Altinity/ClickHouse

il9ue · 2026-05-22T09:34:14Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix reading Iceberg tables whose ARRAY column names contain a dot (e.g. `a.b` ARRAY<STRING>), which previously returned empty arrays. Two upstream defects were responsible: ColumnsDescription::getAllRegisteredNames filtered out dotted names, and NestedUtils::getSubcolumnsOfNested misclassified lone dotted Array(T) columns as flattened Nested children.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)


When querying an Iceberg table through the `iceberg(...)` table function or a DataLakeCatalog, a column whose name contains a `.` and whose type is `Array(T)` (e.g. `` `a.b` ARRAY<STRING> ``) returned empty arrays instead of the stored values. The same data read by Spark returned the expected values. Fixes ClickHouse#90731.

The Parquet V3 reader path (`SchemaConverter` + `ColumnMapper` + `FormatFilterInfo`) is already correct after the dotted-name field-id work in 0a218cd, 4b733ba and f24c1a4. This change addresses two residual upstream defects that affect dotted-name `Array(T)` columns regardless of source:

* `ColumnsDescription::getAllRegisteredNames` explicitly filtered out any column whose name contained `.`, under the assumption such names were always flattened Nested subcolumns. A column whose stored name literally contains a dot (allowed by MergeTree with backticks, and produced by Iceberg / Spark) is a first-class registered name and must appear in `IHints` misspelling suggestions. The function is only consumed by `IHints`-style suggestion paths (and by `StorageSystemZooKeeper` for column-name iteration, where no dotted names exist), so relaxing it has no effect on parsing, planning, storage, or wire protocol.
* `NestedUtils::getSubcolumnsOfNested` treated every `Array(T)` column whose name contained `.` as a flattened element of a synthetic `Nested` structure named after the prefix. This caused the Arrow, ORC and pre-V3 Parquet readers to look for a struct field with the prefix name in the data file rather than the literal dotted column, returning an empty array. The fix uses a two-pass scan: a synthetic `Nested` entry is only emitted when at least two `Array(T)` columns share the same dotted prefix. A lone column such as `a.b: Array(T)` no longer appears in the synthetic-Nested map. Genuine flattened `Nested` with multiple fields is unaffected; the existing early-continue on `isNested()` also covers the one-field-Nested edge case.

Tests:

* `tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_column`: end-to-end repro of ClickHouse#90731 against s3, azure and local storage.
* `test_dotted_array_alongside_real_nested` in the same file: mixed-schema regression guard verifying a lone dotted `Array` column coexists with genuine flattened-Nested siblings.
* `tests/queries/0_stateless/04259_dotted_array_not_nested.sql`: isolates Bug B (the `NestedUtils::getSubcolumnsOfNested` defect) without Iceberg.
* `tests/queries/0_stateless/04260_dotted_column_in_hints.sh`: verifies Bug A (the `ColumnsDescription::getAllRegisteredNames` defect) by checking the misspelling hint output.

Changelog category (leave one):

- Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry:

Fix reading Iceberg tables whose `ARRAY` column names contain a dot (e.g. `` `a.b` ARRAY<STRING> ``), which previously returned empty arrays. Two upstream defects were responsible: `ColumnsDescription::getAllRegisteredNames` filtered out dotted names, and `NestedUtils::getSubcolumnsOfNested` misclassified lone dotted `Array(T)` columns as flattened `Nested` children.

(cherry picked from commit f8467af)

il9ue · 2026-05-22T09:34:25Z

Backport of upstream fix for ClickHouse#90731

Backport of f8467afa849f7ce5aec7a7d372b00fdabf13b4b1 (upstream PR: <UPSTREAM_PR_URL>) onto antalya-26.3. Cherry-pick applied cleanly with no contextual conflicts.

Fixes the customer-reported symptom from ClickHouse/ClickHouse#90731 against the 26.3 release line.

Actual fix

Symptom

When querying an Iceberg table through the `iceberg(...)` table function or a DataLakeCatalog, a column whose name contains a `.` and whose type is `Array(T)` (e.g. `` `a.b` ARRAY<STRING> ``) returned empty arrays instead of the stored values. The same data read by Spark returned the expected values.

-- Spark
CREATE TABLE table7 (`a.b` ARRAY<STRING>);
INSERT INTO table7 VALUES (ARRAY('a','b','c'));

-- ClickHouse (before fix)
SELECT `a.b` FROM iceberg('...');
-- got:      [ ]
-- expected: ['a','b','c']

Root cause

The Parquet V3 reader path (SchemaConverter + ColumnMapper + FormatFilterInfo) is already correct after the dotted-name field-id work in 0a218cd4e8b, 4b733bae561 and f24c1a46063 (all merged into upstream master, also present in antalya-26.3). The remaining symptom is caused by two upstream defects, independent of Iceberg but exposed by it:

* `ColumnsDescription::getAllRegisteredNames` explicitly filtered out any column whose name contained `.`, under the assumption such names were always flattened Nested subcolumns. A column whose stored name literally contains a dot (allowed by MergeTree with backticks, and produced by Iceberg / Spark) is a first-class registered name and must appear in `IHints` misspelling suggestions. The function is only consumed by `IHints`-style suggestion paths (and by `StorageSystemZooKeeper` for column-name iteration, where no dotted names exist), so relaxing it has no effect on parsing, planning, storage, or wire protocol.
* `NestedUtils::getSubcolumnsOfNested` treated every `Array(T)` column whose name contained `.` as a flattened element of a synthetic `Nested` structure named after the prefix. This caused the Arrow, ORC and pre-V3 Parquet readers to look for a struct field with the prefix name in the data file rather than the literal dotted column, returning an empty array.
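The second defect can be modelled in a few lines. What follows is a simplified Python model of the behaviour described above, not the ClickHouse C++ code: the function name `old_subcolumns_of_nested`, the dict-based schema, the string type names, and splitting at the first dot are all assumptions made for illustration.

```python
def split_name(name):
    """Split at the first dot: 'a.b' -> ('a', 'b'); 'id' -> ('id', '')."""
    prefix, _, rest = name.partition(".")
    return prefix, rest


def old_subcolumns_of_nested(columns):
    """Hypothetical model of the pre-fix grouping.

    Every dotted Array(...) column is assumed to be a flattened child of a
    synthetic Nested structure named after the prefix, even when it is the
    only column with that prefix.
    """
    nested = {}
    for name, type_name in columns.items():
        prefix, rest = split_name(name)
        if rest and type_name.startswith("Array("):
            nested.setdefault(prefix, []).append(rest)
    return nested


# A lone dotted Array column, as produced by the Spark repro.
schema = {"a.b": "Array(String)", "id": "Int64"}
print(old_subcolumns_of_nested(schema))
```

In this model the lone column `a.b` is reported as field `b` of a synthetic `Nested` named `a`, which corresponds to the readers looking for a struct field `a` in the data file instead of the literal column `a.b`.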

Fix

* `ColumnsDescription::getAllRegisteredNames`: drop the dot filter; return every registered column name.
* `NestedUtils::getSubcolumnsOfNested`: use a two-pass scan, so that a synthetic `Nested` entry is only emitted when at least two `Array(T)` columns share the same dotted prefix. A lone column such as `a.b: Array(T)` no longer appears in the synthetic-Nested map. Genuine flattened `Nested` with multiple fields is unaffected; the existing early-continue on `isNested()` covers the one-field-Nested edge case.
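A rough Python model of the two changes may help when reviewing. It is a sketch under stated assumptions, not the actual C++ patch: the function names `subcolumns_of_nested` and `all_registered_names`, the dict-based schema, and the string type names are hypothetical, and the `isNested()` early-continue is left out.

```python
from collections import Counter


def is_array(type_name):
    """Assumed type check: type names are plain strings in this model."""
    return type_name.startswith("Array(")


def subcolumns_of_nested(columns):
    """Hypothetical model of the two-pass scan."""
    # Pass 1: count dotted Array columns per prefix.
    counts = Counter()
    for name, type_name in columns.items():
        prefix, _, rest = name.partition(".")
        if rest and is_array(type_name):
            counts[prefix] += 1

    # Pass 2: emit a synthetic Nested entry only for prefixes shared by
    # at least two Array columns; lone dotted columns are skipped.
    nested = {}
    for name, type_name in columns.items():
        prefix, _, rest = name.partition(".")
        if rest and is_array(type_name) and counts[prefix] >= 2:
            nested.setdefault(prefix, []).append(rest)
    return nested


def all_registered_names(columns):
    """Model of the relaxed hint source: no dot filter, every name is kept."""
    return list(columns)


schema = {
    "a.b": "Array(String)",  # lone dotted column (the Iceberg case)
    "n.x": "Array(Int32)",   # flattened Nested group with two fields
    "n.y": "Array(String)",
    "id": "Int64",
}
print(subcolumns_of_nested(schema))
print(all_registered_names(schema))
```

With this schema the model groups only `n.x` and `n.y` under `n`, leaves `a.b` as an ordinary column, and lists `a.b` among the names available for hints.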

Tests

* `tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_column`: end-to-end repro of ClickHouse/ClickHouse#90731 ("Iceberg: Selecting from ARRAY column with dot-separated name returns empty lists") against s3, azure and local storage.
* `tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_alongside_real_nested`: mixed-schema regression guard verifying a lone dotted `Array` column coexists with a genuine flattened-Nested sibling group sharing a different prefix.
* `tests/queries/0_stateless/04259_dotted_array_not_nested.sql`: isolates the `NestedUtils` fix without Iceberg, using the Memory engine.
* `tests/queries/0_stateless/04260_dotted_column_in_hints.sh`: verifies the `ColumnsDescription` fix by checking the misspelling-hint output.

Risk

Low. Five-line removal in ColumnsDescription (hint suggestions only) and a localised two-pass refactor in NestedUtils::getSubcolumnsOfNested guarded by the existing isNested() early-continue. No header changes, no new settings, no public API surface change.

Scope

* No new settings.
* No header changes (`ColumnsDescription.h`, `NestedUtils.h` untouched).
* Parquet V3 path (`SchemaConverter`, `Reader`, `FormatFilterInfo`, `ArrowColumnToCHColumn`, `Storages/ObjectStorage/DataLakes/Iceberg/*`) is not modified; it was already fixed by the three commits cited above.
* `CHANGELOG.md` is not edited directly; the entry goes in via the Changelog section above.

il9ue wants to merge 1 commit into Altinity:antalya-26.3 from il9ue:fix/iceberg-dotted-array-90731-antalya-26.3
