Skip to content

[iceberg-rust] Manifest writer declares bucket() partition field as String instead of Int32 #125

@rampage644

Description

@rampage644

Filed here because issues are disabled on `Embucket/iceberg-rust`. The fix lives in that fork.

Summary

When writing to an Iceberg table whose partition spec uses `bucket(N, col)`, the iceberg-rust manifest writer builds the Avro schema for the `data_file.partition` struct with the bucket field typed as a nullable string. But `transform_arrow()` produces an `Int32` hash for `Bucket`, so serialization fails at commit time:

```
Iceberg error: Failed to serialize field 'data_file' for record
Record(RecordSchema { name: Name { name: "manifest_entry", ... },
fields: [... RecordField { name: "data_file", ... schema: Record(RecordSchema {
name: Name { name: "r2", ... },
fields: [... RecordField { name: "partition", ... schema: Record(RecordSchema {
name: Name { name: "r102", ... },
fields: [RecordField {
name: "id_bucket", ...,
schema: Union(UnionSchema { schemas: [Null, String], ... }), ← wrong type
...
}],
...
```

Expected partition field type: `Union(Null, Int)` (matching the `Int32Type` returned by `transform_arrow` for `Bucket`).

Repro

Create any Iceberg table partitioned by `bucket(N, col)` via Athena — `WITH (table_type = 'ICEBERG', partitioning = ARRAY['bucket(4, id)'])`. Seed one row. Then from Embucket run any MERGE or UPDATE on that target. Seen against the probe table `atomic.merge_test_bucket` on S3 Tables during the verification of Embucket/iceberg-rust#57.

Likely location

Wherever `iceberg-rust` derives the partition-struct Avro schema from the Iceberg partition spec. The per-field result type for each transform should match what `transform_arrow` produces: `Bucket() → Int`, `Truncate() → same as source`, `Day/Month/Year → Int`, `Hour → Int`, `Identity → source type`. Almost certainly a single table of `Transform → Iceberg result type` that's missing or wrong for `Bucket`.

Related

Unmasked once Embucket/iceberg-rust#57 landed — before that, the projection schema mismatch on partitioned targets short-circuited every MERGE before the manifest writer ran.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions