API: Rewrite string truncate equality predicates onto the source column#16362

Open
wombatu-kun wants to merge 1 commit into apache:main from wombatu-kun:api-rewrite-string-truncate-equality

@wombatu-kun
Contributor

What

Resolves the long-standing TODO: translate truncate(col) == value to startsWith(value) in UnboundPredicate.bindLiteralOperation. When the term of an EQ/NOT_EQ predicate is a string truncate[W] transform, binding now produces an exactly-equivalent predicate on the untransformed source column.

The equivalence depends on the literal length vs. the truncate width W:

| Condition | truncate[W](col) == v | truncate[W](col) != v |
|---|---|---|
| len(v) > W | alwaysFalse() | alwaysTrue() |
| len(v) == W | col STARTS_WITH v | col NOT_STARTS_WITH v |
| len(v) < W | col == v | col != v |
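
The three rows can be checked with a small standalone model. This is a sketch and not the code in this PR: `TruncateRewriteSketch`, `Kind`, `classify`, `originalEq` and `rewrittenEq` are hypothetical names, and the `truncate` helper assumes that string truncation keeps at most W Unicode code points. It compares the original predicate with the rewritten one over a handful of sample values; the NOT_EQ column is simply the negation of each result.

```java
import java.util.List;

/** Illustrative model only; hypothetical names, not the Iceberg implementation. */
final class TruncateRewriteSketch {
  enum Kind { NEVER_MATCHES, STARTS_WITH, EQUALS }

  // Assumption: truncate[W] on strings keeps at most `width` Unicode code points.
  static String truncate(String s, int width) {
    int n = s.codePointCount(0, s.length());
    return n <= width ? s : s.substring(0, s.offsetByCodePoints(0, width));
  }

  // Picks the table row from the literal length relative to the width.
  static Kind classify(String literal, int width) {
    int len = literal.codePointCount(0, literal.length());
    if (len > width) {
      return Kind.NEVER_MATCHES; // a truncated value is never longer than W
    }
    return len == width ? Kind.STARTS_WITH : Kind.EQUALS;
  }

  // The predicate as written by the user: truncate[W](col) == literal.
  static boolean originalEq(String col, String literal, int width) {
    return truncate(col, width).equals(literal);
  }

  // The rewritten predicate on the untransformed column.
  static boolean rewrittenEq(String col, String literal, int width) {
    switch (classify(literal, width)) {
      case NEVER_MATCHES:
        return false;
      case STARTS_WITH:
        return col.startsWith(literal);
      default:
        return col.equals(literal);
    }
  }

  public static void main(String[] args) {
    int width = 3;
    List<String> samples = List.of("", "a", "ab", "abc", "abcd", "abd", "xyz", "xyzz");
    for (String col : samples) {
      for (String literal : samples) {
        if (originalEq(col, literal, width) != rewrittenEq(col, literal, width)) {
          throw new AssertionError("mismatch for col=" + col + ", literal=" + literal);
        }
      }
    }
    System.out.println("rewrite agrees with truncate equality on all sampled pairs");
  }
}
```

The sample set is deliberately tiny; it exercises each length class, including the empty-string literal, but it is not a proof.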

Integer/long/decimal/binary truncate and all other operators are intentionally left unchanged — they have no exact source-column equivalence.

Why

This rewrite is already assumed by the rest of the engine. InclusiveMetricsEvaluator.startsWith() returns ROWS_MIGHT_MATCH for non-identity transform terms with the explicit comment "truncate must be rewritten in binding". Until now the binder never performed that rewrite for equality, so truncate(col) == v kept an opaque BoundTransform term and metrics/dictionary/partition pruning could not use the column. After this change such predicates prune correctly (e.g. equal(truncate("str",3),"xyz") against bounds ["abc","abe"] now skips the file instead of reading it).
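
To see why the example file can be skipped, here is a simplified model of a prefix check against column bounds. It is a sketch under stated assumptions and not `InclusiveMetricsEvaluator` itself: `PrefixPruningSketch`, `prefixOf` and `mightContainPrefix` are hypothetical names, and it compares with `String.compareTo`, which matches byte-wise ordering for the ASCII values used here but may differ for other input.

```java
/** Simplified model of bounds-based pruning for a STARTS_WITH predicate; hypothetical names. */
final class PrefixPruningSketch {
  // Cuts a bound down to the prefix length so it can be compared with the prefix.
  static String prefixOf(String s, int len) {
    return s.length() <= len ? s : s.substring(0, len);
  }

  // Returns false only when no value in [lower, upper] can start with `prefix`.
  static boolean mightContainPrefix(String lower, String upper, String prefix) {
    int n = prefix.length();
    if (prefixOf(lower, n).compareTo(prefix) > 0) {
      return false; // every value in the file sorts after the prefix range
    }
    if (prefixOf(upper, n).compareTo(prefix) < 0) {
      return false; // every value in the file sorts before the prefix range
    }
    return true;
  }

  public static void main(String[] args) {
    // equal(truncate("str", 3), "xyz") binds to str STARTS_WITH "xyz" because len("xyz") == 3.
    System.out.println(mightContainPrefix("abc", "abe", "xyz")); // upper bound "abe" sorts before "xyz"
    System.out.println(mightContainPrefix("abc", "abe", "abd")); // "abd" lies between the bounds
  }
}
```

With the predicate left as an opaque transform term, no such comparison is possible, which is why the file was previously read.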

Implementation notes

  • The < / == / > width decision is centralized in a single Truncate.lengthRewrite helper, shared by predicate binding and by TruncateString.project / projectStrict, so the two paths cannot diverge.
  • Removing the BoundTransform term would otherwise defeat ProjectionUtil.projectTransformPredicate (which matches partition transforms by toString()) and collapse strict projection to False. To preserve the previous precision, TruncateString now projects EXACT-length (len(v) < W) EQ/NOT_EQ predicates directly onto the partition value — provably the same result the old transform-term path produced, since truncate[W](x) == v ⟺ x == v when len(v) < W.
  • New public API is additive only (Transforms.StringTruncateRewrite enum + Transforms.stringTruncateRewrite); revapi passes with no accepted-breaks entry.

Testing

  • TestPredicateBinding: all three length classes for EQ/NOT_EQ, empty-string literal, plus negatives (non-string truncate, non-truncate transforms, other operators unchanged).
  • TestStartsWith: runtime-equivalence check via Evaluator, including the EXACT-vs-prefix distinction.
  • TestInclusiveMetricsEvaluatorWithTransforms: a pruning case that now prunes where it previously could not.
  • Full :iceberg-api:test and :iceberg-core:test, all transform/projection/residual regression suites, spotlessCheck, and :iceberg-api:revapi pass.

🤖 Generated with Claude Code

UnboundPredicate.bindLiteralOperation carried a long-standing TODO to translate truncate(col) == value into startsWith(value). This resolves it for string truncate transforms.

The exact equivalence depends on the literal length relative to the truncate width W: len(v) > W is unsatisfiable (alwaysFalse, or alwaysTrue for NOT_EQ); len(v) == W is equivalent to col STARTS_WITH v; len(v) < W is equivalent to col == v. EQ and NOT_EQ are handled. Integer, long, decimal and binary truncate, and all other operators, are left unchanged because they have no exact source-column equivalence.

This rewrite was already assumed by the rest of the engine: InclusiveMetricsEvaluator.startsWith() returns ROWS_MIGHT_MATCH for non-identity transform terms with the comment "truncate must be rewritten in binding". Producing a predicate on the untransformed column lets metrics, dictionary and partition pruning use the column directly instead of an opaque transform term.

The width/length decision is centralized in a single Truncate.lengthRewrite helper shared by predicate binding and by TruncateString.project / projectStrict so the two cannot diverge. TruncateString now also projects EXACT-length EQ/NOT_EQ predicates directly onto the partition value, preserving the precision the previous BoundTransform-term projection path provided.

Tests cover all three length classes at the binding level, a runtime-equivalence check via Evaluator, and a metrics-pruning case that now prunes where it previously could not; they also confirm non-string truncate, non-truncate transforms and other operators are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wombatu-kun wombatu-kun force-pushed the api-rewrite-string-truncate-equality branch from d898d81 to e78b03e Compare May 16, 2026 02:55