API: Rewrite string truncate equality predicates onto the source column #16362
Open
wombatu-kun wants to merge 1 commit into
UnboundPredicate.bindLiteralOperation carried a long-standing TODO to translate truncate(col) == value into startsWith(value). This resolves it for string truncate transforms.

The exact equivalence depends on the literal length relative to the truncate width W: len(v) > W is unsatisfiable (alwaysFalse, or alwaysTrue for NOT_EQ); len(v) == W is equivalent to col STARTS_WITH v; len(v) < W is equivalent to col == v. EQ and NOT_EQ are handled. Integer, long, decimal and binary truncate, and all other operators, are left unchanged because they have no exact source-column equivalence.

This rewrite was already assumed by the rest of the engine: InclusiveMetricsEvaluator.startsWith() returns ROWS_MIGHT_MATCH for non-identity transform terms with the comment "truncate must be rewritten in binding". Producing a predicate on the untransformed column lets metrics, dictionary and partition pruning use the column directly instead of an opaque transform term.

The width/length decision is centralized in a single Truncate.lengthRewrite helper shared by predicate binding and by TruncateString.project / projectStrict so the two cannot diverge. TruncateString now also projects EXACT-length EQ/NOT_EQ predicates directly onto the partition value, preserving the precision the previous BoundTransform-term projection path provided.

Tests cover all three length classes at the binding level, a runtime-equivalence check via Evaluator, and a metrics-pruning case that now prunes where it previously could not; they also confirm non-string truncate, non-truncate transforms and other operators are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d898d81 to e78b03e
What
Resolves the long-standing `TODO: translate truncate(col) == value to startsWith(value)` in `UnboundPredicate.bindLiteralOperation`. When the term of an `EQ`/`NOT_EQ` predicate is a string `truncate[W]` transform, binding now produces an exactly equivalent predicate on the untransformed source column.

The equivalence depends on the literal length vs. the truncate width `W`:

| Literal length | `truncate[W](col) == v` | `truncate[W](col) != v` |
|---|---|---|
| `len(v) > W` | `alwaysFalse()` | `alwaysTrue()` |
| `len(v) == W` | `col STARTS_WITH v` | `col NOT_STARTS_WITH v` |
| `len(v) < W` | `col == v` | `col != v` |

Integer/long/decimal/binary truncate and all other operators are intentionally left unchanged because they have no exact source-column equivalence.
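The three length classes can be illustrated with a small standalone sketch. It does not use Iceberg classes and is not the code from this PR: `TruncateEqRewriteSketch`, `LengthClass`, `classify`, `rewriteEq` and `rewriteNotEq` are hypothetical names, and the sketch assumes that string truncate width is counted in Unicode code points, that `W >= 1`, and that column values are non-null.

```java
import java.util.function.Predicate;

/** Illustrative only: hypothetical names, not the helper added by this PR. */
final class TruncateEqRewriteSketch {
  /** How the literal length compares to the truncate width. */
  enum LengthClass { TOO_LONG, PREFIX, EXACT }

  // Assumption: string length and truncate width are measured in code points.
  static int length(String s) {
    return s.codePointCount(0, s.length());
  }

  /** Keeps at most {@code width} leading code points of {@code s}. */
  static String truncate(int width, String s) {
    return length(s) <= width ? s : s.substring(0, s.offsetByCodePoints(0, width));
  }

  static LengthClass classify(int width, String literal) {
    int len = length(literal);
    if (len > width) {
      return LengthClass.TOO_LONG; // a truncated value is never longer than width
    }
    return len == width ? LengthClass.PREFIX : LengthClass.EXACT;
  }

  /** Source-column predicate equivalent to truncate[width](col) == literal. */
  static Predicate<String> rewriteEq(int width, String literal) {
    switch (classify(width, literal)) {
      case TOO_LONG:
        return col -> false;                    // alwaysFalse()
      case PREFIX:
        return col -> col.startsWith(literal);  // col STARTS_WITH v
      default:
        return col -> col.equals(literal);      // col == v
    }
  }

  /** Source-column predicate equivalent to truncate[width](col) != literal. */
  static Predicate<String> rewriteNotEq(int width, String literal) {
    return rewriteEq(width, literal).negate();
  }
}
```

For example, with `W = 3` the literal `"xyz"` falls in the prefix class, so `rewriteEq(3, "xyz")` accepts `"xyzz"` just as `truncate(3, "xyzz").equals("xyz")` does, while `rewriteEq(3, "xy")` accepts only the exact value `"xy"`.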
Why
This rewrite is already assumed by the rest of the engine.
`InclusiveMetricsEvaluator.startsWith()` returns `ROWS_MIGHT_MATCH` for non-identity transform terms with the explicit comment "truncate must be rewritten in binding". Until now the binder never performed that rewrite for equality, so `truncate(col) == v` kept an opaque `BoundTransform` term and metrics/dictionary/partition pruning could not use the column. After this change such predicates prune correctly (e.g. `equal(truncate("str", 3), "xyz")` against bounds `["abc", "abe"]` now skips the file instead of reading it).

Implementation notes
- The `<` / `==` / `>` width decision is centralized in a single `Truncate.lengthRewrite` helper, shared by predicate binding and by `TruncateString.project` / `projectStrict`, so the two paths cannot diverge.
- Removing the `BoundTransform` term would otherwise defeat `ProjectionUtil.projectTransformPredicate` (which matches partition transforms by `toString()`) and collapse strict projection to `False`. To preserve the previous precision, `TruncateString` now projects EXACT-length (`len(v) < W`) `EQ`/`NOT_EQ` predicates directly onto the partition value. This is provably the same result the old transform-term path produced, since `truncate[W](x) == v ⟺ x == v` when `len(v) < W`.
- API additions (`Transforms.StringTruncateRewrite` enum + `Transforms.stringTruncateRewrite`); `revapi` passes with no accepted-breaks entry.

Testing
- `TestPredicateBinding`: all three length classes for `EQ`/`NOT_EQ`, empty-string literal, plus negatives (non-string truncate, non-truncate transforms, other operators unchanged).
- `TestStartsWith`: runtime-equivalence check via `Evaluator`, including the EXACT-vs-prefix distinction.
- `TestInclusiveMetricsEvaluatorWithTransforms`: a pruning case that now prunes where it previously could not.
- `:iceberg-api:test` and `:iceberg-core:test`, all transform/projection/residual regression suites, `spotlessCheck`, and `:iceberg-api:revapi` pass.

🤖 Generated with Claude Code
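As a rough illustration of the pruning case described under Why, the sketch below shows how a prefix predicate on the source column can be checked against a file's lower and upper bounds. It is a simplified stand-in, not the `InclusiveMetricsEvaluator` implementation: `StartsWithPruningSketch`, `head` and `mightMatchStartsWith` are hypothetical names, and plain `String.compareTo` ordering over UTF-16 code units is assumed for bounds comparison.

```java
/** Illustrative only: a simplified bounds check, not Iceberg's evaluator. */
final class StartsWithPruningSketch {
  /** First n chars of s, or all of s if it is shorter. */
  static String head(String s, int n) {
    return s.length() <= n ? s : s.substring(0, n);
  }

  /**
   * Returns false only when no value in [lower, upper] can start with prefix.
   * Cutting a string to its first n chars preserves ordering, so comparing
   * the cut bounds against the prefix is a safe (inclusive) test.
   */
  static boolean mightMatchStartsWith(String prefix, String lower, String upper) {
    int n = prefix.length();
    if (head(lower, n).compareTo(prefix) > 0) {
      return false; // every value in the file sorts after all strings with this prefix
    }
    if (head(upper, n).compareTo(prefix) < 0) {
      return false; // every value in the file sorts before all strings with this prefix
    }
    return true;
  }
}
```

With the bounds from the description, `mightMatchStartsWith("xyz", "abc", "abe")` returns `false`, so the file can be skipped, whereas a prefix such as `"ab"` cannot be ruled out and the file must still be read.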