Skip to content

Keep abbreviate and truncate off surrogate pair boundaries#1719

Open
alhudz wants to merge 2 commits into
apache:masterfrom
alhudz:abbreviate-truncate-surrogate-split
Open

Keep abbreviate and truncate off surrogate pair boundaries#1719
alhudz wants to merge 2 commits into
apache:masterfrom
alhudz:abbreviate-truncate-surrogate-split

Conversation

@alhudz

@alhudz alhudz commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Repro: StringUtils.abbreviate("😀abcdef", 4) returns a lone high surrogate followed by ..., and StringUtils.truncate("a😀b", 0, 2) returns a plus a lone high surrogate.
Cause: both choose the cut position in char units and slice with String.substring, so a boundary that lands between the two halves of a surrogate pair leaves an unpaired surrogate behind.
Fix: nudge the boundary to the nearest code-point edge before slicing, so a pair is never split. The result still never exceeds maxWidth.

@garydgregory garydgregory changed the title keep abbreviate and truncate off surrogate pair boundaries Keep abbreviate and truncate off surrogate pair boundaries Jun 19, 2026
@garydgregory

Copy link
Copy Markdown
Member

@alhudz Please address #1718 (review)

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alhudz
Thank you for your PR.
Please see my 2 comments.

assertEquals(grin, StringUtils.truncate("a" + grin + "b", 1, 2));
assertEquals("ab", StringUtils.truncate("ab" + grin, 0, 3));
// an offset that lands inside a pair skips the orphaned low surrogate
assertEquals("a", StringUtils.truncate(grin + "ab", 1, 2));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're missing assertions for an input string that only contains the grin, with permutations of the other parameters.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. truncate(grin, ...) is now covered across offset/maxWidth permutations: the whole pair when it fits (0,2, 0,3), and empty when the cut or the offset lands inside it (0,0, 0,1, 1,1, 1,2, 2,2). Never a lone surrogate.

}
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're missing assertions for an input string that only contains the grin, with permutations of the other parameters.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added abbreviate(grin, ...) across the width, offset and marker permutations. The pair is kept whole in the normal cases, and the empty-marker case (grin, "", 0, 1) exercises the head cut backing off to empty instead of emitting a lone surrogate.

@alhudz

alhudz commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

The 1718 review is already handled, I pushed the POSITIVE_INFINITY/NEGATIVE_INFINITY tests for double and float there. For this PR I added the grin-only assertions across the parameter permutations and replied inline on both threads.

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @alhudz

Thank you for this PR.

With this PR applied locally, and with the assertions in org.apache.commons.lang3.StringUtilsAbbreviateTest.testEmoji() commented in, that test fails.

Would you please review testEmoji() failures in light of this PR.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants