Keep abbreviate and truncate off surrogate pair boundaries#1719
Conversation
|
@alhudz Please address #1718 (review) |
garydgregory
left a comment
There was a problem hiding this comment.
@alhudz
Thank you for your PR.
Please see my 2 comments.
| assertEquals(grin, StringUtils.truncate("a" + grin + "b", 1, 2)); | ||
| assertEquals("ab", StringUtils.truncate("ab" + grin, 0, 3)); | ||
| // an offset that lands inside a pair skips the orphaned low surrogate | ||
| assertEquals("a", StringUtils.truncate(grin + "ab", 1, 2)); |
There was a problem hiding this comment.
You're missing assertions for an input string that only contains the grin, with permutations of the other parameters.
There was a problem hiding this comment.
Added. truncate(grin, ...) is now covered across offset/maxWidth permutations: the whole pair when it fits (0,2, 0,3), and empty when the cut or the offset lands inside it (0,0, 0,1, 1,1, 1,2, 2,2). Never a lone surrogate.
| } | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
You're missing assertions for an input string that only contains the grin, with permutations of the other parameters.
There was a problem hiding this comment.
Done. Added abbreviate(grin, ...) across the width, offset and marker permutations. The pair is kept whole in the normal cases, and the empty-marker case (grin, "", 0, 1) exercises the head cut backing off to empty instead of emitting a lone surrogate.
|
The 1718 review is already handled, I pushed the |
garydgregory
left a comment
There was a problem hiding this comment.
Hello @alhudz
Thank you for this PR.
With this PR applied locally, and with the assertions in org.apache.commons.lang3.StringUtilsAbbreviateTest.testEmoji() commented in, that test fails.
Would you please review testEmoji() failures in light of this PR.
Thank you!
Repro:
StringUtils.abbreviate("😀abcdef", 4)returns a lone high surrogate followed by..., andStringUtils.truncate("a😀b", 0, 2)returnsaplus a lone high surrogate.Cause: both choose the cut position in
charunits and slice withString.substring, so a boundary that lands between the two halves of a surrogate pair leaves an unpaired surrogate behind.Fix: nudge the boundary to the nearest code-point edge before slicing, so a pair is never split. The result still never exceeds
maxWidth.