perf: optimise appendCompact usage #262
Conversation
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Force-pushed 304fb9f to 88a4315
@dtmeadows for visibility

Thanks @dannykopping! Taking a look internally now to get a better understanding of this one.

Hey @dtmeadows, have y'all had a chance to look into this?

@dannykopping We profiled our application under load and can independently confirm the problem described in this PR. CPU profiles show appendCompact and the associated stateInString scanner accounting for ~55% of total CPU during JSON serialization. The recursive MarshalJSON → appendCompact chain re-scans each nested type's output at every level, making the cost grow with nesting depth × payload size. Would appreciate this getting reviewed. 🙏
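For illustration, a minimal, self-contained sketch of that pattern using the standard library (the Level type and nested helper here are hypothetical, not taken from the SDK): each MarshalJSON result gets compacted by the encoder, so the innermost bytes are re-scanned once per enclosing level.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Level is a hypothetical nested type, not from the SDK.
type Level struct {
	Child *Level          `json:"child,omitempty"`
	Data  json.RawMessage `json:"data,omitempty"`
}

// MarshalJSON delegates through an alias type to avoid infinite recursion.
// An encoder that compacts every MarshalJSON result re-scans the innermost
// Data bytes once per enclosing level.
func (l Level) MarshalJSON() ([]byte, error) {
	type plain Level
	return json.Marshal(plain(l))
}

// nested builds a value depth levels deep; with per-level compaction the
// total work grows roughly with depth × payload size.
func nested(depth int, payload json.RawMessage) *Level {
	l := &Level{Data: payload}
	for i := 1; i < depth; i++ {
		l = &Level{Child: l}
	}
	return l
}

func main() {
	out, _ := json.Marshal(nested(3, json.RawMessage(`{"k":"v"}`)))
	fmt.Println(string(out)) // {"child":{"child":{"data":{"k":"v"}}}}
}
```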
@dtmeadows for viz, could you please have a look?
Sorry for the delay everyone! I've opened up a PR internally to hopefully get this adopted. I'll update this issue once I have more to share. |
Skip appendCompact calls for MarshalJSON output in the internal JSON encoder. The appendCompact function scans every byte to validate and compact JSON, which is O(n) per nested MarshalJSON call. For deeply nested structures, this becomes a significant bottleneck as the same bytes get re-scanned at each nesting level.
There's no need to do HTML escaping again either: shims.EscapeHTMLByDefault is true, and all MarshalJSON implementations use Marshal() internally, which already applies HTML escaping.
For our use of this lib, this func was the biggest CPU hog when encoding JSON payloads at scale.
This is safe because MarshalJSON implementations (both in this SDK and standard library) already produce valid, compact JSON - running appendCompact on the output is redundant work.
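A minimal sketch of the idea (the encodeMarshaler name and signature are illustrative, not the SDK's actual internals): the encoder trusts MarshalJSON output and appends it verbatim instead of re-scanning it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encodeMarshaler is a hypothetical encoder hook for values implementing
// json.Marshaler; names are illustrative only.
func encodeMarshaler(dst []byte, m json.Marshaler) ([]byte, error) {
	out, err := m.MarshalJSON()
	if err != nil {
		return dst, err
	}
	// Before: dst = appendCompact(dst, out) — an extra O(len(out))
	// validate-and-compact pass repeated at every nesting level.
	// After: trust MarshalJSON to return valid, compact, HTML-escaped
	// JSON and append its output verbatim.
	return append(dst, out...), nil
}

type money struct{ Cents int64 }

func (m money) MarshalJSON() ([]byte, error) {
	// Delegating to Marshal means the output is already compact and
	// HTML-escaped, which is what makes skipping appendCompact safe.
	return json.Marshal(map[string]int64{"cents": m.Cents})
}

func main() {
	buf, err := encodeMarshaler(nil, money{Cents: 4200})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(buf)) // {"cents":4200}
}
```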
```
goos: linux
goarch: amd64
pkg: github.com/anthropics/anthropic-sdk-go/internal/encoding/json
cpu: AMD Ryzen 9 7900 12-Core Processor

                                     │  /tmp/old   │             /tmp/new                │
                                     │   sec/op    │    sec/op     vs base               │
MarshalNestedMarshalJSON-24            5.577µ ± 9%    2.660µ ± 13%  -52.31% (p=0.000 n=10)
MarshalSliceOfNestedMarshalJSON-24     301.5µ ± 6%    137.1µ ±  9%  -54.54% (p=0.000 n=10)
geomean                                41.01µ         19.09µ        -53.44%

                                     │  /tmp/old    │             /tmp/new               │
                                     │     B/op     │     B/op      vs base              │
MarshalNestedMarshalJSON-24            914.0 ± 0%      913.0 ± 0%    -0.11% (p=0.000 n=10)
MarshalSliceOfNestedMarshalJSON-24     49.41Ki ± 0%    49.28Ki ± 0%  -0.26% (p=0.000 n=10)
geomean                                6.641Ki         6.629Ki       -0.19%

                                     │  /tmp/old   │            /tmp/new              │
                                     │  allocs/op  │  allocs/op   vs base             │
MarshalNestedMarshalJSON-24            12.00 ± 0%    12.00 ± 0%   ~ (p=1.000 n=10) ¹
MarshalSliceOfNestedMarshalJSON-24     552.0 ± 0%    552.0 ± 0%   ~ (p=1.000 n=10) ¹
geomean                                81.39         81.39        +0.00%

¹ all samples are equal
```

For our project, under scale testing this fix reduced CPU usage by ~36%.
I enhanced test coverage and added benchmarks.
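As a rough sketch of what such a benchmark can look like (hypothetical; the benchmarks actually added in this PR live in the SDK's internal encoder package and may differ), reusing the Level and nested helpers from the earlier sketch:

```go
import (
	"encoding/json"
	"testing"
)

// BenchmarkMarshalNestedMarshalJSON is illustrative only: it exercises the
// nested MarshalJSON path via the standard library, whereas the PR's
// benchmarks target the SDK's internal encoder.
func BenchmarkMarshalNestedMarshalJSON(b *testing.B) {
	v := nested(10, json.RawMessage(`{"k":"v"}`))
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := json.Marshal(v); err != nil {
			b.Fatal(err)
		}
	}
}
```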
Disclaimer: produced alongside Opus 4.5.