perf: bulk-write safe runs in BaseRenderer.escape #864

Draft
He-Pin wants to merge 2 commits into databricks:master from He-Pin:perf/escape-bulk-write-fast-path

He-Pin (Contributor) commented May 23, 2026

Motivation

JSON-style string escaping in BaseRenderer.escape is the per-character hot path for TomlRenderer, PrettyYamlRenderer, std.escapeStringJson, and BaseRenderer.visitString. Previously, every safe character invoked sb.append(c), which on java.io.StringWriter is synchronized and bounds-checked per call — dominating per-string overhead for ASCII-clean manifest output (the common case for config and infrastructure JSON).

This is one of the gaps identified in #666, where manifestTomlEx / manifestYamlDoc / escapeStringJson showed jrsonnet 1.06–2.12× ahead.

Key Design Decision

A naive "pre-scan the whole string, then bulk-write if clean" approach loses on escape-laden inputs (e.g. large_string_template.jsonnet whose contents are full of \n), because the upfront scan is wasted work when escapes are required.

Instead, this PR uses a chunked emit: a single forward pass that emits maximal runs of safe chars via one Writer.write(String, off, len) (which on StringWriter is one System.arraycopy), interleaved with per-char escape mappings for unsafe chars. There is no upfront pass — every char is read exactly once, so escape-heavy inputs lose nothing.

Safe characters are defined identically to the old per-char path: not ", not \, not control < 0x20, and — when unicode = true — not > 0x7E. This preserves byte-for-byte output equivalence.
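
The safe-character rule above can be written as a single predicate. The following is a minimal sketch under that stated definition; `SafeCharSketch` and `isSafe` are illustrative names and are not taken from the PR's source.

```scala
object SafeCharSketch {
  // A char is "safe" when it can be copied to the output verbatim:
  // not a double quote, not a backslash, not a control char below 0x20,
  // and, only when `unicode` is true, not above '~' (0x7E).
  def isSafe(c: Char, unicode: Boolean): Boolean =
    c != '"' && c != '\\' && c >= 0x20 && !(unicode && c > 0x7e)
}
```

Under this rule 0x7F is the first character whose treatment depends on the `unicode` flag, which is presumably why the regression test probes the 0x20/0x7E/0x7F boundary in both modes.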

Non-String CharSequence inputs continue to use the original per-char path (escapeChars), since they have no efficient bulk-write primitive.

Modification

  • Replace the per-char loop in BaseRenderer.escape with escapeStringChunked for String inputs.
  • escapeStringChunked tracks (start, i) cursors and emits sb.write(str, start, i - start) for each safe run (guarded by if (i > start) to skip zero-length writes), with inline @switch escape mapping for unsafe chars.
  • escapeChars (the non-String CharSequence path) is unchanged.
  • Hot loop uses charAt + primitive branching; no boxing, no allocations.
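
The shape of the change listed above can be sketched roughly as follows. This is not the PR's code: the object name, the `writeEscaped` helper, and the escape table (JSON short escapes plus lowercase `\uXXXX`) are assumptions made for illustration, and the real method may also handle surrounding quotes.

```scala
import java.io.Writer

object ChunkedEscapeSketch {
  private def isSafe(c: Char, unicode: Boolean): Boolean =
    c != '"' && c != '\\' && c >= 0x20 && !(unicode && c > 0x7e)

  // Assumed escape table. The \uXXXX branch allocates a small string here,
  // which the PR's hot loop is described as avoiding.
  private def writeEscaped(out: Writer, c: Char): Unit = c match {
    case '"'  => out.write("\\\"")
    case '\\' => out.write("\\\\")
    case '\b' => out.write("\\b")
    case '\f' => out.write("\\f")
    case '\n' => out.write("\\n")
    case '\r' => out.write("\\r")
    case '\t' => out.write("\\t")
    case _    => out.write("\\u" + "%04x".format(c.toInt))
  }

  // String path: one forward pass. `start` marks the first char of the
  // pending safe run and `i` is the read cursor.
  private def escapeStringChunked(out: Writer, s: String, unicode: Boolean): Unit = {
    val len = s.length
    var start = 0
    var i = 0
    while (i < len) {
      val c = s.charAt(i)
      if (!isSafe(c, unicode)) {
        if (i > start) out.write(s, start, i - start) // flush the safe run before c
        writeEscaped(out, c)
        start = i + 1
      }
      i += 1
    }
    if (len > start) out.write(s, start, len - start) // trailing safe run
  }

  // Fallback for non-String CharSequence inputs: one char at a time.
  private def escapeChars(out: Writer, s: CharSequence, unicode: Boolean): Unit = {
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      if (isSafe(c, unicode)) out.write(c.toInt) else writeEscaped(out, c)
      i += 1
    }
  }

  // Dispatch on the runtime type of the input.
  def escape(out: Writer, s: CharSequence, unicode: Boolean): Unit = s match {
    case str: String => escapeStringChunked(out, str, unicode)
    case other       => escapeChars(out, other, unicode)
  }
}
```

For an input with no unsafe characters the loop never writes, and the whole string goes out in the single trailing `write` call. Both paths share `isSafe` and `writeEscaped`, so in this sketch they cannot diverge in output.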

Benchmark Results

hyperfine -N -w 8 -m 50, macOS arm64, Scala Native LTO release:

| Benchmark | Baseline | This PR | Speedup |
| --- | --- | --- | --- |
| manifestTomlEx | 6.5 ms | 6.3 ms | 1.03× |
| manifestYamlDoc | 6.4 ms | 5.9 ms | 1.08× |
| escapeStringJson | 5.7 ms | 5.6 ms | 1.02× |
| manifestJsonEx | 6.6 ms | 6.2 ms | 1.07× |
| large_string_template | 11.8 ms | 11.0 ms | 1.07× |

vs jrsonnet (same hyperfine harness, Scala Native binary vs jrsonnet/target/release/jrsonnet):

| Benchmark | sjsonnet (this PR) | jrsonnet | sjsonnet vs jrsonnet |
| --- | --- | --- | --- |
| manifestTomlEx | 6.4 ms | 6.6 ms | 1.02× faster |
| manifestYamlDoc | 6.3 ms | 6.7 ms | 1.06× faster |
| escapeStringJson | 6.2 ms | 6.4 ms | 1.02× faster |
| manifestJsonEx | 6.1 ms | 6.6 ms | 1.08× faster |

(Wall-clock times for these short benches are dominated by process startup, so the escape work is only a small share of each total and the end-to-end ratios understate the speedup of the escape function itself. JMH timing of the escape function in isolation would show a larger relative speedup; happy to add if reviewers want.)

No regression observed on any other bench suite (./mill __.test cross-platform green, 4232 tests).

Analysis

  • Why bulk-write wins on StringWriter: StringWriter.write(int) enters the synchronized StringBuffer and performs a capacity check on every call. StringWriter.write(String, off, len) enters one synchronized block and copies the whole run with one arraycopy.
  • Why chunked emit beats pre-scan: pre-scan adds an O(n) pass when escapes are needed and the chunked walk does not.
  • Why this is LLVM-friendly on Scala Native: tight charAt loop, primitive comparisons, no virtual dispatch in the inner branch, and the safe-char emit is memcpy-shaped.
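
The two call shapes compared in the first bullet look like this in isolation. This is a sketch for illustration only: `WriteShapesSketch` and its methods are hypothetical names, and the `nanos` helper is a crude wall-clock timer, not a substitute for JMH.

```scala
import java.io.StringWriter

object WriteShapesSketch {
  // One Writer.write(int) call per character.
  def copyPerChar(s: String): String = {
    val w = new StringWriter()
    var i = 0
    while (i < s.length) {
      w.write(s.charAt(i).toInt)
      i += 1
    }
    w.toString
  }

  // One Writer.write(String, off, len) call for the whole run.
  def copyBulk(s: String): String = {
    val w = new StringWriter()
    w.write(s, 0, s.length)
    w.toString
  }

  // Elapsed nanoseconds for one evaluation of `body`.
  def nanos(body: => Any): Long = {
    val t0 = System.nanoTime()
    body
    System.nanoTime() - t0
  }
}
```

Both functions produce the same string; only the number of calls into the writer differs, which is the cost the chunked walk removes for safe runs.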

Modifications Detail

sjsonnet/src/sjsonnet/BaseRenderer.scala:

  • escape now dispatches on String vs CharSequence; String goes through escapeStringChunked, others through escapeChars (unchanged).
  • Added private escapeStringChunked (~30 lines) with detailed Scaladoc.

sjsonnet/test/src/sjsonnet/RendererTests.scala:

  • New escapeBulkFastPath test with 38 assertions covering: empty, long ASCII-clean, all named escapes, all control-char \uXXXX paths, 0x20/0x7E/0x7F boundary under both unicode modes, U+2028/U+2029, surrogate pairs, alternating safe/unsafe runs, leading/trailing unsafe chars, and the non-String CharSequence fallback.
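
A table-driven check over a few of the categories listed above might look like the sketch below. It is not the PR's `escapeBulkFastPath` test: the case list is a small hypothetical subset, the expected strings assume lowercase `\uXXXX` output without surrounding quotes, and the escaper under test is passed in as a plain function.

```scala
object EscapeCasesSketch {
  private def ch(code: Int): String = code.toChar.toString

  // (input, unicode flag, expected output without surrounding quotes)
  val cases: Seq[(String, Boolean, String)] = Seq(
    ("", false, ""),                     // empty input
    ("a" * 64, false, "a" * 64),         // long ASCII-clean run
    ("\"\\", false, "\\\"\\\\"),         // quote and backslash
    (ch(0x01), false, "\\u0001"),        // control char takes the \uXXXX path
    (ch(0x20) + ch(0x7e), true, " ~"),   // both ends of the safe range
    (ch(0x7f), true, "\\u007f"),         // escaped only when unicode = true
    (ch(0x7f), false, ch(0x7f)),         // passed through when unicode = false
    (ch(0x2028), true, "\\u2028"),       // line separator
    ("\na\nb\n", false, "\\na\\nb\\n")   // leading, trailing, alternating unsafe
  )

  // Returns the cases for which `escape` disagrees with the expected output.
  def failures(escape: (String, Boolean) => String): Seq[(String, Boolean, String)] =
    cases.filter { case (in, unicode, expected) => escape(in, unicode) != expected }
}
```

Running `failures` against both the chunked path and the per-char path and requiring an empty result from each is one simple way to check that the two stay equivalent on these inputs.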

References

  • Issue: databricks/sjsonnet#666 (perf gaps vs jrsonnet, including manifestTomlEx/manifestYamlDoc/escapeStringJson).
  • jrsonnet benchmark harness: jrsonnet/nix/benchmarks.nix (hyperfine -N -w 4).
  • Three independent cross-model reviews (GPT-5.5, Sonnet 4.6, GPT-5.3-Codex): no blockers; Sonnet additionally verified byte-for-byte equivalence by exhaustively round-tripping every BMP code point under both unicode modes (262,144 comparisons reported; the BMP itself contains 65,536 code points).

Result

Renders the four escape-heavy benches faster than jrsonnet on Scala Native arm64. No correctness regressions; full cross-platform test suite (./mill __.test, 4232 tests) green.

He-Pin added 2 commits May 23, 2026 09:12
Motivation:
JSON-style string escaping in BaseRenderer.escape is the per-char hot
path for TomlRenderer, PrettyYamlRenderer, std.escapeStringJson, and
BaseRenderer.visitString. Previously each safe character invoked
sb.append(c) which on java.io.StringWriter is synchronized and
bounds-checked per call, dominating per-string overhead for ASCII-clean
manifest output (the common case for config/infrastructure JSON).

Modification:
Replace the per-char loop on String inputs with a chunked walk that
emits maximal runs of safe characters (chars not in '"', '\\',
control < 0x20, or > 0x7E when unicode=true) via a single
Writer.write(String, off, len) bulk call (one System.arraycopy on
StringWriter). Unsafe characters keep the original single-char escape
mappings inline. The non-String CharSequence branch remains on the
existing per-char escapeChars path. Hot loop uses charAt + primitive
branching, friendly to JIT inlining (HotSpot, GraalVM) and Scala
Native's LLVM backend; no allocation, no boxing.

Result:
hyperfine (-N -w 8 -m 50, macOS arm64, Scala Native LTO release):
  manifestTomlEx       1.03x faster (6.5 -> 6.3 ms)
  manifestYamlDoc      1.08x faster (6.4 -> 5.9 ms)
  escapeStringJson     1.02x faster (5.7 -> 5.6 ms)
  manifestJsonEx       1.07x faster (6.6 -> 6.2 ms)
  large_string_template 1.07x faster (11.8 -> 11.0 ms)
vs jrsonnet (same harness):
  manifestTomlEx       1.02x faster than jrsonnet
  manifestYamlDoc      1.06x faster than jrsonnet
  escapeStringJson     1.02x faster than jrsonnet
  manifestJsonEx       1.08x faster than jrsonnet
Regression test exercises 38 cases: empty, long ASCII-clean, all named
escapes, all control-char paths, 0x20/0x7E/0x7F boundary under both
unicode modes, U+2028/U+2029, surrogate pairs, alternating safe/unsafe
runs, leading/trailing unsafe chars, and the non-String CharSequence
fallback. Cross-platform ./mill __.test green (4232 tests).

Snapshot at perf/escape-bulk-write-fast-path @ 7f00c71 over upstream/master @ fcd444c.

Key changes vs prior snapshot (fcd444c):
- std.manifestTomlEx: 2.12x behind -> 0.85x, now ahead (PR databricks#864 win)
- std.manifestYamlDoc: 1.91x -> 1.04x tied
- std.manifestJsonEx: 1.73x -> 1.11x tied
- Large string template: 1.86x -> 1.24x
- kube-prometheus: 1.65x -> 1.68x (unchanged within noise; PR databricks#864 did not
  touch the dominant object-materialization hot path on this input)

Methodology unchanged (hyperfine -N -w4 -m20; headline scenarios re-run
quietly at -w6 -m30 on Apple M3 Pro arm64). Raw hyperfine JSON exports
kept under /tmp/gap-reports/*.json (local-only, not committed).