Port response-quality scoring and consensus fan-out to TypeScript #66
Merged
…arness
Two of the recent CLK harness PRs have a direct parallel in pi-extension:
* push-on-commit + ahead counter (756723c). pi-extension already commits
every clk_checkpoint / clk_merge call, but never pushes, so a
remote-backed Pi workspace silently accumulated local commits.
  - src/git.ts: hasRemote, commitsAhead, pushBestEffort (best-effort,
    never throws; mirrors clk_harness/git_ops.py).
  - src/tools.ts: pushIfEnabled helper called after clk_checkpoint and
    clk_merge. Gated on CLK_GITHUB_PUSH_ON_COMMIT=true to match the
    Python TUI; surfaces an ↑N ahead count on push failure or when
    auto-push is disabled but commits exist.
  - src/index.ts: /clk-doctor now reports the ahead count and warns
    when local commits haven't reached origin.
* multi-line objective truncation (24f379b). idea.slice(0, 60) was being
done before splitting on newlines, so a multi-line idea could leak a
fragment of line 2 into the status bar.
  - src/index.ts: new firstLineShort helper, used at every
    ctx.ui.setStatus("clk-idea", …) site and in /clk-doctor.
Tests: tests/git.test.ts covers no-remote/sync/unreachable cases for
pushBestEffort and commitsAhead. tests/index.test.ts asserts
firstLineShort returns single-line, capped output for multi-line input.
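For reviewers who want the truncation bug in concrete terms, here is a minimal sketch of the ordering fix. It is a hypothetical reconstruction, not the code in src/index.ts: the function body and the sliceOnly comparison helper are assumptions. Only the name firstLineShort and the 60-character cap come from the commit message.

```typescript
// Hypothetical reconstruction of the helper described above.
// Only the name and the 60-character cap come from the commit message.
function firstLineShort(idea: string, max = 60): string {
  // Isolate the first line before capping, so a short first line
  // cannot pull characters from line 2 into the result.
  const firstLine = idea.split(/\r?\n/, 1)[0] ?? "";
  return firstLine.slice(0, max);
}

// The old ordering, for comparison: capping alone keeps the newline
// (and part of line 2) whenever line 1 is shorter than the cap.
function sliceOnly(idea: string, max = 60): string {
  return idea.slice(0, max);
}
```

With the input "Build a CLI\nwith tests", sliceOnly still contains the newline and the start of line 2, while firstLineShort stops at the end of line 1.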
… Ralph
Ports the Python harness's orchestration loops into the TypeScript
extension so the chief can drive real code-enforced fan-out instead of
emitting parallel clk_subagent calls by hand, with nothing to check
that it actually followed the prompt.
src/quality.ts (new)
Port of clk_harness/orchestration/response_quality.py. Pure regex /
string scorer — no I/O, no provider calls. Detects empty bodies,
refusal phrases, malformed ACTION / POST blocks, missing declared
POST PRODUCES keys, low CONFIDENCE: <n> values, and NEEDS_REVIEW:
true. Exposes scoreResponse, repairHint, isRecoverable, summarise.
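To make the scorer's shape concrete, here is a reduced sketch covering four of the checks listed above (empty body, refusal phrase, low CONFIDENCE, NEEDS_REVIEW). The flag names, the 0.4 penalty per flag, the 0.5 threshold, the assumed 0..1 confidence scale, and the refusal patterns are all illustrative assumptions; the real rules live in src/quality.ts and response_quality.py.

```typescript
// Illustrative sketch only. Flag names, the 0.4 penalty, the 0.5 threshold,
// and the 0..1 confidence scale are assumptions, not values from src/quality.ts.
type Flag = "empty" | "refusal" | "low_confidence" | "needs_review";

interface Quality {
  score: number; // 1 = clean, 0 = unusable
  flags: Flag[];
}

const REFUSAL = /\bI (?:can't|cannot|am unable to)\b/i;
const CONFIDENCE = /^CONFIDENCE:\s*(\d*\.?\d+)\s*$/m;
const NEEDS_REVIEW = /^NEEDS_REVIEW:\s*true\s*$/im;

// Pure string/regex scoring: no I/O and no provider calls.
function scoreResponse(text: string, minConfidence = 0.5): Quality {
  if (text.trim() === "") return { score: 0, flags: ["empty"] };
  const flags: Flag[] = [];
  if (REFUSAL.test(text)) flags.push("refusal");
  const match = CONFIDENCE.exec(text);
  if (match !== null && Number(match[1]) < minConfidence) {
    flags.push("low_confidence");
  }
  if (NEEDS_REVIEW.test(text)) flags.push("needs_review");
  // Each flag costs a fixed penalty; the score never drops below 0.
  const score = Math.max(0, 1 - 0.4 * flags.length);
  return { score, flags };
}

// Assumption: a refusal is not worth re-rolling, while other flags are.
function isRecoverable(q: Quality): boolean {
  return q.flags.length > 0 && q.flags.indexOf("refusal") === -1;
}
```

Keeping the scorer pure is what lets the test suite exercise every failure mode without tmux or a provider.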
src/consensus.ts (new)
Two primitives, both with an injectable spawn function so tests can
drive them without tmux / pi installed:
* dispatchWithQuality: wraps a single spawnSubagent in the
quality re-dispatch loop. Re-runs with a repair preamble
prepended on every recoverable failure, up to maxRetries.
* runConsensus: fans out N parallel tmux samples for the same
task, scores each, and returns all of them plus the winner.
A pool runner caps concurrent in-flight sessions via maxParallel.
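The re-dispatch loop described above can be sketched with an injected spawn function, which is also how the tests avoid tmux. The Verdict shape, the preamble wording, and the default of 2 retries are assumptions made for this sketch; only the names dispatchWithQuality, spawn, and maxRetries come from the commit message.

```typescript
// Sketch under stated assumptions: Verdict, the preamble wording, and the
// default maxRetries of 2 are illustrative, not taken from src/consensus.ts.
interface Verdict {
  ok: boolean;
  recoverable: boolean;
  hint: string; // what to tell the worker to fix
}

type Spawn = (prompt: string) => Promise<string>;
type Judge = (text: string) => Verdict;

interface DispatchResult {
  text: string;
  attempts: number;
  ok: boolean;
}

async function dispatchWithQuality(
  spawn: Spawn,
  judge: Judge,
  task: string,
  maxRetries = 2,
): Promise<DispatchResult> {
  let prompt = task;
  let attempts = 0;
  for (;;) {
    const text = await spawn(prompt);
    attempts += 1;
    const verdict = judge(text);
    const retriesUsed = attempts - 1;
    // Stop on success, on a non-recoverable failure, or when retries run out.
    if (verdict.ok || !verdict.recoverable || retriesUsed >= maxRetries) {
      return { text, attempts, ok: verdict.ok };
    }
    // Quote the detected problem back to the worker ahead of the original task.
    prompt = `Your previous reply was rejected: ${verdict.hint}\n\n${task}`;
  }
}
```

Because spawn and judge are parameters, a test can pass a fake spawn that returns canned replies and assert on the attempt count and the repaired prompt.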
src/subagent.ts
Exposes spawnSubagent + SpawnOptions so consensus.ts can call them.
Behaviour unchanged.
src/tools.ts (+428 LOC)
Four new tools registered alongside the existing roster:
* clk_subagent_quality — one subagent + quality re-rolls.
* clk_consensus — N samples, scored, winner returned.
* clk_autoresearch — researcher + critic alternation
(iterations are recorded on progress.md).
* clk_ralph — branch + consensus fan-out in one call;
the chief then calls clk_merge or
clk_revert based on validation.
Each tool surfaces a structured details payload so the chief sees
scores, attempts, and flags rather than just the winning text.
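As one example of what these tools enforce, the researcher + critic alternation behind clk_autoresearch might look roughly like the sketch below. The role names as a type, the prompt wording, and the way a critique is fed into the next researcher turn are assumptions; the default of 2 iterations is taken from the PR summary. Writing iterations to progress.md is left out to keep the sketch free of I/O.

```typescript
// Sketch under stated assumptions: prompt wording and the feed-forward of
// the critique are illustrative; this is not the code in src/tools.ts.
type Role = "researcher" | "critic";
type RoleSpawn = (role: Role, prompt: string) => Promise<string>;

interface Iteration {
  draft: string;
  critique: string;
}

async function autoresearch(
  spawn: RoleSpawn,
  topic: string,
  iterations = 2,
): Promise<Iteration[]> {
  const log: Iteration[] = [];
  let feedback = "";
  for (let i = 0; i < iterations; i++) {
    // After the first round, the researcher sees the critic's last notes.
    const prompt =
      feedback === "" ? topic : `${topic}\n\nAddress this critique:\n${feedback}`;
    const draft = await spawn("researcher", prompt);
    const critique = await spawn("critic", draft);
    log.push({ draft, critique });
    feedback = critique;
  }
  return log; // one entry per iteration, in order
}
```

Returning the full log rather than only the last draft matches the intent stated above: the chief sees each iteration, not just the final text.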
src/prompts.ts
Updated chief primer to direct the chief through the new tools
(Dispatch tool quick reference, restated rules 3, 4, 5A). The old
"emit 3-5 clk_subagent calls in the same message" guidance is
replaced by "call clk_consensus" so fan-out is enforced in code,
not by chief compliance.
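Enforcing fan-out in code, as described above, could look like the following sketch of a consensus runner with a bounded worker pool. The sample clamp (1..6, default 3) and the maxParallel default of min(4, samples) are taken from the PR summary; the Sample shape, scoring an errored sample as 0, and the tie-break toward the earlier sample are assumptions.

```typescript
// Sketch under stated assumptions: Sample, the error score of 0, and the
// tie-break are illustrative; this is not the code in src/consensus.ts.
type SampleSpawn = (task: string, index: number) => Promise<string>;
type Score = (text: string) => number;

interface Sample {
  index: number;
  text: string;
  score: number;
  error?: string;
}

async function runConsensus(
  spawn: SampleSpawn,
  score: Score,
  task: string,
  samples = 3,
  maxParallel?: number,
): Promise<{ winner: Sample; all: Sample[] }> {
  const n = Math.min(6, Math.max(1, Math.floor(samples)));
  const limit = Math.max(1, Math.min(maxParallel ?? 4, n));
  const all: Sample[] = new Array(n);
  let next = 0;

  // Each worker pulls the next unclaimed index until none remain, so at
  // most `limit` spawns are in flight at any time.
  async function worker(): Promise<void> {
    while (next < n) {
      const index = next++;
      try {
        const text = await spawn(task, index);
        all[index] = { index, text, score: score(text) };
      } catch (err) {
        // A failed sample is recorded rather than aborting the fan-out.
        all[index] = { index, text: "", score: 0, error: String(err) };
      }
    }
  }

  await Promise.all(Array.from({ length: limit }, () => worker()));
  const winner = all.reduce((best, s) => (s.score > best.score ? s : best));
  return { winner, all };
}
```

Returning every sample alongside the winner keeps the losing candidates available for inspection.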
src/index.ts
/clk-help lists every orchestration tool and notes the
CLK_GITHUB_PUSH_ON_COMMIT auto-push behaviour that landed in the prior
commit.
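The auto-push gating from the prior commit can be sketched as below. The GitOps interface, the synchronous calls, and the returned status string are assumptions made so the sketch runs without a repository; the real helpers (hasRemote, commitsAhead, pushBestEffort) live in src/git.ts and presumably shell out to git.

```typescript
// Sketch under stated assumptions: GitOps and the status-string format are
// hypothetical; only the env var name and helper names come from the PR.
interface GitOps {
  hasRemote(): boolean;
  commitsAhead(): number;
  // Best-effort push: reports failure in the result and must not throw.
  pushBestEffort(): { ok: boolean; reason?: string };
}

function pushIfEnabled(
  env: Record<string, string | undefined>,
  git: GitOps,
): string {
  if (!git.hasRemote()) return ""; // nothing to push to
  if (env["CLK_GITHUB_PUSH_ON_COMMIT"] === "true") {
    const result = git.pushBestEffort();
    if (result.ok) return "";
    // Push failed: surface the reason and how far ahead we are.
    return `push failed (${result.reason ?? "unknown"}) ↑${git.commitsAhead()}`;
  }
  // Auto-push disabled: only mention local commits if there are any.
  const ahead = git.commitsAhead();
  return ahead > 0 ? `↑${ahead}` : "";
}
```

An empty string here means there is nothing to report; any other value is the note a status bar or /clk-doctor could show.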
Tests: 24 new tests across quality.test.ts (happy paths, every failure
mode, repairHint / isRecoverable / summarise) and consensus.test.ts
(injected spawn covers ok / retry / max-retries / non-recoverable
refusal / fan-out winner picking / sample clamping / error capture /
maxParallel concurrency). index.test.ts and prompts.test.ts updated to
assert the new tools are registered and named in the chief primer.
All 94 tests pass, typecheck clean.
…, doctor
Updates both READMEs to reflect the orchestration work that just landed
in pi-extension and the recent main-line PRs (push-on-commit, doctor /
diag CLI, multi-line truncation fix) that already shipped to master but
weren't fully cross-referenced.
pi-extension/README.md (full rewrite, +293 net lines)
* Replaces the "8 small tools" narrative with a proper Tool Reference
that groups roster / dispatch / iterative-refinement and explains
when to pick clk_subagent vs clk_subagent_quality vs clk_consensus
vs clk_autoresearch vs clk_ralph.
* New "Response-quality scoring" section listing every flag the
detector raises and how the repair-preamble loop quotes them back
to the worker. Cross-references the Python harness's
response_quality.py so behaviour drift between the two
implementations is one diff away from being noticed.
* New "Auto-push (opt-in)" section covering CLK_GITHUB_PUSH_ON_COMMIT,
the ↑N ahead counter, and the pre-push secret-scanner interaction.
* Commands table extended with /clk-help, /clk-doctor, /clk-undo
(these existed in the code but the README only listed /clk and
/clk-abort).
* "What you keep / what changes" tables rewritten: stochastic
consensus, quality re-dispatch, and Ralph refinement are now
described as code-enforced (not chief-compliance dependent), and
the comparison row about robustness loops names the new tools as
the per-call equivalents of the Python harness's
clk.config.json::robustness.* knobs.
* Repository layout updated with src/quality.ts, src/consensus.ts,
the new test files, and explicit per-file purposes.
* "Testing" section reflects the real 96-test count and notes the
suite runs entirely offline (consensus tests inject a fake spawn).
README.md (main) — targeted updates
* Pi extension section: brief but accurate rundown of the new
orchestration tools, a Commands table that matches /clk-help, the
CLK_GITHUB_PUSH_ON_COMMIT env var, and an updated example
transcript that uses clk_consensus / clk_autoresearch / clk_ralph
by name rather than the "fans out to 3 subagents" abstraction.
* Layout section: pi-extension/ subtree expanded to show every src/
file with a one-line purpose, including the new quality.ts and
consensus.ts.
* Testing section: pi-extension test count corrected from 53 to 96
(~1s → ~2s), and the per-suite description rewritten to name the
new modules (quality / consensus / git auto-push helpers /
firstLineShort) so a contributor browsing the README knows what
is and isn't covered.
Summary
This PR ports the Python harness's response-quality scoring and stochastic consensus logic into the Pi extension as real, enforceable tools. Previously, orchestration policy lived only in the chief's prompt; now it's enforced in code so the chief cannot accidentally skip critical steps like parallel sampling, quality validation, or Ralph branch creation.
Key Changes
* Response-quality scorer (src/quality.ts): TypeScript port of clk_harness/orchestration/response_quality.py. Detects empty responses, refusals, malformed blocks, low confidence, and missing declared outputs. Generates repair hints for recoverable failures so re-rolls fix specific issues rather than re-rolling at random.
* Consensus and quality-dispatch primitives (src/consensus.ts):
  - dispatchWithQuality(): wraps a single subagent dispatch with automatic quality re-dispatch loop (up to maxRetries attempts with repair preambles)
  - runConsensus(): fans out N parallel tmux subagent samples, scores each, returns the highest-scoring winner plus all candidates for traceability
* New orchestration tools (src/tools.ts):
  - clk_consensus: fan-out N parallel samples (default 3, clamped 1..6) for high-stakes decisions
  - clk_subagent_quality: single subagent with quality gate and automatic repair re-rolls
  - clk_autoresearch: bounded researcher + critic alternation (default 2 iterations)
  - clk_ralph: one-call Ralph iteration: creates branch, fans out consensus, returns winner (branch creation and fan-out happen in one step and cannot be skipped)
* Git push integration (src/git.ts):
  - hasRemote(): check if a remote exists
  - commitsAhead(): count local commits not yet on upstream
  - pushBestEffort(): best-effort git push that never throws, returns success/failure with reason
  - pushIfEnabled() helper in tools.ts auto-pushes on CLK_GITHUB_PUSH_ON_COMMIT=true and surfaces ↑N ahead count in status bar
* Updated chief primer (src/prompts.ts): New dispatch tool quick reference explaining when to use each tool (subagent vs. quality vs. consensus vs. autoresearch vs. ralph).
* Comprehensive test coverage:
  - tests/quality.test.ts: unit tests for scoring logic (mirrors Python harness tests)
  - tests/consensus.test.ts: tests for quality-dispatch loop and consensus fan-out with injectable spawn function
  - tests/git.test.ts: tests for remote/push/ahead helpers
  - tests/index.test.ts and tests/prompts.test.ts to verify new tools are registered
* Documentation (pi-extension/README.md): Expanded with tool reference section, quality-scoring rules, and clarification that orchestration policy is now enforced in code, not just in the prompt.
Notable Implementation Details
The consensus pool runner uses maxParallel to limit concurrent tmux sessions (default min(4, samples)).
https://claude.ai/code/session_012nKhcka2fhuazbVbhQpRm1