Add robustness loops: quality validation, consensus, and critic-judge refinement #64
Merged
…ch, Q&A, refine, plateau adaptation

Adds five reinforcing robustness layers around the existing dispatch chokepoint in AgentRunner.run() and the workflow stage runner, all gated by a new `robustness` config block with kill switches.

* Layer 1: auto-consensus + output-quality re-dispatch (agent.py + new response_quality.py). Every non-meta dispatch is now scored after the provider returns; empty / malformed / contract-violating / low-confidence responses are re-dispatched with a repair preamble, escalating to a stochastic consensus fan-out on the final retry. Stages marked `careful: true` (or all stages when set to "always") fan out into N samples proactively without needing the agent to emit PROPOSE_CONSENSUS.
* Layer 2: inter-agent Q&A protocol (blackboard.py + agent.py). POST: question blocks now carry TO/URGENCY; blocking questions are routed inline to the target peer so the asker effectively sees the answer in subsequent rounds. Caps Q&A chain depth via robustness.max_qa_depth.
* Layer 3: critic-judge refinement loop (workflow.py). New `refine:` stage attribute runs draft → critic → revise until the critic accepts or max_rounds is hit, with auto_refine triggering the same for `careful: true` stages by default.
* Layer 4: adaptive Ralph + autoresearch (ralph_loop.py, autoresearch_loop.py). Both loops now guard against malformed planner output (skip the iteration rather than commit garbage), detect plateau / regression, escalate via consensus, reframe via the chief, and terminate gracefully when escalation can't break the plateau. Autoresearch gains an evaluator-gated rollback parallel to Ralph's.
* Layer 5: prompt updates (templates/prompts.py). Adds CONFIDENCE / NEEDS_REVIEW footer to every base agent and teaches Q&A + plateau awareness in the shared blackboard protocol and ralph.md.
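The Layer 1 flow (score the response, re-dispatch with a repair preamble, escalate to consensus on the final retry) can be sketched roughly as below. This is a minimal illustration, not the code in agent.py: `Verdict`, `score_response`, `dispatch_with_quality_loop`, the `[REPAIR]` preamble format, and the default values are all hypothetical, and the scorer only checks emptiness and length rather than contract violations or confidence.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    ok: bool
    hint: str = ""  # repair hint fed back to the agent on retry


def score_response(text: str, min_chars: int = 40) -> Verdict:
    """Hypothetical scorer: flags empty or too-short responses only."""
    stripped = text.strip()
    if not stripped:
        return Verdict(False, "response was empty")
    if len(stripped) < min_chars:
        return Verdict(False, f"response shorter than {min_chars} characters")
    return Verdict(True)


def dispatch_with_quality_loop(
    prompt: str,
    dispatch: Callable[[str], str],
    consensus: Callable[[str], str],
    max_quality_retries: int = 2,  # placeholder default, not taken from the PR
    min_chars: int = 40,           # placeholder default, not taken from the PR
) -> str:
    """Dispatch once, then retry with a repair preamble while scoring fails.

    The final retry is handed to `consensus` instead of a plain dispatch.
    The consensus result is returned without being re-scored.
    """
    response = dispatch(prompt)
    for attempt in range(1, max_quality_retries + 1):
        verdict = score_response(response, min_chars)
        if verdict.ok:
            return response
        repaired = f"[REPAIR] Previous attempt rejected: {verdict.hint}.\n\n{prompt}"
        if attempt == max_quality_retries:
            response = consensus(repaired)  # last chance: fan out
        else:
            response = dispatch(repaired)
    return response
```

Passing `dispatch` and `consensus` as callables keeps the sketch testable without a provider; with `max_quality_retries=0` the loop degenerates to a single plain dispatch, which is one way a kill switch could behave.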
29 new tests cover response-quality scoring, POST: question parsing with TO/URGENCY, find_unanswered_questions, workflow YAML parsing of refine:, quality-retry firing on empty responses, retry capping at max_quality_retries, meta-phase bypass of the quality loop, and the auto-consensus mode matrix (off / on_careful / always). All 180 existing tests continue to pass. https://claude.ai/code/session_01XYLA51b49wz4S2THibSjaF
…tency test

Docs-everywhere pass that completes the robustness-loops work. Anyone cloning the repo can now read the README + the install scripts and know what each loop does, how it triggers, and how to dial it up/down, including the prior knobs (provider retry, supervise cycles, consensus caps, recovery, meta-prompting) that had no single home in the docs before.

README.md
* New "## Robustness loops" section (after "## Loops") that walks through every loop in order (provider retry, stage retry, supervise cycles, recovery on unmet deps, review/checkpoint stages, auto-quality re-dispatch, stochastic consensus (opt-in + automatic), inter-agent Q&A, critic-judge refinement, adaptive Ralph & autoresearch), each with the YAML/config snippet that tunes it, the activity-log event name to grep for, and a kill switch.
* New "Putting it together" subsection traces a careful stage end to end so the user can see all layers compose.
* "What's new" gets a top-of-file changelog entry summarising the layers and pointing at the new section.
* "## Self-healing on unmet deps" cross-links the new section (dependency vs. content failures).
* "## Dynamic agents (casting)" mentions the new Q&A POST type.
* "## Cost guardrails" grows a "Robustness-loop multipliers" subsection with a worst-case-cost table per knob plus cost-minimal and cost-maximal config recipes.

.env.example
* New "Robustness loops" block with CLK_ROBUSTNESS_* lines for every knob: AUTO_CONSENSUS, AUTO_REFINE, MAX_QUALITY_RETRIES, MIN_RESPONSE_CHARS, REFINE_MAX_ROUNDS, REFINE_ACCEPT_THRESHOLD, QA_PARALLEL_JUDGES, MAX_QA_DEPTH, PLATEAU_WINDOW, PLATEAU_ACTION.
* New "Prior-knob reference" block documenting the legacy knobs that the docs claimed parity for but had no single source: provider timeouts and retry policy, supervise cycles, consensus caps, casting cap, auto-commit, validation batch caps, meta-prompt mode, review per-stage, recovery cap.
kickoff.sh
* After `clk init`, a new Python block reads every CLK_ROBUSTNESS_* and prior CLK_* env var and writes it into .clk/config/clk.config.json via dotted-path assignment. Unset vars fall through to DEFAULT_CLK_CONFIG; partially-set envs are honored. Header comment enumerates the full env-var surface so future contributors can find it.

scripts/install_local.sh
* Header expanded from ~13 lines to a self-contained narrative that describes (a) what the script does, (b) the .clk/ directory layout it creates, (c) all three install strategies with their fallbacks, (d) the extras-group syntax, (e) what the script does NOT install (provider CLIs, telegram, docker, github) with pointers to the README sections that cover each, (f) the related entry points (scripts/clk, install_tool.sh, run_loop.sh, kickoff.sh).

scripts/clk, install_tool.sh, run_loop.sh
* Header cross-references added so any one of them lands a reader on the canonical install procedure (install_local.sh) and the README sections (Loops, Robustness loops, Provider and authentication).

tests/test_docs_consistency.py
* New mechanical assertions that the four sources stay aligned: DEFAULT_CLK_CONFIG['robustness'] keys ↔ CLK_ROBUSTNESS_* lines in .env.example ↔ env-var mapping in kickoff.sh ↔ README mentions in the Robustness-loops + Cost-guardrails sections. Plus parity checks for the prior-knob inventory and the install-script header content. Eight tests, all passing; adding a new robustness knob in the future requires touching all four files or the suite fails.

All 188 unit tests pass. https://claude.ai/code/session_01XYLA51b49wz4S2THibSjaF
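The env-var override step in kickoff.sh (read CLK_ROBUSTNESS_* values, write them into the config by dotted path, leave unset vars at their defaults) might look something like the following. This is a sketch under assumptions: the `ENV_TO_PATH` table, the helper names, and the JSON-based type coercion are illustrative, and only three of the knobs are mapped here.

```python
import json
from typing import Any, Dict, Mapping

# Hypothetical subset of the mapping; the real list lives in .env.example.
ENV_TO_PATH = {
    "CLK_ROBUSTNESS_AUTO_CONSENSUS": "robustness.auto_consensus",
    "CLK_ROBUSTNESS_MAX_QUALITY_RETRIES": "robustness.max_quality_retries",
    "CLK_ROBUSTNESS_PLATEAU_WINDOW": "robustness.plateau_window",
}


def coerce(raw: str) -> Any:
    """Parse JSON scalars (ints, booleans); fall back to the raw string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def set_dotted(config: Dict[str, Any], path: str, value: Any) -> None:
    """Assign `value` at a dotted path, creating intermediate dicts as needed."""
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def apply_env_overrides(config: Dict[str, Any], env: Mapping[str, str]) -> Dict[str, Any]:
    """Overlay any set env vars onto `config`; unset or empty vars are skipped."""
    for name, path in ENV_TO_PATH.items():
        raw = env.get(name)
        if raw:
            set_dotted(config, path, coerce(raw))
    return config
```

In a real script the second argument would typically be `os.environ`; taking a plain mapping keeps the function easy to exercise in tests, which matters given that the docs-consistency suite checks this mapping against the config keys.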
- README ## Docker section: add a callout block explaining that CLK_RUN_INSTALL should stay false (the default) in Docker because the Dockerfile already installs all deps at build time.
- README ## First-run setup wizard: expand the 'Loop settings' bullet to explain what the install flag does and explicitly say to leave it false in Docker.
- kickoff.sh setup wizard: extend the _sv_explain text for the 'run install' prompt to tell users running in Docker that false is correct because the image ships with deps pre-installed.

https://claude.ai/code/session_01XYLA51b49wz4S2THibSjaF
Summary
This PR introduces a comprehensive robustness framework that wraps agent dispatches with three layers of quality assurance and adaptive retry logic:
- Stages marked `careful: true` (or all stages when configured) fan out into N parallel samples with chief coalescing

Additionally, the Ralph and autoresearch loops now detect plateaus and regressions, escalating to consensus and reframing rather than burning the full iteration budget.
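One way to picture the plateau/regression classification is a function that compares the latest iteration score with the best score in a recent window. The sketch below is an assumption-laden illustration: the function name, the `epsilon` tolerance, the "higher is better" score convention, and the exact comparison rule are hypothetical and may differ from what ralph_loop.py actually does.

```python
from typing import Sequence


def progress_verdict(
    scores: Sequence[float],
    window: int = 3,        # placeholder for a plateau_window-style knob
    epsilon: float = 1e-3,  # hypothetical tolerance for "no real change"
) -> str:
    """Classify the latest score as 'improving', 'plateau', or 'regressing'.

    The latest score is compared with the best of up to `window` scores
    that precede it. Higher scores are assumed to be better.
    """
    if len(scores) < 2:
        return "improving"  # not enough history to call a plateau
    latest = scores[-1]
    baseline = max(scores[-(window + 1):-1])
    if latest > baseline + epsilon:
        return "improving"
    if latest < baseline - epsilon:
        return "regressing"
    return "plateau"
```

A loop could call this after each iteration and escalate (consensus fan-out, then reframing, then graceful termination) only when the verdict stays at `plateau` or `regressing`, instead of spending the remaining iteration budget.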
Key Changes
Core Robustness Layers
- `response_quality.py` (new): Validator module that detects empty, malformed, contract-violating, or low-confidence responses and produces a `ResponseQuality` verdict with recoverable/non-recoverable distinction and repair hints
- `agent.py`: Refactored dispatch entry point:
  - `run()` now wraps `_dispatch_once()` with quality and consensus layers
  - `_dispatch_with_quality_loop()`: retries failed responses with repair preambles, escalates to consensus on final retry
  - `_dispatch_auto_consensus()`: fans out to N samples + chief coalescing
  - `_should_auto_consensus()`: proactive fan-out decision logic
  - `_META_PHASES` frozenset prevents recursion (consensus, checkpoint, recovery, critic, etc. never re-trigger loops)
- `workflow.py`: Critic-judge refinement:
  - `WorkflowStage.refine` field for explicit critic configuration
  - `_refine_enabled()`: decides when to run critic based on stage config or `robustness.auto_refine` policy
  - `_refine_loop()`: dispatches critic, scores response, re-dispatches worker with feedback up to max_rounds
  - Invoked in `_run_stage()` after initial dispatch

Adaptive Loop Control
- `ralph_loop.py`: Plateau and regression detection:
  - `_progress_verdict()`: classifies iterations as improving/plateau/regressing
  - `_handle_plateau()`: escalates to consensus fan-out, then reframes workflow
  - `_should_terminate_for_plateau()`: graceful termination when stuck
  - `_adaptive_extra()`: injects `careful=true` into next iteration's dispatches when escalating
- `autoresearch_loop.py`: Similar plateau detection for experiment loops

Blackboard & Q&A Protocol
- `blackboard.py`: Extended `Post` class:
  - `target_agent` and `urgency` fields for directed questions
  - `find_unanswered_questions()` filters by target agent
  - `urgency: blocking` answered inline; `async` surfaced to chief

Configuration & Documentation
- `config.py`: New `robustness` block in `DEFAULT_CLK_CONFIG`:
  - `auto_consensus` (off | on_careful | always)
  - `auto_refine` (off | careful_only | all)
  - `max_quality_retries`, `min_response_chars`
  - `refine_max_rounds`, `refine_accept_threshold`
  - `max_qa_depth`, `plateau_window`, `plateau_action`
- `.env.example`: Documented all `CLK_ROBUSTNESS_*` env vars with cost multipliers
- `kickoff.sh`: Added env-var override block that maps `CLK_ROBUSTNESS_*` into `clk.config.json`
- `README.md`: Comprehensive sections:

https://claude.ai/code/session_01XYLA51b49wz4S2THibSjaF
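To make the `auto_consensus` and `auto_refine` mode values concrete, here is a rough sketch of how such policy switches could be evaluated. The function names mirror the summary but drop the leading underscore to signal they are illustrative; the `META_PHASES` members are the ones the summary names (the real set includes more), and the rule that an explicit `refine:` block always enables refinement is an assumption, not something the PR states.

```python
# Phases named in the summary as never re-triggering loops (the real set is longer).
META_PHASES = frozenset({"consensus", "checkpoint", "recovery", "critic"})


def should_auto_consensus(mode: str, careful: bool, phase: str) -> bool:
    """Decide whether a dispatch should proactively fan out into N samples.

    mode is one of "off", "on_careful", "always"; unknown values act like "off".
    """
    if phase in META_PHASES:
        return False  # meta phases never fan out, which prevents recursion
    if mode == "always":
        return True
    if mode == "on_careful":
        return careful
    return False


def refine_enabled(mode: str, careful: bool, has_refine_block: bool) -> bool:
    """Decide whether the critic-judge loop runs for a stage.

    mode is one of "off", "careful_only", "all". Assumption: an explicit
    `refine:` block on the stage wins regardless of the global mode.
    """
    if has_refine_block:
        return True
    if mode == "all":
        return True
    if mode == "careful_only":
        return careful
    return False
```

For example, `should_auto_consensus("on_careful", careful=True, phase="work")` is true while the same call with `phase="critic"` is false, which is the property that keeps a critic dispatch from spawning its own consensus round. The phase name `"work"` is a placeholder for any non-meta phase.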