Add robustness loops: quality validation, consensus, and critic-judge refinement#64

Merged
BillJr99 merged 3 commits into master from claude/agent-robustness-looping-g9eBe
May 24, 2026

@BillJr99
Owner

Summary

This PR introduces a comprehensive robustness framework that wraps agent dispatches with three layers of quality assurance and adaptive retry logic:

  1. Response Quality Validation — scores responses for emptiness, malformed blocks, missing outputs, and low confidence
  2. Auto-Consensus Fan-Out — stages marked careful: true (or all stages when configured) fan into N parallel samples with chief coalescing
  3. Critic-Judge Refinement — draft → critic → revise inner loop that iterates until the critic approves or max rounds reached

Additionally, the Ralph and autoresearch loops now detect plateaus and regressions, escalating to consensus and reframing rather than burning the full iteration budget.
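
The critic-judge inner loop from item 3 above can be sketched roughly as follows. This is a minimal illustration, not the code in workflow.py: `CriticVerdict`, `refine_loop`, the numeric score, and the default values are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class CriticVerdict:
    score: float    # hypothetical 0.0-1.0 quality score assigned by the critic
    feedback: str   # what the worker should change in the next draft


def refine_loop(
    draft: str,
    critic: Callable[[str], CriticVerdict],
    revise: Callable[[str, str], str],
    max_rounds: int = 2,
    accept_threshold: float = 0.8,
) -> Tuple[str, int]:
    """Run draft -> critic -> revise and return (final_draft, rounds_used)."""
    for round_no in range(1, max_rounds + 1):
        verdict = critic(draft)
        if verdict.score >= accept_threshold:
            return draft, round_no          # critic accepted this draft
        draft = revise(draft, verdict.feedback)
    # Round budget exhausted: return the last revision without another critique.
    return draft, max_rounds
```

In this sketch the worst case costs `max_rounds` critic calls plus `max_rounds` worker revisions on top of the initial draft, which is one way to reason about the cost of a refine setting.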

Key Changes

Core Robustness Layers

  • response_quality.py (new) — Validator module that detects:

    • Empty or sub-threshold responses
    • Malformed ACTION/POST blocks
    • Missing declared outputs
    • Low confidence or needs-review flags
    • Refusal patterns
    • Returns ResponseQuality verdict with recoverable/non-recoverable distinction and repair hints
  • agent.py — Refactored dispatch entry point:

    • run() now wraps _dispatch_once() with quality and consensus layers
    • _dispatch_with_quality_loop() — retries failed responses with repair preambles, escalates to consensus on final retry
    • _dispatch_auto_consensus() — fans out to N samples + chief coalescing
    • _should_auto_consensus() — proactive fan-out decision logic
    • _META_PHASES frozenset prevents recursion (consensus, checkpoint, recovery, critic, etc. never re-trigger loops)
  • workflow.py — Critic-judge refinement:

    • WorkflowStage.refine field for explicit critic configuration
    • _refine_enabled() — decides when to run critic based on stage config or robustness.auto_refine policy
    • _refine_loop() — dispatches critic, scores response, re-dispatches worker with feedback up to max_rounds
    • Integrated into _run_stage() after initial dispatch
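
To make the quality layer concrete, here is a minimal sketch of a validator plus a retry wrapper. It assumes a simplified `ResponseQuality` shape and checks only two of the listed conditions (short responses and refusal patterns). The refusal markers, the `[REPAIR]` prefix, and the choice to treat refusals as non-recoverable are illustrative assumptions, not the behaviour of response_quality.py or agent.py.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative refusal markers only; the real pattern list is not shown in this PR.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable")


@dataclass
class ResponseQuality:
    ok: bool
    recoverable: bool = True
    reason: str = ""
    repair_hint: str = ""


def score_response(text: str, min_chars: int = 20) -> ResponseQuality:
    """Score one response; only emptiness and refusals are checked here."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return ResponseQuality(False, True, "empty or too short",
                               "Your previous reply was too short. Answer in full.")
    if any(marker in stripped.lower() for marker in REFUSAL_MARKERS):
        # Assumption for this sketch: a refusal is not worth retrying.
        return ResponseQuality(False, False, "refusal pattern")
    return ResponseQuality(True)


def dispatch_with_quality_loop(
    dispatch_once: Callable[[str], str],
    prompt: str,
    max_retries: int = 2,
    consensus: Optional[Callable[[str], str]] = None,
    min_chars: int = 20,
) -> str:
    """Retry weak responses with a repair preamble; fan out on the final retry."""
    attempt_prompt = prompt
    response = ""
    for attempt in range(max_retries + 1):
        final_retry = attempt > 0 and attempt == max_retries
        if final_retry and consensus is not None:
            response = consensus(attempt_prompt)      # escalate to consensus
        else:
            response = dispatch_once(attempt_prompt)
        verdict = score_response(response, min_chars)
        if verdict.ok or not verdict.recoverable:
            return response
        attempt_prompt = f"[REPAIR] {verdict.repair_hint}\n\n{prompt}"
    return response
```

With `max_retries=2`, a dispatch that keeps failing is called twice directly and once through the consensus callable, so the worst case here is three attempts per dispatch.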

Adaptive Loop Control

  • ralph_loop.py — Plateau and regression detection:

    • _progress_verdict() — classifies iterations as improving/plateau/regressing
    • _handle_plateau() — escalates to consensus fan-out, then reframes workflow
    • _should_terminate_for_plateau() — graceful termination when stuck
    • _adaptive_extra() — injects careful=true into next iteration's dispatches when escalating
  • autoresearch_loop.py — Similar plateau detection for experiment loops
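
A rough sketch of the improving/plateau/regressing classification and the escalation ladder, assuming each iteration yields a numeric score where higher is better. The PR description does not state which signal `_progress_verdict()` actually inspects, so the scoring input, the `tolerance` parameter, and both function names below are assumptions.

```python
from typing import Sequence


def progress_verdict(scores: Sequence[float], window: int = 3,
                     tolerance: float = 1e-6) -> str:
    """Classify recent iterations as 'improving', 'plateau', or 'regressing'."""
    if len(scores) <= window:
        return "improving"              # too little history to call a plateau
    recent = scores[-window:]
    baseline = max(scores[:-window])    # best score before the window
    if max(recent) > baseline + tolerance:
        return "improving"
    if recent[-1] < baseline - tolerance:
        return "regressing"
    return "plateau"


def escalation_step(consecutive_plateaus: int) -> str:
    """Map the count of consecutive non-improving verdicts to the next action."""
    ladder = ("consensus", "reframe", "terminate")
    if consecutive_plateaus <= 0:
        return "continue"
    return ladder[min(consecutive_plateaus, len(ladder)) - 1]
```

The ladder order mirrors the description above: consensus fan-out first, then a reframe, then graceful termination once escalation has not helped.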

Blackboard & Q&A Protocol

  • blackboard.py — Extended Post class:
    • target_agent and urgency fields for directed questions
    • find_unanswered_questions() filters by target agent
    • Questions with urgency: blocking are answered inline; async questions are surfaced to the chief
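
As an illustration of directed questions, the sketch below models a post with `target_agent` and `urgency` and filters unanswered questions by target. Only those two field names and `find_unanswered_questions()` come from this PR; `post_id`, `author`, `kind`, `body`, and `reply_to` are hypothetical fields added so the example is self-contained.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Post:
    post_id: str                        # hypothetical identifier
    author: str                         # hypothetical
    kind: str                           # hypothetical: "question" or "answer"
    body: str
    target_agent: Optional[str] = None  # who the question is directed to
    urgency: str = "async"              # "blocking" or "async"
    reply_to: Optional[str] = None      # hypothetical link from answer to question


def find_unanswered_questions(posts: List[Post],
                              target_agent: Optional[str] = None) -> List[Post]:
    """Return questions with no answer, optionally limited to one target agent."""
    answered = {p.reply_to for p in posts if p.kind == "answer" and p.reply_to}
    return [
        p for p in posts
        if p.kind == "question"
        and p.post_id not in answered
        and (target_agent is None or p.target_agent == target_agent)
    ]
```

A caller could then split the result on `urgency`, routing `"blocking"` questions inline and leaving `"async"` ones for the chief.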

Configuration & Documentation

  • config.py — New robustness block in DEFAULT_CLK_CONFIG:

    • auto_consensus (off | on_careful | always)
    • auto_refine (off | careful_only | all)
    • max_quality_retries, min_response_chars
    • refine_max_rounds, refine_accept_threshold
    • max_qa_depth, plateau_window, plateau_action
  • .env.example — Documented all CLK_ROBUSTNESS_* env vars with cost multipliers

  • kickoff.sh — Added env-var override block that maps CLK_ROBUSTNESS_* into clk.config.json

  • README.md — Comprehensive sections:

    • "Robustness loops" explaining all four layers
    • "Robustness-loop multipliers"
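
For orientation, a robustness block with the keys listed above might look like the following. The key names and the allowed values for `auto_consensus` and `auto_refine` come from this PR; every numeric value and the `plateau_action` value are placeholders, since the actual defaults in DEFAULT_CLK_CONFIG are not shown here. `validate_robustness` is a hypothetical helper.

```python
# Allowed mode values as listed in the PR description.
ALLOWED_MODES = {
    "auto_consensus": {"off", "on_careful", "always"},
    "auto_refine": {"off", "careful_only", "all"},
}

# Placeholder values for illustration only, not the shipped defaults.
EXAMPLE_ROBUSTNESS = {
    "auto_consensus": "on_careful",
    "auto_refine": "careful_only",
    "max_quality_retries": 2,
    "min_response_chars": 20,
    "refine_max_rounds": 2,
    "refine_accept_threshold": 0.8,
    "max_qa_depth": 2,
    "plateau_window": 3,
    "plateau_action": "escalate",   # allowed values are not listed in this PR
}


def validate_robustness(block: dict) -> list:
    """Return human-readable errors for mode keys with unknown values."""
    errors = []
    for key, allowed in ALLOWED_MODES.items():
        value = block.get(key)
        if value not in allowed:
            errors.append(f"{key}={value!r} not in {sorted(allowed)}")
    return errors
```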

https://claude.ai/code/session_01XYLA51b49wz4S2THibSjaF

claude added 3 commits May 24, 2026 15:06
…ch, Q&A, refine, plateau adaptation

Adds five reinforcing robustness layers around the existing dispatch chokepoint
in AgentRunner.run() and the workflow stage runner, all gated by a new
`robustness` config block with kill switches.

* Layer 1 — auto-consensus + output-quality re-dispatch (agent.py +
  new response_quality.py). Every non-meta dispatch is now scored after
  the provider returns; empty / malformed / contract-violating /
  low-confidence responses are re-dispatched with a repair preamble,
  escalating to a stochastic consensus fan-out on the final retry.
  Stages marked `careful: true` (or all stages when set to "always")
  fan out into N samples proactively without needing the agent to emit
  PROPOSE_CONSENSUS.
* Layer 2 — inter-agent Q&A protocol (blackboard.py + agent.py).
  POST: question blocks now carry TO/URGENCY; blocking questions are
  routed inline to the target peer so the asker effectively sees the
  answer in subsequent rounds. Caps Q&A chain depth via
  robustness.max_qa_depth.
* Layer 3 — critic-judge refinement loop (workflow.py). New `refine:`
  stage attribute runs draft → critic → revise until the critic
  accepts or max_rounds is hit, with auto_refine triggering the same
  for `careful: true` stages by default.
* Layer 4 — adaptive Ralph + autoresearch (ralph_loop.py,
  autoresearch_loop.py). Both loops now guard against malformed
  planner output (skip the iteration rather than commit garbage),
  detect plateau / regression, escalate via consensus, reframe via the
  chief, and terminate gracefully when escalation can't break the
  plateau. Autoresearch gains an evaluator-gated rollback parallel to
  Ralph's.
* Layer 5 — prompt updates (templates/prompts.py). Adds CONFIDENCE /
  NEEDS_REVIEW footer to every base agent and teaches Q&A + plateau
  awareness in the shared blackboard protocol and ralph.md.

29 new tests cover response-quality scoring, POST: question parsing
with TO/URGENCY, find_unanswered_questions, workflow YAML parsing of
refine:, quality-retry firing on empty responses, retry capping at
max_quality_retries, meta-phase bypass of the quality loop, and the
auto-consensus mode matrix (off / on_careful / always). All 180
existing tests continue to pass.
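
The mode matrix and meta-phase bypass mentioned above could be expressed along these lines. This is a sketch only: the function signature is assumed, and the phase set lists just the four names given in the PR description, while the real `_META_PHASES` contains more.

```python
# Partial, illustrative set; the PR lists these four and then "etc."
META_PHASES = frozenset({"consensus", "checkpoint", "recovery", "critic"})


def should_auto_consensus(mode: str, phase: str, careful: bool) -> bool:
    """Decide whether a dispatch fans out into N samples proactively."""
    if phase in META_PHASES:
        return False          # meta phases never re-trigger a loop
    if mode == "always":
        return True
    if mode == "on_careful":
        return careful
    return False              # "off" or any unrecognised mode
```

Checking the meta-phase guard first is what keeps a consensus dispatch from fanning out into further consensus dispatches.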

https://claude.ai/code/session_01XYLA51b49wz4S2THibSjaF
…tency test

Docs-everywhere pass that completes the robustness-loops work. Anyone
cloning the repo can now read the README + the install scripts and
know what each loop does, how it triggers, and how to dial it up/down
— including the prior knobs (provider retry, supervise cycles,
consensus caps, recovery, meta-prompting) that had no single home in
the docs before.

README.md
* New "## Robustness loops" section (after "## Loops") that walks
  through every loop in order — provider retry, stage retry, supervise
  cycles, recovery on unmet deps, review/checkpoint stages,
  auto-quality re-dispatch, stochastic consensus (opt-in + automatic),
  inter-agent Q&A, critic-judge refinement, adaptive Ralph &
  autoresearch — each with the YAML/config snippet that tunes it, the
  activity-log event name to grep for, and a kill switch.
* New "Putting it together" subsection traces a careful stage end to
  end so the user can see all layers compose.
* "What's new" gets a top-of-file changelog entry summarising the
  layers and pointing at the new section.
* "## Self-healing on unmet deps" cross-links the new section
  (dependency vs. content failures).
* "## Dynamic agents (casting)" mentions the new Q&A POST type.
* "## Cost guardrails" grows a "Robustness-loop multipliers" subsection
  with a worst-case-cost table per knob plus cost-minimal and
  cost-maximal config recipes.

.env.example
* New "Robustness loops" block with CLK_ROBUSTNESS_* lines for every
  knob: AUTO_CONSENSUS, AUTO_REFINE, MAX_QUALITY_RETRIES,
  MIN_RESPONSE_CHARS, REFINE_MAX_ROUNDS, REFINE_ACCEPT_THRESHOLD,
  QA_PARALLEL_JUDGES, MAX_QA_DEPTH, PLATEAU_WINDOW, PLATEAU_ACTION.
* New "Prior-knob reference" block documenting the legacy knobs that
  the docs claimed parity for but had no single source: provider
  timeouts and retry policy, supervise cycles, consensus caps, casting
  cap, auto-commit, validation batch caps, meta-prompt mode, review
  per-stage, recovery cap.

kickoff.sh
* After `clk init`, a new Python block reads every CLK_ROBUSTNESS_*
  and prior CLK_* env var and writes it into .clk/config/clk.config.json
  via dotted-path assignment. Unset vars fall through to
  DEFAULT_CLK_CONFIG; partially-set envs are honored. Header comment
  enumerates the full env-var surface so future contributors can find
  it.
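
A minimal sketch of dotted-path assignment from environment variables, in the spirit of the block described above. The env-var names follow the CLK_ROBUSTNESS_* pattern from .env.example, but the mapping table, the JSON-based type coercion, and all function names are assumptions rather than the contents of kickoff.sh.

```python
import json
import os
from typing import Any, Mapping

# Assumed mapping; the real script covers every robustness and prior knob.
ENV_TO_PATH = {
    "CLK_ROBUSTNESS_AUTO_CONSENSUS": "robustness.auto_consensus",
    "CLK_ROBUSTNESS_MAX_QUALITY_RETRIES": "robustness.max_quality_retries",
    "CLK_ROBUSTNESS_REFINE_ACCEPT_THRESHOLD": "robustness.refine_accept_threshold",
}


def coerce(raw: str) -> Any:
    """Parse numbers and booleans as JSON; leave anything else as a string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def set_dotted(config: dict, path: str, value: Any) -> None:
    """Assign value at a dotted path, creating intermediate dicts as needed."""
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def apply_env_overrides(config: dict, environ: Mapping[str, str] = os.environ) -> dict:
    """Copy every set env var into config; unset vars keep existing values."""
    for var, path in ENV_TO_PATH.items():
        raw = environ.get(var)
        if raw is None or raw == "":
            continue
        set_dotted(config, path, coerce(raw))
    return config
```

Skipping unset variables is what lets a partially-set environment override only the knobs it names.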

scripts/install_local.sh
* Header expanded from ~13 lines to a self-contained narrative that
  describes (a) what the script does, (b) the .clk/ directory layout
  it creates, (c) all three install strategies with their fallbacks,
  (d) the extras-group syntax, (e) what the script does NOT install
  (provider CLIs, telegram, docker, github) with pointers to the
  README sections that cover each, (f) the related entry points
  (scripts/clk, install_tool.sh, run_loop.sh, kickoff.sh).

scripts/clk, install_tool.sh, run_loop.sh
* Header cross-references added so any one of them lands a reader on
  the canonical install procedure (install_local.sh) and the README
  sections (Loops, Robustness loops, Provider and authentication).

tests/test_docs_consistency.py
* New mechanical assertions that the four sources stay aligned:
  DEFAULT_CLK_CONFIG['robustness'] keys ↔ CLK_ROBUSTNESS_* lines in
  .env.example ↔ env-var mapping in kickoff.sh ↔ README mentions in
  the Robustness-loops + Cost-guardrails sections. Plus parity checks
  for the prior-knob inventory and the install-script header content.
  Eight tests, all passing — adding a new robustness knob in the
  future requires touching all four files or the suite fails.
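
One way such a parity check might work is to extract knob names from the .env.example text and compare them with the config keys. The helper names and the regular expression below are hypothetical; the actual assertions in tests/test_docs_consistency.py are not shown in this PR.

```python
import re
from typing import Iterable, List, Set


def env_knobs(env_text: str, prefix: str = "CLK_ROBUSTNESS_") -> Set[str]:
    """Collect lower-cased knob names from NAME=value lines, commented or not."""
    pattern = re.compile(
        rf"^#?[ \t]*{re.escape(prefix)}([A-Z0-9_]+)=", re.MULTILINE
    )
    return {match.group(1).lower() for match in pattern.finditer(env_text)}


def missing_knobs(config_keys: Iterable[str], env_text: str) -> List[str]:
    """Return config keys that have no matching env-var line, sorted."""
    return sorted(set(config_keys) - env_knobs(env_text))
```

A test would then assert that `missing_knobs(...)` is empty, and repeat the comparison in the other direction and against the other sources.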

All 188 unit tests pass.

https://claude.ai/code/session_01XYLA51b49wz4S2THibSjaF
- README ## Docker section: add a callout block explaining that
  CLK_RUN_INSTALL should stay false (the default) in Docker because
  the Dockerfile already installs all deps at build time.
- README ## First-run setup wizard: expand the 'Loop settings' bullet
  to explain what the install flag does and explicitly say to leave it
  false in Docker.
- kickoff.sh setup wizard: extend the _sv_explain text for the
  'run install' prompt to tell users running in Docker that false is
  correct because the image ships with deps pre-installed.

https://claude.ai/code/session_01XYLA51b49wz4S2THibSjaF
BillJr99 merged commit c07c5bb into master on May 24, 2026
5 checks passed