Skip to content

chore(release): agent-app 0.2.0 — perfect the eval-campaign surface#14

Merged
drewstone merged 1 commit into
mainfrom
chore/eval-campaign-release
Jun 5, 2026
Merged

chore(release): agent-app 0.2.0 — perfect the eval-campaign surface#14
drewstone merged 1 commit into
mainfrom
chore/eval-campaign-release

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

Cuts the agent-app release that makes the /eval-campaign self-improvement surface consumable by the product agents, on a now-complete substrate.

Why 0.2.0 (not 0.1.x)

Bumps the @tangle-network/agent-eval peer floor >=0.81.0>=0.82.0 — the version where selfImprove forwards the full loop surface (reps, promoteTopK, labeledStore, captureSource, expectUsage, analyzeGeneration, findings) and defaults expectUsage: 'assert'. A peer-floor tightening is breaking for consumers below it, so this is a minor bump in 0.x semver. Only consumers that upgrade are affected — and they upgrade to get this surface.

What this unblocks

With 0.2.0 published, a product agent collapses its entire hand-rolled runImprovementLoop + emitLoopProvenance harness onto a single selfImprove call (plus buildEnsembleJudge for multi-model judges) with zero regression — capture, replicates, the analyst loop, and the fail-loud integrity guard all carry through.

Changed

  • Peer + dev floor on agent-eval → 0.82.0.
  • SKILL.md config table documents the now-complete knob set + the 'assert' fail-loud default.

Verification

  • pnpm typecheck + pnpm build clean against published agent-eval 0.82.0
  • pnpm test — 181 passing
  • This release also carries the previously-merged store/runtime features on main (createAgentRuntime, KVStore, createDatabaseProvider).

Bumps the @tangle-network/agent-eval floor to >=0.82.0 (the version where
selfImprove forwards the full loop surface + defaults expectUsage to 'assert'),
so a product agent collapses its entire loop harness onto one selfImprove call
with no regression. SKILL.md documents the now-complete knob set (reps,
promoteTopK, labeledStore/captureSource, analyzeGeneration) and the fail-loud
default.

Minor bump (0.1.x → 0.2.0): the peer-floor tightening is breaking for consumers
below 0.82.0.

@tangletools tangletools left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cuts agent-app 0.2.0: peer floor → agent-eval 0.82.0 (complete selfImprove surface), skill documents the full knob set + fail-loud default. Typecheck/build/181 tests green against published 0.82.0. Approving the cut.

@drewstone drewstone merged commit d7d2d93 into main Jun 5, 2026
1 check passed
@drewstone drewstone deleted the chore/eval-campaign-release branch June 5, 2026 23:32
drewstone added a commit that referenced this pull request Jun 6, 2026
* feat(skills): the Improve skill family (agentic, self-evolving)

Five skills that encode HOW an agent builds + runs a self-improvement loop for
a product it has never seen — distilled from repairing legal-agent's gepaDriver
loop end-to-end. They sit above the eval-campaign engine (#13/#14): the engine
optimizes; these skills are the judgment that makes the optimization trustworthy.

- eval-architect          measure the REAL deliverable, not a proxy (the
                          empty-string / wrong-channel failure)
- measurement-validation  prove the metric is sound before spending; fail loud
                          on incomplete/unpaired evidence (the fake +47)
- surface-evolution       run the gated loop; promote without offline/online
                          drift; never regress a guarded dimension
- improve-conductor       the user-facing Improve button: calibrated, evidence-
                          gated promotion — trust over lift
- skill-evolution         the meta: each skill is a measured hypothesis (frozen
                          invariants + an evolvable judgment surface optimized by
                          its own meta-eval). The agent-builder north-star: the
                          produced eval yields real held-out lift on the agent it
                          built; the fleet is the training distribution.

Every skill follows a 4-part agentic contract — Invariant (frozen, human-owned) /
Judgment (wide, loop-owned) / Self-test (a checkable result) / Evolves-by — so it
stays adaptive without drifting. Grounded in this session's concrete failures as
worked examples.

* feat(skills): eval-bootstrap — build the apparatus for real at cold start

Closes the hole in the Improve family: the prior skills assumed the measurement
was buildable on request. They didn't answer the two hardest cold-start
questions — WHAT is the right thing to improve (or the agent perfects a proxy),
and WHO builds the apparatus when none exists (the improver must construct it,
not tune thin air). Without these, the improver confidently ships a toy.

- eval-bootstrap: the two-loop architecture (BUILD a validated, externally-
  grounded harness — often via a delegated agent-runtime loop — THEN optimize),
  with the anti-toy / anti-circular invariants: no spend until the target is
  user-confirmed + tied to product value + the gold is grounded in EXTERNAL
  truth (never gold the agent invents and grades itself against) + the harness
  passes measurement-validation (it RUNS, not just compiles). Self-tests:
  "would the user agree with these scores?", the mutation test, the
  non-circularity check.
- improve-conductor: added the cold-start gate — invariant #4 (no optimization
  spend before a confirmed target + validated measurement; dispatch
  eval-bootstrap first) and the explicit two-step framing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants