Skip to content

feat(eval-campaign): shared self-improvement scaffold + integration skill#13

Merged
drewstone merged 1 commit into
mainfrom
feat/eval-campaign
Jun 5, 2026
Merged

feat(eval-campaign): shared self-improvement scaffold + integration skill#13
drewstone merged 1 commit into
mainfrom
feat/eval-campaign

Conversation

@drewstone
Copy link
Copy Markdown
Contributor

A curated app-shell surface (@tangle-network/agent-app/eval-campaign) so any product agent wires its measure → optimize → gate → ship loop from one import, instead of each hand-rolling the loop harness around the substrate.

What the substrate already owns (and this does NOT rebuild)

selfImprove (@tangle-network/agent-eval/contract) already owns the whole cycle: train/holdout split, the GEPA driver, the held-out production gate, durable provenance + hosted ingest, every default. The duplication across product agents is the harness wrapped around it — the identical runImprovementLoop({...}) + emitLoopProvenance({...}) blocks and a re-implemented judge-ensemble reducer.

What this adds

  • buildEnsembleJudge — turns a per-rubric scoreOne into a JudgeConfig that fans out N uncorrelated judge calls and reduces them via the substrate's aggregateJudgeVerdicts (agent-eval 0.81.0): survivor-mean per dimension, inter-rater disagreement spread, cost sum. Fail-loud — one judge failing is recorded and dropped (never a zero); ALL failing throws → a failed cell, never a silent zero.
  • Curated re-exportsselfImprove, defaultProductionGate, paretoSignificanceGate, gepaDriver, evolutionaryDriver, runCampaign, and the types, so a product imports from one module rather than three agent-eval subpaths.
  • SKILL.md (auto-discovered as /eval-campaign) — the Flue-style integration breadcrumb: the three things a product owns (scenarios, agent, judge), a copy-paste minimal wiring, every knob + default, the layering rule, the fail-loud contract, and the anti-patterns it deletes.

Layering

The scaffold lives in the consumer (agent-app) and composes the substrate downward — no upward import. aggregateJudgeVerdicts/selfImprove/gates/drivers all come from agent-eval.

Note for review

Peer floor on @tangle-network/agent-eval moves >=0.50.0>=0.81.0 (the version exposing aggregateJudgeVerdicts). Only consumers that upgrade to this agent-app are affected — and they upgrade precisely to get this surface. Flagging in case you'd prefer a narrower floor.

Verification

  • pnpm typecheck clean (against published agent-eval 0.81.0)
  • pnpm build clean — dist/eval-campaign/{index.js,index.d.ts} emitted; new ./eval-campaign export wired in package.json + tsup
  • pnpm test181 passing (full suite + 5 new buildEnsembleJudge cases: fan-out, JudgeScore shape, one-rep-fail-survives, all-fail-throws, empty-guards)

Follow-up (separate PRs, after this publishes)

  • Product agents collapse their hand-rolled loop harness onto selfImprove + buildEnsembleJudge.

…kill

A curated app-shell surface so any product agent wires its eval/self-improve
loop from one import instead of hand-rolling the loop harness.

- buildEnsembleJudge: turns a per-rubric scoreOne into a JudgeConfig that fans
  out N uncorrelated judge calls and reduces them via the substrate's
  aggregateJudgeVerdicts (survivor-mean, inter-rater spread, fail-loud on
  all-failed — a single judge failing never fails the cell, all failing throws
  a failed cell rather than a silent zero).
- Re-exports selfImprove + the gates + drivers + types, so a product does not
  reach across three agent-eval subpaths.
- SKILL.md (auto-discovered as /eval-campaign): the integration contract — the
  three things a product owns (scenarios, agent, judge), every knob + default,
  the layering rule, the fail-loud contract, and the boilerplate it deletes.

selfImprove already owns the loop/split/gate/provenance; this surface adds only
the judge constructor + curation. Peer floor on @tangle-network/agent-eval
moves to >=0.81.0 (the version exposing aggregateJudgeVerdicts).

typecheck + build green; 5 new tests; full suite 181 passing.
Copy link
Copy Markdown

@tangletools tangletools left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thin app-shell façade over the substrate (selfImprove + aggregateJudgeVerdicts); buildEnsembleJudge is fail-loud (one judge fails → dropped, all fail → failed cell not zero). SKILL.md is the integration contract. 181 tests green, typechecks against published 0.81.0. Approving.

@drewstone drewstone merged commit f781c07 into main Jun 5, 2026
1 check passed
@drewstone drewstone deleted the feat/eval-campaign branch June 5, 2026 23:03
drewstone added a commit that referenced this pull request Jun 6, 2026
* feat(skills): the Improve skill family (agentic, self-evolving)

Five skills that encode HOW an agent builds + runs a self-improvement loop for
a product it has never seen — distilled from repairing legal-agent's gepaDriver
loop end-to-end. They sit above the eval-campaign engine (#13/#14): the engine
optimizes; these skills are the judgment that makes the optimization trustworthy.

- eval-architect          measure the REAL deliverable, not a proxy (the
                          empty-string / wrong-channel failure)
- measurement-validation  prove the metric is sound before spending; fail loud
                          on incomplete/unpaired evidence (the fake +47)
- surface-evolution       run the gated loop; promote without offline/online
                          drift; never regress a guarded dimension
- improve-conductor       the user-facing Improve button: calibrated, evidence-
                          gated promotion — trust over lift
- skill-evolution         the meta: each skill is a measured hypothesis (frozen
                          invariants + an evolvable judgment surface optimized by
                          its own meta-eval). The agent-builder north-star: the
                          produced eval yields real held-out lift on the agent it
                          built; the fleet is the training distribution.

Every skill follows a 4-part agentic contract — Invariant (frozen, human-owned) /
Judgment (wide, loop-owned) / Self-test (a checkable result) / Evolves-by — so it
stays adaptive without drifting. Grounded in this session's concrete failures as
worked examples.

* feat(skills): eval-bootstrap — build the apparatus for real at cold start

Closes the hole in the Improve family: the prior skills assumed the measurement
was buildable on request. They didn't answer the two hardest cold-start
questions — WHAT is the right thing to improve (or the agent perfects a proxy),
and WHO builds the apparatus when none exists (the improver must construct it,
not tune thin air). Without these, the improver confidently ships a toy.

- eval-bootstrap: the two-loop architecture (BUILD a validated, externally-
  grounded harness — often via a delegated agent-runtime loop — THEN optimize),
  with the anti-toy / anti-circular invariants: no spend until the target is
  user-confirmed + tied to product value + the gold is grounded in EXTERNAL
  truth (never gold the agent invents and grades itself against) + the harness
  passes measurement-validation (it RUNS, not just compiles). Self-tests:
  "would the user agree with these scores?", the mutation test, the
  non-circularity check.
- improve-conductor: added the cold-start gate — invariant #4 (no optimization
  spend before a confirmed target + validated measurement; dispatch
  eval-bootstrap first) and the explicit two-step framing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants