feat(skills): the Improve skill family (agentic, self-evolving) by drewstone · Pull Request #15 · tangle-network/agent-app

drewstone · 2026-06-06T21:22:08Z

What

Five agent-facing skills (.claude/skills/*, mirroring the existing eval-campaign skill) that encode how an agent builds and runs a self-improvement loop for a product it has never seen — and does it trustworthily. They are the judgment layer above the eval-campaign engine shipped in #13/#14: the engine optimizes; these skills are what keep the optimization from perfecting a fiction.

Distilled directly from repairing legal-agent's gepaDriver loop end-to-end this session — every skill's worked example is a real failure we hit.

| Skill | Holds the line on |
| --- | --- |
| `eval-architect` | Measure the real deliverable, not a proxy (the empty-string / wrong-channel bug) |
| `measurement-validation` | Prove the metric is sound before spending; fail loud on incomplete/unpaired evidence (the fake `+47`) |
| `surface-evolution` | Run the gated loop; promote with no offline/online drift; never regress a guarded dimension |
| `improve-conductor` | The user-facing Improve button: calibrated, evidence-gated promotion; trust over lift |
| `skill-evolution` | The meta: each skill is a measured hypothesis that improves from outcomes; the agent-builder north-star |

Why this shape (agentic, not a rulebook)

Every skill follows a 4-part contract:

Invariant — the 1–2 laws that, if violated, make the loop a slot machine. Human-owned, frozen.
Judgment — what the agent figures out for this product. Loop-owned, wide. The agentic surface.
Self-test — a checkable result the agent ran (mutation test, CI < delta, diff-the-deployed-surface), not "I followed the steps."
Evolves-by — the outcome data that updates the judgment surface (never the invariants).

Few frozen invariants hold the line; judgment is broad and loop-owned; outcomes are measured; the judgment surface self-revises. That split is how a skill stays adaptive to an unforeseen product without drifting into either a brittle checklist or unaccountable vibes.
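
As a rough illustration of that split, the contract could be modelled as a type in which invariants are read-only and only the judgment surface is replaced when outcomes come in. The names `SkillContract` and `evolve` are hypothetical; the PR defines the contract in skill prose, not in code.

```typescript
// Hypothetical shape for illustration only; not an API from this PR.
interface SkillContract {
  readonly name: string;
  /** Invariant: frozen, human-owned laws. The loop never rewrites these. */
  readonly invariants: readonly string[];
  /** Judgment: the loop-owned surface, the only part that may evolve. */
  readonly judgment: readonly string[];
  /** Self-test: returns a checkable result, not a claim of compliance. */
  readonly selfTest: () => { passed: boolean; evidence: string };
}

// Evolves-by: outcome data produces a revised judgment surface.
// Invariants are carried over untouched and the input is not mutated.
function evolve(skill: SkillContract, revisedJudgment: readonly string[]): SkillContract {
  return { ...skill, judgment: [...revisedJudgment] };
}
```

Making `invariants` read-only at the type level is one way to make "never the invariants" a compile-time property rather than a convention.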

The recursion / north-star

skill-evolution points the same loop the skills describe at the skills themselves: a skill's judgment surface is optimized by the verifiable reward "did following this produce an eval that yielded real held-out lift, no critical regression?" — which is exactly the agent-builder north-star: the produced eval must yield real held-out lift on the agent it built. The fleet (legal/tax/gtm/creative/insurance) is the training distribution; legal-agent's repaired loop is dogfood data point #1.
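
The verifiable reward quoted above can be read as a binary check. The sketch below is one possible reading, not the skill's actual meta-eval: the `EvalOutcome` fields are hypothetical, and treating "real" lift as "the confidence interval's lower bound is above zero" is an assumption made here for illustration.

```typescript
// Hypothetical outcome record for one agent built by following a skill.
interface EvalOutcome {
  heldOutLift: number;            // paired held-out lift of the built agent
  liftCiLow: number;              // lower bound of the lift's confidence interval
  criticalRegressions: string[];  // guarded dimensions that got worse
}

// Reward is 1 only when the lift is positive, its interval excludes zero
// (an assumed reading of "real"), and no guarded dimension regressed.
function skillReward(outcome: EvalOutcome): 0 | 1 {
  const realLift = outcome.heldOutLift > 0 && outcome.liftCiLow > 0;
  const noCriticalRegression = outcome.criticalRegressions.length === 0;
  return realLift && noCriticalRegression ? 1 : 0;
}
```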

Worked failures baked in as examples

empty-string scoring after a deliverable moved channels (→ eval-architect)
~6 rounds chasing ±0.15 noise as signal (→ measurement-validation)
a reported heldOutLift=+47 that was two different personas because 2 of 4 holdout cells errored (→ measurement-validation, and the consumer-side guard now landing in legal-agent #155)
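
The third failure is the kind a fail-loud pairing check is meant to catch. A minimal sketch, assuming each holdout cell is keyed by persona and carries either a score or an error: `CellResult` and `pairedHeldOutLift` are hypothetical names and do not reproduce the guard landing in legal-agent #155.

```typescript
// Hypothetical cell shape: score is null when the cell errored.
interface CellResult {
  persona: string;
  score: number | null;
}

// Mean per-persona delta (candidate - baseline). Throws instead of silently
// averaging over whatever cells happened to survive.
function pairedHeldOutLift(baseline: CellResult[], candidate: CellResult[]): number {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error(`unpaired evidence: ${baseline.length} baseline vs ${candidate.length} candidate cells`);
  }
  const baselineByPersona = new Map<string, CellResult>();
  for (const cell of baseline) baselineByPersona.set(cell.persona, cell);

  let total = 0;
  for (const cand of candidate) {
    const base = baselineByPersona.get(cand.persona);
    if (base === undefined) {
      throw new Error(`unpaired evidence: no baseline cell for persona "${cand.persona}"`);
    }
    if (base.score === null || cand.score === null) {
      throw new Error(`incomplete evidence: cell for persona "${cand.persona}" errored`);
    }
    total += cand.score - base.score;
  }
  return total / candidate.length;
}
```

With this shape, a run where 2 of 4 holdout cells errored raises an error rather than reporting a lift computed across mismatched personas.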

Follow-up (not in this PR)

An @tangle-network/agent-app/improve module that wires these skills to a typed defineImproveTarget + a scaffold_eval app-tool + budget-bounded runImprove, mirroring the knowledge-loop declarative→running mapper. The skills describe the contract; the module would codify the seam.

Five skills that encode HOW an agent builds + runs a self-improvement loop for a product it has never seen, distilled from repairing legal-agent's gepaDriver loop end-to-end. They sit above the eval-campaign engine (#13/#14): the engine optimizes; these skills are the judgment that makes the optimization trustworthy.

- eval-architect: measure the REAL deliverable, not a proxy (the empty-string / wrong-channel failure)
- measurement-validation: prove the metric is sound before spending; fail loud on incomplete/unpaired evidence (the fake +47)
- surface-evolution: run the gated loop; promote without offline/online drift; never regress a guarded dimension
- improve-conductor: the user-facing Improve button: calibrated, evidence-gated promotion; trust over lift
- skill-evolution: the meta: each skill is a measured hypothesis (frozen invariants + an evolvable judgment surface optimized by its own meta-eval). The agent-builder north-star: the produced eval yields real held-out lift on the agent it built; the fleet is the training distribution.

Every skill follows a 4-part agentic contract: Invariant (frozen, human-owned) / Judgment (wide, loop-owned) / Self-test (a checkable result) / Evolves-by. That keeps it adaptive without drifting. Grounded in this session's concrete failures as worked examples.

tangletools · 2026-06-06T21:30:50Z

✅ No Blockers — `cd15036b`

Readiness 89/100 · Confidence 65/100 · 2 findings (2 low)

deepseek: Correctness 89 · Security 89 · Testing 89 · Architecture 89

Full multi-shot audit completed 1/1 planned shots over 5 changed files. Global verifier still owns final merge decision.

🟡 LOW Paired-n floor of ≥3 may be too permissive for noisy metrics — .claude/skills/measurement-validation/SKILL.md

Line 15 sets paired-n floor at ≥3. With n=3, a paired t-test has df=2, requiring very large effects to reach significance. The Judgment section (line 25: 'Noisy targets need 5+') partially addresses this, but the Invariant floor of 3 could be met while the Judgment section's 5+ threshold is unmet — creating ambiguity about which constraint binds. Clarify whether the floor is a hard minimum (invariant) and the 5+ is a soft guideline (judgment), or unify them.
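
To make the df=2 concern concrete, the sketch below computes a paired t statistic and the smallest mean lift, in units of the standard deviation of the per-pair deltas, that reaches two-sided significance at the 5% level. The critical values are standard Student's t table entries; the function names are hypothetical and not part of the skill.

```typescript
// Two-sided 5% critical values of Student's t, keyed by degrees of freedom.
const T_CRIT_05: Record<number, number> = { 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262 };

// Paired t statistic over per-pair deltas (candidate - baseline).
function pairedT(deltas: number[]): { t: number; df: number } {
  const n = deltas.length;
  if (n < 2) throw new Error("need at least 2 pairs");
  const mean = deltas.reduce((a, d) => a + d, 0) / n;
  const variance = deltas.reduce((a, d) => a + (d - mean) * (d - mean), 0) / (n - 1);
  return { t: mean / Math.sqrt(variance / n), df: n - 1 };
}

// Smallest |mean delta| / sd that clears the critical value with n pairs.
function minDetectableEffect(n: number): number {
  const crit = T_CRIT_05[n - 1];
  if (crit === undefined) throw new Error(`no table entry for df=${n - 1}`);
  return crit / Math.sqrt(n);
}
```

Under these assumptions, n=3 needs a mean lift of roughly 2.5 standard deviations, while n=5 needs about 1.24. For example, deltas of `[1, 2, 3]` give t ≈ 3.46 at df=2, which is below the 4.303 threshold even though every pair improved.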

🟡 LOW Bootstrapping gap in meta-eval recursion not addressed — .claude/skills/skill-evolution/SKILL.md

Line 35 states the first dogfood data point exists from the legal-agent session that produced these skills. But the meta-eval that judges whether 'skills built this way yield real held-out lift' is itself unvalidated — the validator is validating itself. This is inherent to bootstrapped self-referential systems and is acknowledged honestly. Not actionable now, but the skill-evolution file should eventually describe the bootstrapping gate: at what evidence threshold does the meta-eval itself graduate from 'experimental' to 'production'.

_{tangletools · 2026-06-06T21:30:48Z · trace}

tangletools

✅ Approved — 2 non-blocking findings — `cd15036b`

Full multi-shot audit completed 1/1 planned shots over 5 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-06T21:30:48Z · immutable trace}

…tart

Closes the hole in the Improve family: the prior skills assumed the measurement was buildable on request. They didn't answer the two hardest cold-start questions: WHAT is the right thing to improve (or the agent perfects a proxy), and WHO builds the apparatus when none exists (the improver must construct it, not tune thin air). Without these, the improver confidently ships a toy.

- eval-bootstrap: the two-loop architecture (BUILD a validated, externally-grounded harness, often via a delegated agent-runtime loop, THEN optimize), with the anti-toy / anti-circular invariants: no spend until the target is user-confirmed + tied to product value + the gold is grounded in EXTERNAL truth (never gold the agent invents and grades itself against) + the harness passes measurement-validation (it RUNS, not just compiles). Self-tests: "would the user agree with these scores?", the mutation test, the non-circularity check.
- improve-conductor: added the cold-start gate, invariant #4 (no optimization spend before a confirmed target + validated measurement; dispatch eval-bootstrap first), and the explicit two-step framing.
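
The cold-start invariants above read naturally as a gate that lists what still blocks spending. A minimal sketch with hypothetical names (`ColdStartState`, `coldStartBlockers`, `canSpend`); the skills express this in prose, and nothing here is an API from the repository.

```typescript
// Hypothetical snapshot of the cold-start preconditions.
interface ColdStartState {
  targetUserConfirmed: boolean;     // the user agreed this is the thing to improve
  tiedToProductValue: boolean;      // the target maps to real product value
  goldSource: "external" | "agent-invented";
  harnessRanAndValidated: boolean;  // it RAN and passed measurement-validation
}

// Returns every unmet precondition so the caller can report all of them at once.
function coldStartBlockers(state: ColdStartState): string[] {
  const blockers: string[] = [];
  if (!state.targetUserConfirmed) blockers.push("target not user-confirmed");
  if (!state.tiedToProductValue) blockers.push("target not tied to product value");
  if (state.goldSource !== "external") blockers.push("gold is not grounded in external truth");
  if (!state.harnessRanAndValidated) blockers.push("harness has not passed measurement-validation");
  return blockers;
}

// No optimization spend until the list is empty; otherwise bootstrap first.
function canSpend(state: ColdStartState): boolean {
  return coldStartBlockers(state).length === 0;
}
```

Returning the full list of blockers, rather than a bare boolean, gives the conductor something concrete to hand to eval-bootstrap when the gate is closed.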

tangletools

✅ Refreshed approval after new commits — `c0033867`

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-06T21:31:39Z}

tangletools · 2026-06-06T21:35:59Z

✅ No Blockers — `c0033867`

Readiness 89/100 · Confidence 65/100 · 3 findings (3 low)

| | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 92 | 89 | 89 |
| Confidence | 65 | 65 | 65 |
| Correctness | 92 | 89 | 89 |
| Security | 92 | 89 | 89 |
| Testing | 92 | 89 | 89 |
| Architecture | 92 | 89 | 89 |

Full multi-shot audit completed 1/1 planned shots over 6 changed files. Global verifier still owns final merge decision.

🟡 LOW eval-bootstrap references knowledge-loop subpath without version constraint — .claude/skills/eval-bootstrap/SKILL.md

Line 24 references @tangle-network/agent-app/knowledge-loop's source-grounded acquisition. The subpath exists and is valid in this tree, but unlike eval-campaign (which documents a peer-dep floor agent-eval >= 0.81.0), eval-bootstrap doesn't state whether any minimum version is required. Low risk since the reference is descriptive (skill prose), not importable code.

🟡 LOW skill-evolution enumerates governed skills but omits eval-bootstrap — .claude/skills/skill-evolution/SKILL.md

Line 10: 'It governs eval-architect, measurement-validation, surface-evolution, and improve-conductor' — but eval-bootstrap is also a member of the Improve family that follows the 4-part contract and is cross-referenced by improve-conductor. The list should include it for completeness, or be rewritten as a non-exhaustive reference. No functional impact (skill-loading doesn't depend on this), but it's an internal consistency gap.

🟡 LOW Documentation: runImprovementLoop not actually re-exported — .claude/skills/surface-evolution/SKILL.md

Line 10 states runImprovementLoop is among the symbols re-exported via @tangle-network/agent-app/eval-campaign. src/eval-campaign/index.ts:119-125 re-exports runCampaign (not runImprovementLoop). The eval-campaign module deliberately avoids re-exporting runImprovementLoop (line 8 comment: 'A product should NOT hand-roll runImprovementLoop'). Fix: replace runImprovementLoop with runCampaign in the parenthetical list.

_{tangletools · 2026-06-06T21:35:57Z · trace}

tangletools approved these changes Jun 6, 2026

drewstone merged commit ce246e3 into main Jun 6, 2026
1 check passed

drewstone deleted the feat/improve-skill-family branch June 6, 2026 21:41

drewstone merged 2 commits into main from feat/improve-skill-family
