
feat(skills): the Improve skill family (agentic, self-evolving)#15

Merged
drewstone merged 2 commits into main from feat/improve-skill-family on Jun 6, 2026

@drewstone
Contributor

What

Five agent-facing skills (.claude/skills/*, mirroring the existing eval-campaign skill) that encode how an agent builds and runs a self-improvement loop for a product it has never seen — and does it trustworthily. They are the judgment layer above the eval-campaign engine shipped in #13/#14: the engine optimizes; these skills are what keep the optimization from perfecting a fiction.

Distilled directly from repairing legal-agent's gepaDriver loop end-to-end this session — every skill's worked example is a real failure we hit.

| Skill | Holds the line on |
|---|---|
| eval-architect | Measure the real deliverable, not a proxy (the empty-string / wrong-channel bug) |
| measurement-validation | Prove the metric is sound before spending; fail loud on incomplete/unpaired evidence (the fake +47) |
| surface-evolution | Run the gated loop; promote with no offline/online drift; never regress a guarded dimension |
| improve-conductor | The user-facing Improve button: calibrated, evidence-gated promotion (trust over lift) |
| skill-evolution | The meta: each skill is a measured hypothesis that improves from outcomes; the agent-builder north-star |
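
As an illustration of the surface-evolution row, a promotion gate might look like the sketch below. It is not the shipped skill or engine code: the `Candidate` shape, `promotionBlockers`, and the `minLift` threshold are hypothetical names introduced here, and "drift" is reduced to a plain string comparison between the surface that was evaluated and the one that would deploy.

```typescript
// Hypothetical shape: none of these names come from the PR or from agent-app.
interface Candidate {
  evaluatedSurface: string; // exact prompt/config text the eval scored
  deployedSurface: string;  // what would actually ship
  heldOutLift: number;      // lift on the target metric, held-out split
  guarded: Record<string, { baseline: number; candidate: number }>;
}

// Returns every reason promotion must be refused; an empty list means "promote".
function promotionBlockers(c: Candidate, minLift: number): string[] {
  const blockers: string[] = [];
  if (c.evaluatedSurface !== c.deployedSurface) {
    blockers.push("offline/online drift: deployed surface differs from the evaluated one");
  }
  if (c.heldOutLift < minLift) {
    blockers.push(`held-out lift ${c.heldOutLift} is below the required ${minLift}`);
  }
  for (const dim of Object.keys(c.guarded)) {
    const g = c.guarded[dim];
    if (g.candidate < g.baseline) {
      blockers.push(`guarded dimension regressed: ${dim}`);
    }
  }
  return blockers;
}

// Illustrative values only: lift clears the bar, but a guarded dimension dropped.
const blockers = promotionBlockers(
  {
    evaluatedSurface: "prompt-v2",
    deployedSurface: "prompt-v2",
    heldOutLift: 0.4,
    guarded: { citationAccuracy: { baseline: 0.9, candidate: 0.8 } },
  },
  0.2,
);
```

Returning the full list of blockers, rather than a boolean, keeps the refusal explainable to whoever pressed the Improve button.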

Why this shape (agentic, not a rulebook)

Every skill follows a 4-part contract:

  • Invariant — the 1–2 laws that, if violated, make the loop a slot machine. Human-owned, frozen.
  • Judgment — what the agent figures out for this product. Loop-owned, wide. The agentic surface.
  • Self-test — a checkable result the agent ran (mutation test, CI < delta, diff-the-deployed-surface), not "I followed the steps."
  • Evolves-by — the outcome data that updates the judgment surface (never the invariants).

Few frozen invariants hold the line; judgment is broad and loop-owned; outcomes are measured; the judgment surface self-revises. That split is how a skill stays adaptive to an unforeseen product without drifting into either a brittle checklist or unaccountable vibes.
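
One way to picture the four-part contract is as a type. The sketch below is an assumption, not code from this PR (the skills are prose files): `SkillContract`, `SelfTestResult`, `step`, and the example judgment shape are all hypothetical. The example instance borrows the paired-n floor of 3 and the "noisy targets need 5+" guidance quoted in the review of measurement-validation, purely for illustration.

```typescript
// Hypothetical types: the PR describes the contract in prose only.
interface SelfTestResult {
  name: string;     // e.g. "mutation test", "CI < delta"
  passed: boolean;
  evidence: string; // what was actually run and observed
}

interface SkillContract<J, O> {
  /** Frozen, human-owned laws; readonly so the loop cannot rewrite them. */
  readonly invariants: readonly string[];
  /** Loop-owned surface the agent adapts per product. */
  judgment: J;
  /** Returns a checked result, not a claim of compliance. */
  selfTest(judgment: J): SelfTestResult;
  /** Revises the judgment surface from outcomes; never touches invariants. */
  evolve(judgment: J, outcomes: readonly O[]): J;
}

// One evolution step: adopt the revised judgment only if it passes its self-test.
function step<J, O>(skill: SkillContract<J, O>, outcomes: readonly O[]): SkillContract<J, O> {
  const proposed = skill.evolve(skill.judgment, outcomes);
  const check = skill.selfTest(proposed);
  return check.passed ? { ...skill, judgment: proposed } : skill;
}

// Illustrative instance only.
interface ValidationJudgment { pairedN: number }
interface RunOutcome { noisy: boolean }

const measurementValidation: SkillContract<ValidationJudgment, RunOutcome> = {
  invariants: ["paired n >= 3", "fail loud on incomplete or unpaired evidence"],
  judgment: { pairedN: 3 },
  selfTest: (j) => ({
    name: "paired-n floor",
    passed: j.pairedN >= 3,
    evidence: `pairedN=${j.pairedN}`,
  }),
  evolve: (j, outcomes) =>
    outcomes.some((o) => o.noisy) ? { pairedN: Math.max(j.pairedN, 5) } : j,
};

const revised = step(measurementValidation, [{ noisy: true }]);
```

The point of the shape is that `evolve` has no write path to `invariants`: outcome data can only move the judgment surface.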

The recursion / north-star

skill-evolution points the same loop the skills describe at the skills themselves: a skill's judgment surface is optimized by the verifiable reward "did following this produce an eval that yielded real held-out lift, no critical regression?" — which is exactly the agent-builder north-star: the produced eval must yield real held-out lift on the agent it built. The fleet (legal/tax/gtm/creative/insurance) is the training distribution; legal-agent's repaired loop is dogfood data point #1.

Worked failures baked in as examples

  • empty-string scoring after a deliverable moved channels (→ eval-architect)
  • ~6 rounds chasing ±0.15 noise as signal (→ measurement-validation)
  • a reported heldOutLift=+47 that was two different personas because 2 of 4 holdout cells errored (→ measurement-validation, and the consumer-side guard now landing in legal-agent #155)
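
The fake +47 is the kind of failure a pairing guard catches. The sketch below is a hypothetical stand-in, not the guard landing in legal-agent #155: `Scores`, `pairedHeldOutLift`, and the `minPairs` default are assumptions, and the numbers in the usage are invented for illustration. It refuses to report a lift unless every holdout cell has a score in both arms.

```typescript
// Hypothetical: cell id -> score, with null marking a cell that errored.
type Scores = Record<string, number | null>;

// Mean of per-cell (candidate - baseline) deltas; throws instead of
// averaging over whichever cells happened to survive in each arm.
function pairedHeldOutLift(baseline: Scores, candidate: Scores, minPairs = 3): number {
  const ids = Object.keys(baseline).concat(
    Object.keys(candidate).filter((id) => !(id in baseline)),
  );
  const deltas: number[] = [];
  const unpaired: string[] = [];
  for (const id of ids) {
    const b = baseline[id];
    const c = candidate[id];
    if (typeof b === "number" && typeof c === "number") {
      deltas.push(c - b);
    } else {
      unpaired.push(id);
    }
  }
  if (unpaired.length > 0) {
    throw new Error(`incomplete evidence: unpaired holdout cells ${unpaired.join(", ")}`);
  }
  if (deltas.length < minPairs) {
    throw new Error(`only ${deltas.length} paired cells, need at least ${minPairs}`);
  }
  return deltas.reduce((sum, d) => sum + d, 0) / deltas.length;
}

// Invented numbers: each arm lost two different cells, so the naive
// per-arm means would compare two disjoint sets of personas.
let refused = false;
try {
  pairedHeldOutLift(
    { p1: 10, p2: 12, p3: null, p4: null },
    { p1: null, p2: null, p3: 58, p4: 60 },
  );
} catch {
  refused = true;
}
```

Throwing rather than returning a partial number is deliberate: a lift computed over mismatched cells looks like signal and gets promoted.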

Follow-up (not in this PR)

An @tangle-network/agent-app/improve module that wires these skills to a typed defineImproveTarget + a scaffold_eval app-tool + budget-bounded runImprove, mirroring the knowledge-loop declarative→running mapper. The skills describe the contract; the module would codify the seam.

Five skills that encode HOW an agent builds + runs a self-improvement loop for
a product it has never seen — distilled from repairing legal-agent's gepaDriver
loop end-to-end. They sit above the eval-campaign engine (#13/#14): the engine
optimizes; these skills are the judgment that makes the optimization trustworthy.

- eval-architect          measure the REAL deliverable, not a proxy (the
                          empty-string / wrong-channel failure)
- measurement-validation  prove the metric is sound before spending; fail loud
                          on incomplete/unpaired evidence (the fake +47)
- surface-evolution       run the gated loop; promote without offline/online
                          drift; never regress a guarded dimension
- improve-conductor       the user-facing Improve button: calibrated, evidence-
                          gated promotion — trust over lift
- skill-evolution         the meta: each skill is a measured hypothesis (frozen
                          invariants + an evolvable judgment surface optimized by
                          its own meta-eval). The agent-builder north-star: the
                          produced eval yields real held-out lift on the agent it
                          built; the fleet is the training distribution.

Every skill follows a 4-part agentic contract — Invariant (frozen, human-owned) /
Judgment (wide, loop-owned) / Self-test (a checkable result) / Evolves-by — so it
stays adaptive without drifting. Grounded in this session's concrete failures as
worked examples.
@tangletools

✅ No Blockers — cd15036b

Readiness 89/100 · Confidence 65/100 · 2 findings (2 low)

deepseek: Correctness 89 · Security 89 · Testing 89 · Architecture 89

Full multi-shot audit completed 1/1 planned shots over 5 changed files. Global verifier still owns final merge decision.

🟡 LOW Paired-n floor of ≥3 may be too permissive for noisy metrics — .claude/skills/measurement-validation/SKILL.md

Line 15 sets paired-n floor at ≥3. With n=3, a paired t-test has df=2, requiring very large effects to reach significance. The Judgment section (line 25: 'Noisy targets need 5+') partially addresses this, but the Invariant floor of 3 could be met while the Judgment section's 5+ threshold is unmet — creating ambiguity about which constraint binds. Clarify whether the floor is a hard minimum (invariant) and the 5+ is a soft guideline (judgment), or unify them.
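
To make the df=2 concern concrete, here is a small paired t-test sketch. It is independent of the skill files: `pairedT`, `significantAt05`, and the sample deltas are illustrative assumptions, and the critical values are the standard two-sided 5% Student's t entries for those degrees of freedom.

```typescript
// Two-sided 5% critical values of Student's t, keyed by degrees of freedom.
const T_CRIT_05: Record<number, number> = { 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447 };

// Paired t statistic over per-cell deltas (candidate - baseline).
function pairedT(deltas: readonly number[]): { t: number; df: number } {
  const n = deltas.length;
  if (n < 2) throw new Error("need at least 2 paired deltas");
  const mean = deltas.reduce((s, d) => s + d, 0) / n;
  const variance = deltas.reduce((s, d) => s + (d - mean) * (d - mean), 0) / (n - 1);
  if (variance === 0) throw new Error("zero variance: t statistic is undefined");
  return { t: mean / Math.sqrt(variance / n), df: n - 1 };
}

function significantAt05(deltas: readonly number[]): boolean {
  const { t, df } = pairedT(deltas);
  const crit = T_CRIT_05[df];
  if (crit === undefined) throw new Error(`no critical value tabulated for df=${df}`);
  return Math.abs(t) > crit;
}

// Same mean delta (0.2) and same sample standard deviation (0.1) in both cases.
const atFloor = significantAt05([0.1, 0.2, 0.3]);            // n=3, df=2, t ≈ 3.46
const atFive = significantAt05([0.1, 0.2, 0.3, 0.1, 0.3]);   // n=5, df=4, t ≈ 4.47
```

With these deltas, n=3 falls short of the 4.303 threshold while n=5 clears 2.776, which is the gap between the invariant floor and the "5+" judgment guidance that the finding asks the skill to reconcile.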

🟡 LOW Bootstrapping gap in meta-eval recursion not addressed — .claude/skills/skill-evolution/SKILL.md

Line 35 states the first dogfood data point exists from the legal-agent session that produced these skills. But the meta-eval that judges whether 'skills built this way yield real held-out lift' is itself unvalidated — the validator is validating itself. This is inherent to bootstrapped self-referential systems and is acknowledged honestly. Not actionable now, but the skill-evolution file should eventually describe the bootstrapping gate: at what evidence threshold does the meta-eval itself graduate from 'experimental' to 'production'.


tangletools · 2026-06-06T21:30:48Z · trace


@tangletools tangletools left a comment


✅ Approved — 2 non-blocking findings — cd15036b

Full multi-shot audit completed 1/1 planned shots over 5 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-06T21:30:48Z · immutable trace

…tart

Closes the hole in the Improve family: the prior skills assumed the measurement
was buildable on request. They didn't answer the two hardest cold-start
questions — WHAT is the right thing to improve (or the agent perfects a proxy),
and WHO builds the apparatus when none exists (the improver must construct it,
not tune thin air). Without these, the improver confidently ships a toy.

- eval-bootstrap: the two-loop architecture (BUILD a validated, externally-
  grounded harness — often via a delegated agent-runtime loop — THEN optimize),
  with the anti-toy / anti-circular invariants: no spend until the target is
  user-confirmed + tied to product value + the gold is grounded in EXTERNAL
  truth (never gold the agent invents and grades itself against) + the harness
  passes measurement-validation (it RUNS, not just compiles). Self-tests:
  "would the user agree with these scores?", the mutation test, the
  non-circularity check.
- improve-conductor: added the cold-start gate — invariant #4 (no optimization
  spend before a confirmed target + validated measurement; dispatch
  eval-bootstrap first) and the explicit two-step framing.
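
The cold-start gate described above can be sketched as a pure check that runs before any optimization spend. This is an assumption about shape only: `ColdStartEvidence`, `coldStartBlockers`, and `nextAction` are hypothetical names, not part of improve-conductor or any agent-app module.

```typescript
// Hypothetical evidence record; field names are illustrative.
interface ColdStartEvidence {
  targetConfirmedByUser: boolean;
  tiedToProductValue: boolean;
  goldSource: "external" | "agent-generated";
  harnessRanAndPassedValidation: boolean; // it ran, not merely compiled
}

// Lists every unmet precondition for spending on optimization.
function coldStartBlockers(e: ColdStartEvidence): string[] {
  const blockers: string[] = [];
  if (!e.targetConfirmedByUser) blockers.push("target not confirmed by the user");
  if (!e.tiedToProductValue) blockers.push("target not tied to product value");
  if (e.goldSource !== "external") blockers.push("gold is circular: the agent would grade itself");
  if (!e.harnessRanAndPassedValidation) blockers.push("harness has not passed measurement-validation");
  return blockers;
}

// Two-step framing: build the measurement first, optimize only once it is sound.
function nextAction(e: ColdStartEvidence): "dispatch eval-bootstrap" | "optimize" {
  return coldStartBlockers(e).length === 0 ? "optimize" : "dispatch eval-bootstrap";
}

const action = nextAction({
  targetConfirmedByUser: true,
  tiedToProductValue: true,
  goldSource: "agent-generated",
  harnessRanAndPassedValidation: true,
});
```

In this example everything is in place except the gold, so the gate still routes to eval-bootstrap instead of optimizing against self-invented answers.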

@tangletools tangletools left a comment


✅ Refreshed approval after new commits — c0033867

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-06T21:31:39Z

@tangletools

✅ No Blockers — c0033867

Readiness 89/100 · Confidence 65/100 · 3 findings (3 low)

| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 92 | 89 | 89 |
| Confidence | 65 | 65 | 65 |
| Correctness | 92 | 89 | 89 |
| Security | 92 | 89 | 89 |
| Testing | 92 | 89 | 89 |
| Architecture | 92 | 89 | 89 |

Full multi-shot audit completed 1/1 planned shots over 6 changed files. Global verifier still owns final merge decision.

🟡 LOW eval-bootstrap references knowledge-loop subpath without version constraint — .claude/skills/eval-bootstrap/SKILL.md

Line 24 references @tangle-network/agent-app/knowledge-loop's source-grounded acquisition. The subpath exists and is valid in this tree, but unlike eval-campaign (which documents a peer-dep floor agent-eval >= 0.81.0), eval-bootstrap doesn't state whether any minimum version is required. Low risk since the reference is descriptive (skill prose), not importable code.

🟡 LOW skill-evolution enumerates governed skills but omits eval-bootstrap — .claude/skills/skill-evolution/SKILL.md

Line 10: 'It governs eval-architect, measurement-validation, surface-evolution, and improve-conductor' — but eval-bootstrap is also a member of the Improve family that follows the 4-part contract and is cross-referenced by improve-conductor. The list should include it for completeness, or be rewritten as a non-exhaustive reference. No functional impact (skill-loading doesn't depend on this), but it's an internal consistency gap.

🟡 LOW Documentation: runImprovementLoop not actually re-exported — .claude/skills/surface-evolution/SKILL.md

Line 10 states runImprovementLoop is among the symbols re-exported via @tangle-network/agent-app/eval-campaign. src/eval-campaign/index.ts:119-125 re-exports runCampaign (not runImprovementLoop). The eval-campaign module deliberately avoids re-exporting runImprovementLoop (line 8 comment: 'A product should NOT hand-roll runImprovementLoop'). Fix: replace runImprovementLoop with runCampaign in the parenthetical list.


tangletools · 2026-06-06T21:35:57Z · trace

@drewstone drewstone merged commit ce246e3 into main Jun 6, 2026
1 check passed
@drewstone drewstone deleted the feat/improve-skill-family branch June 6, 2026 21:41
