Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
113 changes: 113 additions & 0 deletions .claude/skills/eval-campaign/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
---
name: eval-campaign
description: Wire a product agent's self-improvement loop (measure → optimize → gate → ship) onto the shared @tangle-network/agent-app/eval-campaign scaffold. Use when adding or refactoring any product agent's eval/ loop.
---

# Wiring a product onto the eval-campaign scaffold

You are integrating a product agent's self-improvement loop. The loop **engine already exists** in the substrate — do not rebuild it. Your job is to supply the three things only the product knows, and call one function.

## Mental model (read first)

`selfImprove` (from `@tangle-network/agent-eval/contract`, re-exported here) owns the entire cycle:

- the **train/holdout split** from a flat `scenarios` array,
- the **driver** (default `gepaDriver` from your `mutationPrimitives`),
- the **held-out production gate** (default `defaultProductionGate`, `deltaThreshold` 0.05),
- **durable provenance** + optional hosted ingest,
- every budget/seed/storage default.

A product brings exactly three things:

1. **`scenarios`** — your corpus (personas / cases / tasks) in the substrate `Scenario` shape.
2. **`agent`** — `(surface, scenario, ctx) => artifact`: run your agent under the current surface (a system-prompt addendum the loop optimizes) and return the artifact your judge scores. Report real cost via `ctx.cost.observe(...)` so the backend-integrity guard sees a real run.
3. **`judge`** — score an artifact on your rubric. Use `buildEnsembleJudge` (below) for a multi-model ensemble, or hand-write a `JudgeConfig` for a bespoke composite.

Everything else is a default you override only when you have a reason.

## The one import

```ts
import {
selfImprove,
buildEnsembleJudge,
type SelfImproveOptions,
type JudgeVerdict,
} from '@tangle-network/agent-app/eval-campaign'
```

> Requires `@tangle-network/agent-eval >= 0.81.0` (peer). The scaffold composes the substrate downward; never import a product package from agent-eval (layering rule).

## Minimal wiring (copy, then fill the three blanks)

```ts
const RUBRIC = ['accuracy', 'grounding', 'tone'] as const
type Dim = (typeof RUBRIC)[number]

const judge = buildEnsembleJudge<MyArtifact, MyScenario, Dim>({
name: 'my-product',
rubric: RUBRIC,
judgeReps: 3, // 3 uncorrelated judges → inter-rater bands
async scoreOne({ artifact, scenario, rep }) {
const model = JUDGE_MODELS[rep % JUDGE_MODELS.length] // vary the model per rep
try {
const v = await callMyJudge(model, artifact, scenario) // → { accuracy, grounding, tone }
return { model, perDimension: v, rationale: v.note, costUsd: v.cost }
} catch (err) {
return { model, perDimension: null, rationale: String(err) } // failure ≠ zero
}
},
})

const result = await selfImprove<MyScenario, MyArtifact>({
scenarios: loadMyScenarios(), // YOU own
agent: dispatchUnderSurface, // YOU own — (surface, scenario, ctx) => artifact
judge, // built above
baselineSurface: '', // the addendum the loop optimizes (start empty)
mutationPrimitives: MY_DIRECTIVES, // the optimization levers (default driver mutates toward these)
runDir: process.env.MY_RUN_DIR, // a real path → durable provenance; omit → in-memory
// budget / model / gate / hostedTenant all default — override only when needed
})

if (result.gate.decision === 'ship') await ship(result.winnerSurface)
```

## `buildEnsembleJudge` contract

- `scoreOne` is called `judgeReps` times per artifact; **vary the model by `rep`** so the ensemble is uncorrelated (judges sharing a base model share its bias).
- Return `{ model, perDimension: null }` to record a judge failure **without** killing the ensemble — the reducer means over survivors.
- The reducer (`aggregateJudgeVerdicts`) **throws only if every rep failed** → the campaign records a failed cell, never a silent zero.
- `weights` (partial) selects-and-weights named dimensions; default is uniform.

## Config reference (all `SelfImproveOptions`, all optional unless noted)

| Field | Default | When to set |
|---|---|---|
| `scenarios` | — (required) | your corpus |
| `agent` | — (required) | your dispatch under a surface |
| `judge` | — (required) | `buildEnsembleJudge` or a `JudgeConfig` |
| `baselineSurface` | — (required) | the surface the loop optimizes; start `''` |
| `mutationPrimitives` | gepaDriver's own | your optimization levers (additive directives) |
| `driver` | `gepaDriver` | pass `evolutionaryDriver({ mutator })` for blind addendum rotation |
| `gate` | `defaultProductionGate` (Δ 0.05) | `paretoSignificanceGate` for multi-objective; tune `deltaThreshold` for your rubric scale |
| `budget` | 3 gens × pop 2, 0.25 holdout | raise for deeper search |
| `runDir` | `mem://…` (non-durable) | a real path to persist provenance + spans |
| `hostedTenant` | off | ship eval-run events to a hosted orchestrator |
| `collectWorkerRecords` | — | return the per-call `RunRecord`s your agent accumulated → real backend-integrity verdict |
| `onProgress` | — | stream baseline/generation/gate events to a UI |

## Fail-loud contract (do not break)

- In `agent`, report real cost via `ctx.cost.observe(costUsd, label)` + `ctx.cost.observeTokens(...)`. A dispatch that reports `{0,0}` trips `expectUsage` — that is the honest "ran against a stub" signal; never paper over it.
- A judge failure is `perDimension: null`, never a fabricated zero.
- Train and holdout must both be non-empty (`selfImprove` derives the split; supply enough scenarios).

## Anti-patterns (these are what this scaffold deletes)

- ❌ Hand-rolling `runImprovementLoop({...})` + `emitLoopProvenance({...})` + a train/holdout split. That is ~100 lines of identical boilerplate per product. Call `selfImprove`.
- ❌ A per-product copy of the judge-ensemble reducer (survivor-mean / disagreement / cost-sum). Use `buildEnsembleJudge` → `aggregateJudgeVerdicts`.
- ❌ `import type` from a product package inside the scaffold or substrate (upward dependency — forbidden).

## Where it lives in the product

One file: `eval/self-improve.ts`. It exports `runMyEval` (measure: `selfImprove` with `budget.generations = 0`, or `runCampaign`) and `runMySelfImprovement` (optimize: the wiring above). The product's harness/CLI calls these; nothing else duplicates the loop.
9 changes: 7 additions & 2 deletions package.json
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,11 @@
"import": "./dist/eval/index.js",
"default": "./dist/eval/index.js"
},
"./eval-campaign": {
"types": "./dist/eval-campaign/index.d.ts",
"import": "./dist/eval-campaign/index.js",
"default": "./dist/eval-campaign/index.js"
},
"./knowledge": {
"types": "./dist/knowledge/index.d.ts",
"import": "./dist/knowledge/index.js",
Expand Down Expand Up @@ -131,7 +136,7 @@
"typecheck": "tsc --noEmit"
},
"devDependencies": {
"@tangle-network/agent-eval": "^0.70.0",
"@tangle-network/agent-eval": "^0.81.0",
"@tangle-network/agent-integrations": "^0.32.0",
"@tangle-network/agent-knowledge": "^1.5.2",
"@types/node": "^25.6.0",
Expand All @@ -140,7 +145,7 @@
"vitest": "^3.0.0"
},
"peerDependencies": {
"@tangle-network/agent-eval": ">=0.50.0",
"@tangle-network/agent-eval": ">=0.81.0",
"@tangle-network/agent-integrations": ">=0.32.0",
"@tangle-network/agent-knowledge": ">=1.5.0",
"@tangle-network/agent-runtime": ">=0.21.0"
Expand Down
10 changes: 5 additions & 5 deletions pnpm-lock.yaml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

98 changes: 98 additions & 0 deletions src/eval-campaign/index.test.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
/**
* buildEnsembleJudge wraps the substrate reducer into a JudgeConfig. These
* tests pin the contract the loop depends on: N reps fan out, a single rep
* failing does not fail the cell, all-failed throws (failed cell, not a zero),
* and the JudgeScore shape is exactly what runCampaign/selfImprove consume.
*/

import { describe, expect, it } from 'vitest'

import type { Scenario } from '@tangle-network/agent-eval/campaign'
import { buildEnsembleJudge } from './index'

type Dim = 'accuracy' | 'tone'
const RUBRIC = ['accuracy', 'tone'] as const

interface Art {
text: string
}
const scenario: Scenario = { id: 's1', kind: 'test' }
const signal = new AbortController().signal

describe('buildEnsembleJudge', () => {
it('fans out judgeReps calls and returns the JudgeScore shape', async () => {
let calls = 0
const judge = buildEnsembleJudge<Art, Scenario, Dim>({
name: 'test',
rubric: RUBRIC,
judgeReps: 3,
async scoreOne({ rep }) {
calls++
return { model: `m${rep}`, perDimension: { accuracy: 0.8, tone: 0.6 } }
},
})
const score = await judge.score({ artifact: { text: 'x' }, scenario, signal })
expect(calls).toBe(3)
expect(score.dimensions.accuracy).toBeCloseTo(0.8, 5)
expect(score.dimensions.tone).toBeCloseTo(0.6, 5)
expect(score.composite).toBeCloseTo(0.7, 5)
expect(typeof score.notes).toBe('string')
})

it('exposes the rubric as JudgeDimensions', () => {
const judge = buildEnsembleJudge<Art, Scenario, Dim>({
name: 'test',
rubric: RUBRIC,
describe: (d) => `desc:${d}`,
async scoreOne() {
return { model: 'm', perDimension: { accuracy: 1, tone: 1 } }
},
})
expect(judge.dimensions).toEqual([
{ key: 'accuracy', description: 'desc:accuracy' },
{ key: 'tone', description: 'desc:tone' },
])
})

it('a single rep failing does NOT fail the cell — means over survivors', async () => {
const judge = buildEnsembleJudge<Art, Scenario, Dim>({
name: 'test',
rubric: RUBRIC,
judgeReps: 2,
async scoreOne({ rep }) {
if (rep === 0) throw new Error('judge 0 down')
return { model: 'm1', perDimension: { accuracy: 0.9, tone: 0.9 } }
},
})
const score = await judge.score({ artifact: { text: 'x' }, scenario, signal })
expect(score.dimensions.accuracy).toBeCloseTo(0.9, 5) // survivor only, not (0.9+0)/2
})

it('throws (failed cell, not a zero) when every rep fails', async () => {
const judge = buildEnsembleJudge<Art, Scenario, Dim>({
name: 'test',
rubric: RUBRIC,
judgeReps: 2,
async scoreOne() {
throw new Error('all down')
},
})
await expect(judge.score({ artifact: { text: 'x' }, scenario, signal })).rejects.toThrow(
/all 2 judges failed/,
)
})

it('rejects an empty rubric and judgeReps < 1', () => {
expect(() =>
buildEnsembleJudge<Art, Scenario, Dim>({ name: 't', rubric: [], scoreOne: async () => ({ model: 'm', perDimension: null }) }),
).toThrow(/rubric is empty/)
expect(() =>
buildEnsembleJudge<Art, Scenario, Dim>({
name: 't',
rubric: RUBRIC,
judgeReps: 0,
scoreOne: async () => ({ model: 'm', perDimension: { accuracy: 1, tone: 1 } }),
}),
).toThrow(/judgeReps must be >= 1/)
})
})
Loading
Loading