feat(goal): ✨ Introduce a Judge acceptance agent to replace goal_scored self-attestation #224

Merged: jorben merged 8 commits into master from refact/goal-llm-judugement on Jun 7, 2026

HayWolf (Contributor) commented Jun 7, 2026

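Overview

Goal completion is no longer self-attested by the main agent through the goal_scored tool. An independent Judge subagent (agent_judge) now performs acceptance, which removes the trust flaw of the main agent acting as both player and referee.

Core changes

Judge subagent system

  • Add the agent_judge subagent configuration: depth=2, read-only file tools plus diagnostic shell tools (git_status/git_diff/term_status/term_output), with all write tools disabled
  • Add src-tauri/src/core/subagent/judge_contract.rs: the JudgeRequest/JudgeReport structured protocol; extract_judge_report falls back safely on parse failure, an empty summary, out-of-range values, and missing findings
  • Add the prompt template templates/subagent/judge.md and the output contract output_contract.judge.md (version 1)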
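
The safe-fallback behaviour described for extract_judge_report can be sketched roughly as follows. This is a minimal illustration, not the code in judge_contract.rs: the `RawJudgeReport` type, the `normalize_judge_report` and `rejected` functions, the fallback messages, and the rule that a report without findings never counts as passed are all assumptions made for the example. Only the field names (passed, completeness_pct, findings, summary) come from the PR.

```rust
/// Hypothetical, simplified stand-in for the real JudgeReport.
#[derive(Debug, Clone, PartialEq)]
pub struct JudgeReport {
    pub passed: bool,
    pub completeness_pct: u8,
    pub findings: Vec<String>,
    pub summary: String,
}

/// Loosely typed fields as they might come out of the Judge's output.
/// `None` means the field was missing or could not be decoded.
#[derive(Debug, Clone, Default)]
pub struct RawJudgeReport {
    pub passed: Option<bool>,
    pub completeness_pct: Option<i64>,
    pub findings: Option<Vec<String>>,
    pub summary: Option<String>,
}

/// Conservative verdict used whenever nothing usable was parsed.
fn rejected(reason: &str) -> JudgeReport {
    JudgeReport {
        passed: false,
        completeness_pct: 0,
        findings: vec![reason.to_string()],
        summary: reason.to_string(),
    }
}

/// Turn a possibly missing or malformed report into a well-formed one.
/// `None` models a parse failure of the whole output.
pub fn normalize_judge_report(raw: Option<RawJudgeReport>) -> JudgeReport {
    let Some(raw) = raw else {
        return rejected("judge output could not be parsed");
    };

    let summary = raw.summary.unwrap_or_default().trim().to_string();
    if summary.is_empty() {
        return rejected("judge returned an empty summary");
    }

    // Out-of-range percentages are clamped into 0..=100.
    let completeness_pct = raw.completeness_pct.unwrap_or(0).clamp(0, 100) as u8;

    let findings: Vec<String> = raw
        .findings
        .unwrap_or_default()
        .into_iter()
        .map(|f| f.trim().to_string())
        .filter(|f| !f.is_empty())
        .collect();

    // Assumption: a verdict without any findings is never treated as passed.
    let passed = raw.passed.unwrap_or(false) && !findings.is_empty();
    let findings = if findings.is_empty() {
        vec!["judge reported no findings".to_string()]
    } else {
        findings
    };

    JudgeReport { passed, completeness_pct, findings, summary }
}
```

The design choice worth noting is that every failure path resolves to `passed: false`, so a malformed Judge reply can delay completion but cannot grant it.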

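Runtime orchestration

  • agent_session.rs: inject agent_judge when a goal exists and is not (Complete && judge_passed)
  • agent_session_execution.rs: add the full execute_judge_tool flow (inject goal context → run Judge → extract report → record_judge_verdict → emit GoalCompleted if passed, otherwise GoalStateUpdated)
  • Hard rejection of recursive delegation: resolve_delegation returns an error for Judge, which also covers the parallel path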
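
The injection condition and the delegation gate above might look roughly like this. It is a sketch under assumptions: the `Goal` struct, the `DelegationTarget` enum, the `should_inject_agent_judge` helper, the error text, and the exact set of `GoalStatus` variants are hypothetical; the real resolve_delegation has a different signature that the PR page does not show.

```rust
/// Hypothetical subset of goal states; only Complete matters for injection.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GoalStatus {
    Active,
    Paused,
    BudgetLimited,
    Complete,
}

/// Hypothetical, trimmed-down goal record.
pub struct Goal {
    pub status: GoalStatus,
    pub judge_passed: bool,
}

/// Offer agent_judge only to the main agent, and only while the goal
/// exists and is not already verified (Complete && judge_passed).
pub fn should_inject_agent_judge(goal: Option<&Goal>, is_main_agent: bool) -> bool {
    match goal {
        Some(g) if is_main_agent => {
            !(g.status == GoalStatus::Complete && g.judge_passed)
        }
        _ => false,
    }
}

/// Hypothetical set of targets a subagent or a parallel batch may delegate to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DelegationTarget {
    Explore,
    Review,
    Judge,
}

/// Simplified stand-in for the delegation gate: Judge is never a valid
/// delegation target, so recursion into it fails with an error.
pub fn resolve_delegation(target: DelegationTarget) -> Result<DelegationTarget, String> {
    if target == DelegationTarget::Judge {
        return Err("agent_judge cannot be used as a delegation target".to_string());
    }
    Ok(target)
}
```

Keeping the gate in the resolver rather than in each caller is what lets a single check cover both the direct and the parallel path.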

Data model and persistence

  • GoalRecord/GoalDto/GoalPayload gain five fields: judge_passed/judge_completeness/judge_findings/judge_summary/judge_evaluated_run_id
  • Migration 20260607000000_goal_judge_fields.sql: ALTER adds 5 columns; existing rows with status='complete' are backfilled with judge_passed=1, completeness=100
  • goal_repo.rs: add the transactional record_judge_verdict (when passed, the same transaction also writes status=complete + evidence=summary)
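
To make the state change performed by record_judge_verdict concrete, here is an in-memory sketch. It is not the repository code: the real function runs SQL inside a transaction, whereas this example builds the next row as a copy so that either every field changes or none does. `GoalRow`, `JudgeVerdict`, and `apply_judge_verdict` are hypothetical names; the judge_* field names and the "passed also writes status=complete and evidence=summary" rule come from the PR description.

```rust
/// Hypothetical in-memory image of one goals row.
#[derive(Debug, Clone, PartialEq)]
pub struct GoalRow {
    pub status: String,
    pub evidence: Option<String>,
    pub judge_passed: bool,
    pub judge_completeness: Option<u8>,
    pub judge_findings: Vec<String>,
    pub judge_summary: Option<String>,
    pub judge_evaluated_run_id: Option<String>,
}

/// Hypothetical verdict, mirroring the JudgeReport fields named in the PR.
#[derive(Debug, Clone)]
pub struct JudgeVerdict {
    pub passed: bool,
    pub completeness_pct: u8,
    pub findings: Vec<String>,
    pub summary: String,
}

/// Compute the row as it should look after the verdict is recorded.
/// The input row is left untouched, which imitates all-or-nothing semantics.
pub fn apply_judge_verdict(row: &GoalRow, verdict: &JudgeVerdict, run_id: &str) -> GoalRow {
    let mut next = row.clone();

    // The judge_* columns are written for both pass and fail.
    next.judge_passed = verdict.passed;
    next.judge_completeness = Some(verdict.completeness_pct);
    next.judge_findings = verdict.findings.clone();
    next.judge_summary = Some(verdict.summary.clone());
    next.judge_evaluated_run_id = Some(run_id.to_string());

    // Only a passing verdict flips the goal to complete and stores evidence.
    if verdict.passed {
        next.status = "complete".to_string();
        next.evidence = Some(verdict.summary.clone());
    }
    next
}
```

Writing the status flip and the judge_* fields together is what rules out the invalid combination mentioned under Cleanup, where status=complete exists without judge_passed.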

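Goal manager

  • Remove the GoalVerdict::Complete variant, the GOAL_SCORED_* constants, MISSING_EVIDENCE_PROMPT, and ChallengePromptVariant
  • evaluate_after_run: Complete && judge_passed stops continuation; Complete && !judge_passed logs a warning and still stops
  • The continuation prompt now refers to agent_judge and appends the most recent findings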
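
The two Complete branches of evaluate_after_run and the findings-aware continuation prompt can be illustrated as below. This is only a sketch: `CompletionCheck`, `check_completion`, `continuation_prompt`, the warning text, and the prompt wording are invented for the example, and the non-Complete branches are deliberately left to "other rules" because the PR description does not spell them out.

```rust
/// Hypothetical subset of goal states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GoalStatus {
    Active,
    Paused,
    BudgetLimited,
    Complete,
}

/// Outcome of the completion part of the post-run evaluation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CompletionCheck {
    /// Goal is not Complete; the other continuation rules decide.
    NotComplete,
    /// Complete and accepted by the Judge: stop continuing.
    StopVerified,
    /// Complete but never accepted: still stop, and surface a warning.
    StopUnverified { warning: String },
}

pub fn check_completion(status: GoalStatus, judge_passed: bool) -> CompletionCheck {
    match (status, judge_passed) {
        (GoalStatus::Complete, true) => CompletionCheck::StopVerified,
        (GoalStatus::Complete, false) => CompletionCheck::StopUnverified {
            warning: "goal is complete but was never accepted by the Judge".to_string(),
        },
        _ => CompletionCheck::NotComplete,
    }
}

/// Build a continuation prompt that points at agent_judge and appends
/// the most recent Judge findings, if there are any.
pub fn continuation_prompt(latest_findings: &[String]) -> String {
    let mut prompt = String::from(
        "The goal is not verified yet. Keep working, then call agent_judge for acceptance.",
    );
    if !latest_findings.is_empty() {
        prompt.push_str("\nMost recent Judge findings:");
        for finding in latest_findings {
            prompt.push_str("\n- ");
            prompt.push_str(finding);
        }
    }
    prompt
}
```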

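Cleanup

  • Remove the goal_scored tool definition and execution dispatch (zero references in the runtime, frontend, and gateway)
  • Remove GoalManager::mark_complete (it wrote status=complete directly without setting judge_passed, producing an invalid combination; it had no production callers)
  • Fix the BuiltinSubagent/AnySubagent doc comments, which omitted judge
  • Frontend: remove the dead "complete" member from GoalEvaluateResult.verdict

Motivation

Letting the main agent self-attest goal completion through goal_scored is a significant trust flaw: the AI may give itself a flattering score. An independent Judge subagent separates the acceptor from the executor. The Judge must base its verdict on actual file changes, git diff, and terminal output, and cannot get by on conversation context alone.

Design document: docs/goal-judge-evaluation-refactor.md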

Testing

Backend tests (all passing)

  • goal_lifecycle: 28 tests (including record_judge_verdict pass/fail, migration backfill, and each evaluate branch)
  • subagent module: 73 tests (including 10 for judge_contract and 4 for runtime_orchestration Judge)
  • Full suite: cargo test --locked passes

Frontend checks

  • npm run typecheck: no errors
  • npm run test:unit: 840 passed, 1 skipped

Formatting

  • cargo fmt --check: clean

Breaking changes

None. Existing status=complete goals are backfilled with judge_passed=1 by the migration; the "complete" literal removed from the frontend GoalEvaluateResult was never produced by the backend.

Related

Refs docs/goal-judge-evaluation-refactor.md

Checklist

  • Code style passes cargo fmt --check
  • Backend tests: cargo test --locked all pass
  • Frontend npm run typecheck reports no errors
  • Frontend npm run test:unit all pass
  • Dead code removed (goal_scored, mark_complete, GoalVerdict::Complete, the frontend "complete" literal)
  • Doc comments updated (BuiltinSubagent/AnySubagent now mention judge)

🤖 Generated with TiyCode

jorben added 2 commits June 7, 2026 11:54
…udge acceptance agent

Remove the `goal_scored` tool that allowed the main agent to
self-attest goal completion, replacing it with an `agent_judge`
built-in subagent that independently verifies goal attainment
against the project's current state.

Key changes:
- Add `SubagentProfile::Judge` with read-only file tools and
  diagnostic-only shell (soft constraint via prompt)
- Add `JudgeReport` structured contract (passed, completeness_pct,
  findings, summary) with safe fallback parsing
- Add `agent_judge` tool injection only for the main agent when
  an unverified goal exists; runtime gate blocks subagent/parallel
  recursion into Judge
- Add DB migration for `judge_passed`, `judge_completeness`,
  `judge_findings`, `judge_summary`, `judge_evaluated_run_id`
  columns with backfill for legacy `status='complete'` goals
- Replace continuation stop condition: `Complete && judge_passed`
  instead of `goal_scored`-driven status flip
- Rewrite continuation prompt to instruct main agent to call
  `agent_judge` and follow findings on rejection
- Add Judge prompt surface, templates, and output contract
- Update `active_goal.tpl.md` to reflect Judge acceptance flow
- Extend goal lifecycle tests for Judge pass/fail/legacy compat

Remove the mark_complete pathway from goals as completion will be
handled through a different mechanism:

- Remove mark_complete method from GoalManager
- Remove "complete" from GoalEvaluateResult verdict type
- Remove mark_complete test cases (evidence validation, etc.)
- Update subagent surface comments to include judge

BREAKING CHANGE: GoalEvaluateResult.verdict no longer includes "complete"

github-actions Bot commented Jun 7, 2026

AI Code Review Summary

PR: #224 (feat(goal): ✨ 引入 Judge 验收 Agent 替代 goal_scored 自证)
Preferred language: English

Overall Assessment

Detected 2 actionable findings; prioritize CRITICAL/HIGH before merge.

Major Findings by Severity

  • CRITICAL (1)
    • src-tauri/src/persistence/repo/goal_repo.rs:8 - Missing migration for new judge columns and removed time_used_seconds
  • HIGH (1)
    • src-tauri/src/core/agent_session_execution.rs:1668 - Judge tool does not require goal to be Active

Actionable Suggestions

  • Add a database migration for the new goal schema (drop time_used_seconds, add judge_* columns).
  • Add an 'Active' status guard in execute_judge_tool to prevent acceptance of non-active goals.
  • Coordinate with frontend to update goal UI to handle the removal of time_used_seconds.
  • Run cargo check and npm run typecheck on the full workspace to catch any broken references from the removals of get_run_elapsed_seconds, get_active_run_elapsed_seconds, and timeUsedSeconds.
  • Search for timeUsedSeconds in src/**/*.{ts,tsx} and remove or update all accesses.
  • Verify that the Judge subagent templates (judge.md, output_contract.judge.md) exist and are properly formatted; include integration tests that invoke the judge flow.
  • Review the implementation of record_judge_verdict in goal_repo.rs (not in this diff) to ensure it uses parameterized queries and properly sanitizes inputs.
  • Add judge fields (judgePassed, judgeCompleteness, judgeFindings, judgeSummary, judgeEvaluatedRunId) to the makeGoalPayload helper in agent-commands.test.ts.
  • Write a unit test for goalGetState that mocks a full payload including judge fields and asserts they are stored in the GoalStoreState correctly.
  • Consider adding a unit test for validate_slug to explicitly check that 'judge' is reserved.

Potential Risks

  • Without migration, the application will crash on goal queries.
  • Paused or budget-limited goals could be inadvertently completed by the Judge.
  • Frontend may display errors or missing data for goal elapsed time.
  • Compilation failure from removed Rust functions if any call site remains outside the diff.
  • Runtime undefined errors in goal-time UI components if timeUsedSeconds is still read.
  • The migration SQL in the test may not match the actual migration file, causing the test to pass while the real migration is incomplete.
  • The judge verdict recording function is not part of this batch; if improperly implemented, it could be susceptible to SQL injection or data corruption.
  • If the frontend store update logic silently drops judge fields because of a missing selector or incorrect field name, users will never see the verified status or findings in the UI, even though the backend correctly persists them.

Test Suggestions

  • Write integration tests for the new judge execution flow (positive/negative).
  • Test goal_repo::record_judge_verdict with both passed=true and false.
  • Add an integration test that triggers a full judge cycle: work → agent_judge invocation → record_judge_verdict pass/fail.
  • Ensure cargo test --manifest-path src-tauri/Cargo.toml passes with the new goal_lifecycle tests.
  • Run npm run test:unit to confirm frontend helper changes do not break existing tests.
  • Add integration tests for the record_judge_verdict function with malicious inputs to verify safe data handling.
  • GoalStoreState update: test that when goalGetState returns a payload with judgePassed: true and judgeCompleteness: 100, the corresponding store slice reflects those values.
  • Subagent slug reservation: add a test in subagent.rs that validate_slug("judge") returns an error (already part of the reserved list).

File-Level Coverage Notes

  • src-tauri/src/core/agent_session_execution.rs: Major new judge execution logic added, old goal_scored removed. The judge flow is well-structured but lacks an active-goal guard. Otherwise appears correct. (The judge tool guard in execute_helper_tool properly prevents subagent from calling it.)
  • src-tauri/src/core/agent_session.rs: Injection logic for agent_judge tool is correct for verified/unverified goals, but does not consider active status, which could be later refined.
  • src-tauri/src/core/agent_session_tools.rs: Removed old goal_scored tool definition and added mapping for Judge profile and model role. Safe changes.
  • src-tauri/src/core/agent_run_manager.rs: Removed planning-run time accounting and goal state event emission. The removal aligns with new judge-based system but may lose accumulated time for frontend. (The goal_repo import was removed as no longer needed.)
  • src-tauri/src/core/agent_run_event_handler.rs: All pause tracking calls removed, simplifying event handling. No longer needed with new goal model.
  • src-tauri/src/commands/agent.rs: Removed pause time accounting before pausing goal; consistent with new logic.
  • src-tauri/src/core/app_state.rs: Removed pause tracking state and tests. Clean removal.
  • src-tauri/src/core/goal_manager.rs: Adapted prompt templates and evaluation logic to use agent_judge. Removed old complete verdict and accounting. Challenge prompt now includes continuation. Good. (The warning for Complete !judge_passed is a nice fallback.)
  • src-tauri/src/persistence/repo/goal_repo.rs: Schema updated to include judge columns and drop time_used_seconds. Transactional verdict recording is safe. However, missing migration is critical.
  • src-tauri/src/model/goal.rs: Model updated to new fields; removal of time_used_seconds may break frontend.
  • src-tauri/src/core/subagent/runtime_orchestration.rs: Added Judge variant with proper read-only tools, delegation constraints, and tests. Bumped builtin max depth to 5. Well-implemented. (The depth increase is global; confirm it doesn't enable unintended deep chains.)
  • src-tauri/src/core/subagent/orchestrator.rs: Added prohibition of Judge as a delegation target and mapping for PromptSurface. Minor test update.
  • src-tauri/src/core/subagent/judge_contract.rs: New file with robust parsing, normalization, and tests. Good defensive design. (Excellent fallback when JSON cannot be parsed – never returns passed without evidence.)
  • src-tauri/src/core/subagent/mod.rs: Added judge_contract module.
  • src-tauri/src/gateway/gateway_runner.rs: Updated kickoff prompt to reference agent_judge. No risk.
  • src-tauri/src/ipc/frontend_channels.rs: Comment update only.
  • src-tauri/src/core/prompt/sources/custom_subagent_body.rs: No tests exist for this file; the new Judge body sourcing logic is not independently tested, which is acceptable for template-based code.
  • src-tauri/src/core/prompt/sources/subagent_output_contract.rs: No tests for the new Judge output contract template; low risk due to simple template inclusion.
  • src-tauri/src/core/prompt/surface.rs: No tests in this file; the addition of SubagentJudge variant is covered by pattern-matching tests in surface_extensions.rs.
  • src-tauri/src/core/prompt/surface_extensions.rs: Tests were updated to include the Judge variant in the surface extensions, covering pattern matching and is_subagent checks.
  • ... and 11 more file-level entries.

Inline Downgraded Items (processed but not inline)

  • None

Coverage Status

  • Target files: 31
  • Covered files: 31
  • Uncovered files: 0
  • No-patch/binary covered as file-level: 0
  • Findings with unknown confidence (N/A): 0

Uncovered list:

  • None

No-patch covered list:

  • None

Runtime/Budget

  • Rounds used: 2/4
  • Planned batches: 2
  • Executed batches: 2
  • Sub-agent runs: 5
  • Planner calls: 2
  • Reviewer calls: 6
  • Model calls: 8/64
  • Structured-output summary-only degradation: NO

@github-actions github-actions Bot left a comment

Automated PR review completed.

  • Findings kept: 4
  • Findings with unknown confidence: 0
  • Inline comments attempted: 4
  • Target files: 24
  • Covered files: 24
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

  export type GoalEvaluateResult = {
    goal: GoalPayload;
-   verdict: "continue" | "challenge_evidence" | "complete" | "paused" | "budget_limited";
+   verdict: "continue" | "challenge_evidence" | "paused" | "budget_limited" | "skipped";

This comment was marked as outdated.

  case "paused": return "goal.status.paused";
  case "budget_limited": return "goal.status.budgetLimited";
- case "complete": return "goal.status.complete";
+ case "complete": return goal.judgePassed ? "goal.status.verified" : "goal.status.complete";

This comment was marked as outdated.

pauseDetail?: string | null;
evidence?: string | null;
lastEvaluatedRunId?: string | null;
judgePassed?: boolean;

This comment was marked as outdated.

pauseDetail?: string | null;
evidence?: string | null;
lastEvaluatedRunId?: string | null;
judgePassed?: boolean;

This comment was marked as outdated.

jorben added 2 commits June 7, 2026 12:47
Update the feature descriptions and reorder the bullet points in both
README.md and README_zh.md to better reflect the current product
capabilities and improve readability. Changes include:

- Reordering features to highlight persistent goal management, real-time
  streaming, and extensibility earlier in the list
- Updating descriptions for several features to be more accurate
- Maintaining consistency between English and Chinese versions
- Keeping the overall structure while improving flow

These are documentation-only changes that do not affect functionality.

- Extract inline status key resolution into a pure exported function
  so the complete→verified (judgePassed) branch can be unit-tested
  without mounting the component
- Add unit tests covering all status mappings and judgePassed variants
- Add test for skipped verdict passthrough in goalEvaluate

@github-actions github-actions Bot left a comment

Automated PR review completed.

  • Findings kept: 4
  • Findings with unknown confidence: 0
  • Inline comments attempted: 4
  • Target files: 26
  • Covered files: 26
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

},
}))
}
Some(SubagentProfile::Judge) => {

This comment was marked as outdated.

(EXPLORE_TEMPLATE_REL_PATH, EXPLORE_TEMPLATE_EMBEDDED)
}
Some(SubagentProfile::Review) => (REVIEW_TEMPLATE_REL_PATH, REVIEW_TEMPLATE_EMBEDDED),
Some(SubagentProfile::Judge) => (JUDGE_TEMPLATE_REL_PATH, JUDGE_TEMPLATE_EMBEDDED),

This comment was marked as outdated.

RuntimeOrchestrationTool::Parallel => {
return Err("agent_parallel cannot be used as an individual helper".to_string());
}
RuntimeOrchestrationTool::Judge => {

This comment was marked as outdated.

return Err("agent_parallel cannot be used as an individual helper".to_string());
}

if tool == RuntimeOrchestrationTool::Judge {

This comment was marked as outdated.

jorben added 2 commits June 7, 2026 14:59
Raise `BUILTIN_DEFAULT_MAX_DELEGATION_DEPTH` from 3 to 5 to match the
existing `GLOBAL_MAX_DELEGATION_DEPTH`, allowing built-in subagents
(explore/review) to be delegated to the same depth as custom profiles.

Update delegation validation tests to reflect the new depth limits.

@github-actions github-actions Bot left a comment

Automated PR review completed.

  • Findings kept: 5
  • Findings with unknown confidence: 0
  • Inline comments attempted: 5
  • Target files: 26
  • Covered files: 26
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

/// Run the main-agent-only `agent_judge` acceptance flow: build a Judge task
/// with the current goal injected, run the Judge helper, parse its structured
/// verdict, persist it, and (on pass) flip the goal to verified/complete.
async fn execute_judge_tool(

This comment was marked as outdated.

  export type GoalEvaluateResult = {
    goal: GoalPayload;
-   verdict: "continue" | "challenge_evidence" | "complete" | "paused" | "budget_limited";
+   verdict: "continue" | "challenge_evidence" | "paused" | "budget_limited" | "skipped";

This comment was marked as outdated.

/// Run the main-agent-only `agent_judge` acceptance flow: build a Judge task
/// with the current goal injected, run the Judge helper, parse its structured
/// verdict, persist it, and (on pass) flip the goal to verified/complete.
async fn execute_judge_tool(

This comment was marked as outdated.

custom_subagent_tools,
);

// Inject the main-agent-only `agent_judge` acceptance tool on demand: only

This comment was marked as outdated.

},
}))
}
Some(SubagentProfile::Judge) => {

This comment was marked as outdated.

@github-actions github-actions Bot left a comment

Automated PR review completed.

  • Findings kept: 0
  • Findings with unknown confidence: 0
  • Inline comments attempted: 1
  • Target files: 26
  • Covered files: 26
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

@@ -601,6 +601,34 @@ pub async fn build_session_spec(
.await

Automated review completed for this PR diff. No concrete inline issue was selected after aggregation.

@github-actions github-actions Bot left a comment

Automated PR review completed.

  • Findings kept: 2
  • Findings with unknown confidence: 0
  • Inline comments attempted: 2
  • Target files: 31
  • Covered files: 31
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

  const SELECT_COLUMNS: &str = "id, thread_id, objective, status, token_budget, tokens_used, \
-     time_used_seconds, turns_used, max_turns, pause_reason, pause_detail, evidence, \
-     last_evaluated_run_id, created_at, updated_at";
+     turns_used, max_turns, pause_reason, pause_detail, evidence, \

[CRITICAL] Missing migration for new judge columns and removed time_used_seconds

The GoalRepo select and insert statements now reference columns that do not exist unless a database migration has been applied. No migration file is present in this batch; running against an existing database will cause sqlx errors.

Suggestion: Add an sqlx migration that drops time_used_seconds and adds the five judge columns with appropriate defaults (judge_passed=0, others NULL). Ensure it is applied before the new code runs.

Risk: Application will fail to start or crash during goal reads/writes if the migration is missing.

Confidence: 0.90

[From SubAgent: general]

.ok();
return AgentToolResult::text(result_text);
}
if goal.status == crate::model::goal::GoalStatus::Complete && goal.judge_passed {

[HIGH] Judge tool does not require goal to be Active

A paused, budget-limited, or otherwise non-Active goal can be accepted by the Judge and marked complete via agent_judge, bypassing the lifecycle rules that previously prevented completion when not active.

Suggestion: Add a check at the start of execute_judge_tool to reject the call if goal.status is not Active (or at least not Active or Paused with a valid reason). Alternatively, require the goal to be Active for acceptance.

Risk: Goals that were paused or limited may be prematurely completed, leading to incorrect state transitions and confusing user experience.

Confidence: 0.95

[From SubAgent: general]
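
A guard along the lines of the reviewer's suggestion could look like the sketch below. It only illustrates the suggestion and makes no claim about what the merged code does: `ensure_goal_is_active`, the error wording, and the `GoalStatus` variant set are assumptions, and an already verified goal (Complete && judge_passed) would presumably still be handled by the earlier short-circuit shown in the diff context rather than by this check.

```rust
/// Hypothetical subset of goal states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GoalStatus {
    Active,
    Paused,
    BudgetLimited,
    Complete,
}

/// Reject Judge acceptance unless the goal is currently Active, so that a
/// paused or budget-limited goal cannot be flipped to complete.
pub fn ensure_goal_is_active(status: GoalStatus) -> Result<(), String> {
    match status {
        GoalStatus::Active => Ok(()),
        other => Err(format!(
            "agent_judge requires an active goal, but the goal is {other:?}"
        )),
    }
}
```

Called at the top of the acceptance flow, a failed check would be returned to the main agent as the tool result instead of running the Judge.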

@jorben jorben merged commit 6c0f5aa into master Jun 7, 2026
4 checks passed
@jorben jorben deleted the refact/goal-llm-judugement branch June 7, 2026 12:11
2 participants