feat(goal): ✨ Introduce a Judge acceptance agent to replace goal_scored self-attestation #224
…udge acceptance agent

Remove the `goal_scored` tool that allowed the main agent to self-attest goal completion, replacing it with an `agent_judge` built-in subagent that independently verifies goal attainment against the project's current state.

Key changes:
- Add `SubagentProfile::Judge` with read-only file tools and diagnostic-only shell (soft constraint via prompt)
- Add `JudgeReport` structured contract (passed, completeness_pct, findings, summary) with safe fallback parsing
- Add `agent_judge` tool injection only for the main agent when an unverified goal exists; runtime gate blocks subagent/parallel recursion into Judge
- Add DB migration for `judge_passed`, `judge_completeness`, `judge_findings`, `judge_summary`, `judge_evaluated_run_id` columns with backfill for legacy `status='complete'` goals
- Replace continuation stop condition: `Complete && judge_passed` instead of `goal_scored`-driven status flip
- Rewrite continuation prompt to instruct main agent to call `agent_judge` and follow findings on rejection
- Add Judge prompt surface, templates, and output contract
- Update `active_goal.tpl.md` to reflect Judge acceptance flow
- Extend goal lifecycle tests for Judge pass/fail/legacy compat
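The new continuation stop condition can be read as a small decision table over the goal status and the Judge verdict. The sketch below is only an illustration under stated assumptions: `GoalStatus` is simplified, and `Continuation` and `decide_continuation` are hypothetical names that do not come from the PR. The "complete but not judged" branch follows the PR description, which says that case logs a warning and still stops.

```rust
/// Simplified stand-in for the real goal status (assumption: the real enum differs).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GoalStatus {
    Active,
    Paused,
    BudgetLimited,
    Complete,
}

/// Hypothetical outcome of the post-run evaluation.
#[derive(Debug, PartialEq, Eq)]
pub enum Continuation {
    /// Keep going; the continuation prompt asks the main agent to call `agent_judge`.
    Continue,
    /// Stop: the goal is complete and the Judge accepted it.
    StopVerified,
    /// Stop, but the caller should log a warning: complete without a Judge pass.
    StopUnverified,
    /// Stop for lifecycle reasons (paused or out of budget).
    StopInactive,
}

/// Decide whether the continuation loop keeps running after a run finishes.
pub fn decide_continuation(status: GoalStatus, judge_passed: bool) -> Continuation {
    match (status, judge_passed) {
        (GoalStatus::Complete, true) => Continuation::StopVerified,
        (GoalStatus::Complete, false) => Continuation::StopUnverified,
        (GoalStatus::Active, _) => Continuation::Continue,
        (GoalStatus::Paused | GoalStatus::BudgetLimited, _) => Continuation::StopInactive,
    }
}
```

Keeping the decision in a pure function like this makes each branch easy to cover in lifecycle tests without a database.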
Remove the mark_complete pathway from goals as completion will be handled through a different mechanism:
- Remove mark_complete method from GoalManager
- Remove "complete" from GoalEvaluateResult verdict type
- Remove mark_complete test cases (evidence validation, etc.)
- Update subagent surface comments to include judge

BREAKING CHANGE: GoalEvaluateResult.verdict no longer includes "complete"
AI Code Review Summary
PR: #224 (feat(goal): ✨ Introduce a Judge acceptance agent to replace goal_scored self-attestation)
Overall Assessment
Detected 2 actionable findings; prioritize CRITICAL/HIGH before merge.
Major Findings by Severity
Actionable Suggestions
Potential Risks
Test Suggestions
File-Level Coverage Notes
Inline Downgraded Items (processed but not inline)
Coverage Status
Uncovered list:
No-patch covered list:
Runtime/Budget
  export type GoalEvaluateResult = {
    goal: GoalPayload;
-   verdict: "continue" | "challenge_evidence" | "complete" | "paused" | "budget_limited";
+   verdict: "continue" | "challenge_evidence" | "paused" | "budget_limited" | "skipped";
This comment was marked as outdated.
  case "paused": return "goal.status.paused";
  case "budget_limited": return "goal.status.budgetLimited";
- case "complete": return "goal.status.complete";
+ case "complete": return goal.judgePassed ? "goal.status.verified" : "goal.status.complete";
  pauseDetail?: string | null;
  evidence?: string | null;
  lastEvaluatedRunId?: string | null;
  judgePassed?: boolean;
Update the feature descriptions and reorder the bullet points in both README.md and README_zh.md to better reflect the current product capabilities and improve readability.

Changes include:
- Reordering features to highlight persistent goal management, real-time streaming, and extensibility earlier in the list
- Updating descriptions for several features to be more accurate
- Maintaining consistency between English and Chinese versions
- Keeping the overall structure while improving flow

These are documentation-only changes that do not affect functionality.
- Extract inline status key resolution into a pure exported function so the complete→verified (judgePassed) branch can be unit-tested without mounting the component
- Add unit tests covering all status mappings and judgePassed variants
- Add test for skipped verdict passthrough in goalEvaluate
        },
    }))
}
Some(SubagentProfile::Judge) => {
    (EXPLORE_TEMPLATE_REL_PATH, EXPLORE_TEMPLATE_EMBEDDED)
}
Some(SubagentProfile::Review) => (REVIEW_TEMPLATE_REL_PATH, REVIEW_TEMPLATE_EMBEDDED),
Some(SubagentProfile::Judge) => (JUDGE_TEMPLATE_REL_PATH, JUDGE_TEMPLATE_EMBEDDED),
RuntimeOrchestrationTool::Parallel => {
    return Err("agent_parallel cannot be used as an individual helper".to_string());
}
RuntimeOrchestrationTool::Judge => {
    return Err("agent_parallel cannot be used as an individual helper".to_string());
}

if tool == RuntimeOrchestrationTool::Judge {
Raise `BUILTIN_DEFAULT_MAX_DELEGATION_DEPTH` from 3 to 5 to match the existing `GLOBAL_MAX_DELEGATION_DEPTH`, allowing built-in subagents (explore/review) to be delegated to the same depth as custom profiles. Update delegation validation tests to reflect the new depth limits.
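As a rough illustration of what aligning the two limits means, the sketch below caps a profile's own limit by the global one. Only the two constant names and their values come from the commit message; `validate_delegation_depth`, its parameters, and the exact meaning of "depth" are assumptions made for this example.

```rust
/// Hard ceiling for any subagent profile (value stated in the commit message).
pub const GLOBAL_MAX_DELEGATION_DEPTH: u32 = 5;
/// Default for built-in subagents such as explore/review (raised from 3 to 5).
pub const BUILTIN_DEFAULT_MAX_DELEGATION_DEPTH: u32 = 5;

/// Hypothetical check: a delegation is allowed only if the requested depth stays
/// within both the profile's limit and the global ceiling.
pub fn validate_delegation_depth(requested_depth: u32, profile_max: u32) -> Result<(), String> {
    let limit = profile_max.min(GLOBAL_MAX_DELEGATION_DEPTH);
    if requested_depth > limit {
        Err(format!(
            "delegation depth {requested_depth} exceeds the limit of {limit}"
        ))
    } else {
        Ok(())
    }
}
```

With the old built-in default of 3, a request at depth 4 would have been rejected for explore/review while a custom profile could still reach 5; with both constants at 5 the two cases behave the same.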
/// Run the main-agent-only `agent_judge` acceptance flow: build a Judge task
/// with the current goal injected, run the Judge helper, parse its structured
/// verdict, persist it, and (on pass) flip the goal to verified/complete.
async fn execute_judge_tool(
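The doc comment above describes a parse, persist, and flip sequence. A minimal in-memory sketch of the last three steps follows. Everything here is an assumption for illustration: `RawJudgeOutput`, `Goal`, `normalize_report`, and `record_verdict` are hypothetical, the fallback rules are one plausible reading of "safe fallback parsing", and the real code persists through `record_judge_verdict` in a database transaction rather than mutating a struct.

```rust
/// Raw Judge output where every field may be missing or malformed (hypothetical shape).
#[derive(Debug, Default, Clone)]
pub struct RawJudgeOutput {
    pub passed: Option<bool>,
    pub completeness_pct: Option<i64>,
    pub findings: Option<Vec<String>>,
    pub summary: Option<String>,
}

/// Normalized report with the four fields named in the commit message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct JudgeReport {
    pub passed: bool,
    pub completeness_pct: u8,
    pub findings: Vec<String>,
    pub summary: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GoalEvent {
    GoalCompleted,
    GoalStateUpdated,
}

/// Minimal in-memory stand-in for the persisted goal row.
#[derive(Debug, Default)]
pub struct Goal {
    pub complete: bool,
    pub judge_passed: bool,
    pub evidence: Option<String>,
    pub judge_findings: Vec<String>,
}

/// Turn possibly missing output into a report that never passes by accident.
pub fn normalize_report(raw: Option<RawJudgeOutput>) -> JudgeReport {
    let raw = raw.unwrap_or_default();
    let summary = raw.summary.unwrap_or_default().trim().to_string();
    // Out-of-range percentages are clamped into 0..=100.
    let completeness_pct = raw.completeness_pct.unwrap_or(0).clamp(0, 100) as u8;
    // Assumption: a pass only counts when it is explicit and comes with a summary.
    let passed = raw.passed.unwrap_or(false) && !summary.is_empty();
    JudgeReport {
        passed,
        completeness_pct,
        findings: raw.findings.unwrap_or_default(),
        summary: if summary.is_empty() {
            "Judge returned no usable summary".to_string()
        } else {
            summary
        },
    }
}

/// Store the verdict and, only on pass, mark the goal complete with the summary as evidence.
pub fn record_verdict(goal: &mut Goal, report: &JudgeReport) -> GoalEvent {
    goal.judge_passed = report.passed;
    goal.judge_findings = report.findings.clone();
    if report.passed {
        goal.complete = true;
        goal.evidence = Some(report.summary.clone());
        GoalEvent::GoalCompleted
    } else {
        GoalEvent::GoalStateUpdated
    }
}
```

The design point is that every failure mode degrades to "not passed", so a malformed Judge reply can delay completion but cannot grant it.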
    custom_subagent_tools,
);

// Inject the main-agent-only `agent_judge` acceptance tool on demand: only
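The injection rule hinted at by this comment, and stated in the PR description as "a goal exists and is not `Complete && judge_passed`", can be sketched as a single predicate. `GoalSnapshot` and `should_inject_judge` are hypothetical names, and the `is_main_agent` flag is an assumption about how the main-agent-only restriction might be expressed.

```rust
/// Simplified stand-in for the real goal status (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GoalStatus {
    Active,
    Paused,
    BudgetLimited,
    Complete,
}

/// The two goal fields the gate needs (hypothetical struct).
pub struct GoalSnapshot {
    pub status: GoalStatus,
    pub judge_passed: bool,
}

/// Offer `agent_judge` only to the main agent, and only while the goal is unverified.
pub fn should_inject_judge(is_main_agent: bool, goal: Option<&GoalSnapshot>) -> bool {
    match goal {
        Some(g) if is_main_agent => !(g.status == GoalStatus::Complete && g.judge_passed),
        // No goal, or the caller is a subagent: never expose the Judge tool.
        _ => false,
    }
}
```

Note that a goal in `Complete` without a Judge pass still gets the tool, which matches the legacy-compat case where completion has not been independently verified.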
@@ -601,6 +601,34 @@ pub async fn build_session_spec(
    .await
Automated review completed for this PR diff. No concrete inline issue was selected after aggregation.
…n-level elapsed tracking
  const SELECT_COLUMNS: &str = "id, thread_id, objective, status, token_budget, tokens_used, \
-     time_used_seconds, turns_used, max_turns, pause_reason, pause_detail, evidence, \
-     last_evaluated_run_id, created_at, updated_at";
+     turns_used, max_turns, pause_reason, pause_detail, evidence, \
[CRITICAL] Missing migration for new judge columns and removed time_used_seconds
The GoalRepo select and insert statements now reference columns that do not exist unless a database migration has been applied. No migration file is present in this batch; running against an existing database will cause sqlx errors.
Suggestion: Add an sqlx migration that drops time_used_seconds and adds the five judge columns with appropriate defaults (judge_passed=0, others NULL). Ensure it is applied before the new code runs.
Risk: Application will fail to start or crash during goal reads/writes if the migration is missing.
Confidence: 0.90
        .ok();
    return AgentToolResult::text(result_text);
}
if goal.status == crate::model::goal::GoalStatus::Complete && goal.judge_passed {
[HIGH] Judge tool does not require goal to be Active
A paused, budget-limited, or otherwise non-Active goal can be accepted by the Judge and marked complete via agent_judge, bypassing the lifecycle rules that previously prevented completion when not active.
Suggestion: Add a check at the start of execute_judge_tool that rejects the call unless goal.status is Active (or, more leniently, unless it is Active or Paused with a valid reason).
Risk: Goals that were paused or limited may be prematurely completed, leading to incorrect state transitions and confusing user experience.
Confidence: 0.95
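The reviewer's suggested guard could look roughly like the following. This is a sketch of the suggestion, not code from the PR: `ensure_judgeable` is a hypothetical helper, `GoalStatus` is simplified, and the stricter "Active only" variant of the suggestion is the one shown.

```rust
/// Simplified stand-in for the real goal status (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GoalStatus {
    Active,
    Paused,
    BudgetLimited,
    Complete,
}

/// Reject Judge acceptance for any goal that is not currently active, so a paused
/// or budget-limited goal cannot be flipped to complete through `agent_judge`.
pub fn ensure_judgeable(status: GoalStatus) -> Result<(), String> {
    match status {
        GoalStatus::Active => Ok(()),
        other => Err(format!(
            "agent_judge requires an active goal, but the goal is {other:?}"
        )),
    }
}
```

Called at the top of the acceptance flow, a check like this would return the error text to the main agent before any Judge run is started.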
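Overview
Change goal completion from self-attestation by the main agent (the `goal_scored` tool) to acceptance by an independent Judge subagent (`agent_judge`), removing the trust flaw of the main agent acting as "both player and referee".

Core changes

Judge subagent system
- `agent_judge` subagent configuration: depth=2, read-only file tools plus diagnostic shell (`git_status`/`git_diff`/`term_status`/`term_output`); all write tools are forbidden
- `src-tauri/src/core/subagent/judge_contract.rs`: `JudgeRequest`/`JudgeReport` structured protocol; `extract_judge_report` falls back safely on parse failure, empty summary, out-of-range values, and missing findings
- `templates/subagent/judge.md` and the output contract `output_contract.judge.md` (version 1)

Runtime orchestration
- `agent_session.rs`: inject `agent_judge` when a goal exists and is not (`Complete && judge_passed`)
- `agent_session_execution.rs`: add the full `execute_judge_tool` flow (inject goal context → run Judge → extract report → `record_judge_verdict` → emit `GoalCompleted` on pass, otherwise `GoalStateUpdated`)
- `resolve_delegation` returns an error for Judge, which also covers the parallel path

Data model and persistence
- Five fields: `judge_passed`/`judge_completeness`/`judge_findings`/`judge_summary`/`judge_evaluated_run_id`
- `20260607000000_goal_judge_fields.sql`: ALTER adds 5 columns; existing `status='complete'` rows are backfilled with `judge_passed=1, completeness=100`
- `goal_repo.rs`: add a transactional `record_judge_verdict` (on pass, the same transaction writes `status=complete` and `evidence=summary`)

Goal manager
- Remove the `GoalVerdict::Complete` variant, the `GOAL_SCORED_*` constants, `MISSING_EVIDENCE_PROMPT`, and `ChallengePromptVariant`
- `evaluate_after_run`: `Complete && judge_passed` stops continuation; `Complete && !judge_passed` logs a warning and still stops
- The continuation prompt instructs the main agent to call `agent_judge` and appends the most recent findings

Cleanup
- Remove the `goal_scored` tool definition and execution dispatch (zero references in runtime/frontend/gateway)
- Remove `GoalManager::mark_complete` (it wrote `status=complete` directly without setting `judge_passed`, producing an invalid combination; it had no production callers)
- `BuiltinSubagent`/`AnySubagent` doc comments: add the missing judge
- `GoalEvaluateResult.verdict`: remove the dead member `"complete"`

Motivation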
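Letting the main agent self-attest goal completion through `goal_scored` is a significant trust flaw: the AI may give itself a flattering score. An independent Judge subagent separates the accepting party from the executing party. The Judge must base its decision on actual file changes, git diff, and terminal output, and cannot get by on conversation context alone.
Design doc: `docs/goal-judge-evaluation-refactor.md`

Tests

Backend tests (all passing)
- `goal_lifecycle`: 28 tests (including `record_judge_verdict` pass/fail, migration backfill, and each evaluate branch)
- `cargo test --locked`: all passing

Frontend checks
- `npm run typecheck`: no errors
- `npm run test:unit`: 840 passed, 1 skipped

Formatting
- `cargo fmt --check`: clean

Breaking changes
None. Existing `status=complete` goals are backfilled with `judge_passed=1` by the migration; the `"complete"` literal removed from the frontend `GoalEvaluateResult` was never produced by the backend.

Related
Refs `docs/goal-judge-evaluation-refactor.md`

Checklist
- `cargo fmt --check`
- `cargo test --locked`: all passing
- `npm run typecheck`: no errors
- `npm run test:unit`: all passing
- Dead code removed (`goal_scored`, `mark_complete`, `GoalVerdict::Complete`, frontend `"complete"` literal)
- Doc comments updated (`BuiltinSubagent`/`AnySubagent` now include judge)

🤖 Generated with TiyCode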