feat(goal): ✨ Introduce a Judge acceptance agent to replace goal_scored self-attestation by HayWolf · Pull Request #224 · tiylabs/tiycode

HayWolf · 2026-06-07T04:41:01Z

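Overview

Moves goal-completion judgment from main-agent self-attestation (the goal_scored tool) to acceptance by an independent Judge subagent (agent_judge), removing the trust flaw of the main agent being "both player and referee".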

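Core changes

Judge subagent system

Add the agent_judge subagent configuration: depth=2, read-only file tools plus diagnostic shell (git_status/git_diff/term_status/term_output), with all write tools prohibited
Add src-tauri/src/core/subagent/judge_contract.rs: the JudgeRequest/JudgeReport structured protocol; extract_judge_report applies a safe fallback for parse failures, empty summaries, out-of-range values, and missing findings
Add the prompt template templates/subagent/judge.md and the output contract output_contract.judge.md (version 1)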
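The fallback behaviour described for extract_judge_report can be illustrated with a small sketch. This is not the code from judge_contract.rs: the `RawJudgeReport` type, the `normalize_judge_report` function, and the exact fallback policy (clamp the percentage, require a non-empty summary for a pass, synthesize a finding on an unexplained rejection) are assumptions made for illustration. Only the four `JudgeReport` field names come from the PR.

```rust
/// Structured verdict, using the field names listed in this PR.
#[derive(Debug, Clone, PartialEq)]
pub struct JudgeReport {
    pub passed: bool,
    pub completeness_pct: u8,
    pub findings: Vec<String>,
    pub summary: String,
}

/// Hypothetical loosely-typed shape of the Judge's decoded output,
/// before validation. `None` means the field was missing.
#[derive(Debug, Clone, Default)]
pub struct RawJudgeReport {
    pub passed: Option<bool>,
    pub completeness_pct: Option<i64>,
    pub findings: Vec<String>,
    pub summary: Option<String>,
}

/// Assumed policy: anything malformed degrades to "not passed".
/// `raw` is `None` when the Judge output could not be parsed at all.
pub fn normalize_judge_report(raw: Option<RawJudgeReport>) -> JudgeReport {
    let Some(raw) = raw else {
        // Parse failure: never treat unreadable output as acceptance.
        return JudgeReport {
            passed: false,
            completeness_pct: 0,
            findings: vec!["Judge output could not be parsed.".to_string()],
            summary: "No usable verdict was produced.".to_string(),
        };
    };

    let summary = raw.summary.unwrap_or_default().trim().to_string();
    // Clamp out-of-range percentages into 0..=100.
    let completeness_pct = raw.completeness_pct.unwrap_or(0).clamp(0, 100) as u8;
    // Drop blank findings.
    let mut findings: Vec<String> = raw
        .findings
        .into_iter()
        .map(|f| f.trim().to_string())
        .filter(|f| !f.is_empty())
        .collect();

    // A pass needs an explicit `passed: true` and a non-empty summary.
    let passed = raw.passed == Some(true) && !summary.is_empty();

    // A rejection should always say why, even if the Judge did not.
    if !passed && findings.is_empty() {
        findings.push("Judge rejected the goal without listing findings.".to_string());
    }

    let summary = if summary.is_empty() {
        "No summary provided.".to_string()
    } else {
        summary
    };

    JudgeReport { passed, completeness_pct, findings, summary }
}
```

Under this policy, the only route to `passed == true` is a well-formed report, which matches the reviewer's later note that the contract "never returns passed without evidence".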

Runtime orchestration

agent_session.rs: inject agent_judge when a goal exists and is not (Complete && judge_passed)
agent_session_execution.rs: add the full execute_judge_tool flow (inject goal context → run Judge → extract report → record_judge_verdict → emit GoalCompleted if passed, otherwise GoalStateUpdated)
Hard rejection of recursive delegation: resolve_delegation returns an error for Judge, covering the parallel path
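The three orchestration rules above can be sketched as pure functions. The types and signatures below are simplified stand-ins, not the ones in agent_session.rs or the orchestrator: `Goal`, `DelegationTarget`, and the function shapes are assumptions, while the conditions and event names follow the PR description.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GoalStatus {
    Active,
    Paused,
    BudgetLimited,
    Complete,
}

/// Simplified stand-in for the goal record (hypothetical shape).
pub struct Goal {
    pub status: GoalStatus,
    pub judge_passed: bool,
}

#[derive(Debug, PartialEq)]
pub enum GoalEvent {
    GoalCompleted,
    GoalStateUpdated,
}

/// Hypothetical set of built-in delegation targets.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DelegationTarget {
    Explore,
    Review,
    Judge,
}

/// `agent_judge` is offered only while a goal exists and is not yet
/// verified, i.e. not (Complete && judge_passed).
pub fn should_inject_judge(goal: Option<&Goal>) -> bool {
    match goal {
        Some(g) => !(g.status == GoalStatus::Complete && g.judge_passed),
        None => false,
    }
}

/// A passing verdict emits GoalCompleted; anything else only
/// refreshes the goal state.
pub fn event_for_verdict(passed: bool) -> GoalEvent {
    if passed {
        GoalEvent::GoalCompleted
    } else {
        GoalEvent::GoalStateUpdated
    }
}

/// Judge is never a valid delegation target, so subagents (including
/// parallel fan-out) cannot recurse into it.
pub fn resolve_delegation(target: DelegationTarget) -> Result<DelegationTarget, String> {
    match target {
        DelegationTarget::Judge => {
            Err("agent_judge cannot be used as a delegation target".to_string())
        }
        other => Ok(other),
    }
}
```

Keeping the injection condition as a single predicate makes the "unverified goal" rule easy to unit-test independently of session construction.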

Data model and persistence

GoalRecord/GoalDto/GoalPayload gain five fields: judge_passed/judge_completeness/judge_findings/judge_summary/judge_evaluated_run_id
Migration 20260607000000_goal_judge_fields.sql: ALTER adds 5 columns; existing rows with status='complete' are backfilled with judge_passed=1, completeness=100
goal_repo.rs: add the transactional record_judge_verdict (when passed, writes status=complete + evidence=summary in the same transaction)
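The state change that record_judge_verdict performs can be modelled in memory. This sketch is not the repository code (which runs inside a database transaction): the `JudgeVerdict` struct, the `apply_judge_verdict` name, and the field types are assumptions. It returns a fresh record rather than mutating in place, as a rough analogue of all-or-nothing transactional semantics.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GoalStatus {
    Active,
    Complete,
}

/// Reduced goal record: only the fields relevant to a verdict.
/// Field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct GoalRecord {
    pub status: GoalStatus,
    pub evidence: Option<String>,
    pub judge_passed: bool,
    pub judge_completeness: Option<u8>,
    pub judge_findings: Vec<String>,
    pub judge_summary: Option<String>,
    pub judge_evaluated_run_id: Option<String>,
}

/// Hypothetical input bundle for one Judge run.
pub struct JudgeVerdict {
    pub passed: bool,
    pub completeness: u8,
    pub findings: Vec<String>,
    pub summary: String,
    pub run_id: String,
}

/// Returns the record as it would look after the verdict is stored.
/// The judge_* fields are always written; status and evidence change
/// only on a pass, and they change together.
pub fn apply_judge_verdict(goal: &GoalRecord, verdict: &JudgeVerdict) -> GoalRecord {
    let mut next = goal.clone();
    next.judge_passed = verdict.passed;
    next.judge_completeness = Some(verdict.completeness);
    next.judge_findings = verdict.findings.clone();
    next.judge_summary = Some(verdict.summary.clone());
    next.judge_evaluated_run_id = Some(verdict.run_id.clone());
    if verdict.passed {
        next.status = GoalStatus::Complete;
        next.evidence = Some(verdict.summary.clone());
    }
    next
}
```

Writing status and judge_passed in one step is what rules out the invalid "complete but never judged" combination that the removed mark_complete could produce.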

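Goal manager

Remove the GoalVerdict::Complete variant, the GOAL_SCORED_* constants, MISSING_EVIDENCE_PROMPT, and ChallengePromptVariant
evaluate_after_run: Complete && judge_passed stops continuation; Complete && !judge_passed logs a warning and still stops
The continuation prompt now references agent_judge and appends the most recent findings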
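The stop condition and the continuation prompt can be sketched as follows. This reduces evaluate_after_run to the two branches named above and omits the paused and budget-limited paths. The `Decision` enum, both function names, and the prompt wording are invented for illustration and are not taken from goal_manager.rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GoalStatus {
    Active,
    Complete,
}

/// Hypothetical outcome of the post-run evaluation.
#[derive(Debug, PartialEq)]
pub enum Decision {
    /// Verified by the Judge: stop continuing.
    Stop,
    /// Complete without a Judge pass: unexpected, so warn, but still stop.
    StopWithWarning,
    /// Goal still open: schedule another run.
    Continue,
}

pub fn decide_after_run(status: GoalStatus, judge_passed: bool) -> Decision {
    match (status, judge_passed) {
        (GoalStatus::Complete, true) => Decision::Stop,
        (GoalStatus::Complete, false) => Decision::StopWithWarning,
        _ => Decision::Continue,
    }
}

/// Builds a continuation prompt that points the main agent at
/// `agent_judge` and carries over the latest rejection findings.
/// The wording here is a placeholder.
pub fn continuation_prompt(objective: &str, recent_findings: &[String]) -> String {
    let mut prompt = format!(
        "Continue working toward the goal: {objective}\n\
         When you believe it is done, call `agent_judge` for acceptance."
    );
    if !recent_findings.is_empty() {
        prompt.push_str("\nUnresolved findings from the last Judge run:");
        for finding in recent_findings {
            prompt.push_str("\n- ");
            prompt.push_str(finding);
        }
    }
    prompt
}
```

Feeding the previous findings back into the prompt gives the main agent a concrete to-do list after a rejection instead of a bare "try again".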

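Cleanup

Remove the goal_scored tool definition and execution dispatch (zero references in runtime/frontend/gateway)
Remove GoalManager::mark_complete (it wrote status=complete directly without setting judge_passed, producing an invalid combination, and had no production callers)
Fix the BuiltinSubagent/AnySubagent doc comments that omitted judge
Frontend: remove the dead "complete" member from GoalEvaluateResult.verdict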

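Motivation

Having the main agent self-attest goal completion via goal_scored is a significant trust flaw: the AI may hand itself a flattering score. An independent Judge subagent separates the acceptor from the executor. The Judge must base its verdict on actual file changes, git diff, and terminal output, and cannot get by on conversation context alone.

Design doc: docs/goal-judge-evaluation-refactor.md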

Testing

Backend tests (all passing)

goal_lifecycle: 28 tests (including record_judge_verdict pass/fail, migration backfill, and each evaluate branch)
subagent module: 73 tests (including 10 for judge_contract and 4 Judge tests in runtime_orchestration)
Full suite: cargo test --locked all passing

Frontend checks

npm run typecheck: no errors
npm run test:unit: 840 passed, 1 skipped

Formatting

cargo fmt --check: clean

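Breaking changes

None. Existing status=complete goals are backfilled with judge_passed=1 by the migration; the "complete" literal removed from the frontend GoalEvaluateResult was never produced by the backend.

Related

Refs docs/goal-judge-evaluation-refactor.md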

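Checklist

Code style passes cargo fmt --check
Backend tests: cargo test --locked all passing
Frontend: npm run typecheck reports no errors
Frontend: npm run test:unit all passing
Dead code cleaned up (goal_scored, mark_complete, GoalVerdict::Complete, frontend "complete" literal)
Doc comments updated (BuiltinSubagent/AnySubagent now include judge)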

🤖 Generated with TiyCode

…udge acceptance agent

Remove the `goal_scored` tool that allowed the main agent to self-attest goal completion, replacing it with an `agent_judge` built-in subagent that independently verifies goal attainment against the project's current state.

Key changes:
- Add `SubagentProfile::Judge` with read-only file tools and diagnostic-only shell (soft constraint via prompt)
- Add `JudgeReport` structured contract (passed, completeness_pct, findings, summary) with safe fallback parsing
- Add `agent_judge` tool injection only for the main agent when an unverified goal exists; runtime gate blocks subagent/parallel recursion into Judge
- Add DB migration for `judge_passed`, `judge_completeness`, `judge_findings`, `judge_summary`, `judge_evaluated_run_id` columns with backfill for legacy `status='complete'` goals
- Replace continuation stop condition: `Complete && judge_passed` instead of `goal_scored`-driven status flip
- Rewrite continuation prompt to instruct main agent to call `agent_judge` and follow findings on rejection
- Add Judge prompt surface, templates, and output contract
- Update `active_goal.tpl.md` to reflect Judge acceptance flow
- Extend goal lifecycle tests for Judge pass/fail/legacy compat

Remove the mark_complete pathway from goals as completion will be handled through a different mechanism:
- Remove mark_complete method from GoalManager
- Remove "complete" from GoalEvaluateResult verdict type
- Remove mark_complete test cases (evidence validation, etc.)
- Update subagent surface comments to include judge

BREAKING CHANGE: GoalEvaluateResult.verdict no longer includes "complete"

github-actions · 2026-06-07T04:47:15Z

AI Code Review Summary

PR: #224 (feat(goal): ✨ Introduce a Judge acceptance agent to replace goal_scored self-attestation)
Preferred language: English

Overall Assessment

Detected 2 actionable findings; prioritize CRITICAL/HIGH before merge.

Major Findings by Severity

CRITICAL (1)
- src-tauri/src/persistence/repo/goal_repo.rs:8 - Missing migration for new judge columns and removed time_used_seconds
HIGH (1)
- src-tauri/src/core/agent_session_execution.rs:1668 - Judge tool does not require goal to be Active

Actionable Suggestions

Add a database migration for the new goal schema (drop time_used_seconds, add judge_* columns).
Add an 'Active' status guard in execute_judge_tool to prevent acceptance of non-active goals.
Coordinate with frontend to update goal UI to handle the removal of time_used_seconds.
Run cargo check and npm run typecheck on the full workspace to catch any broken references from the removals of get_run_elapsed_seconds, get_active_run_elapsed_seconds, and timeUsedSeconds.
Search for timeUsedSeconds in src/**/*.{ts,tsx} and remove or update all accesses.
Verify that the Judge subagent templates (judge.md, output_contract.judge.md) exist and are properly formatted; include integration tests that invoke the judge flow.
Review the implementation of record_judge_verdict in goal_repo.rs (not in this diff) to ensure it uses parameterized queries and properly sanitizes inputs.
Add judge fields (judgePassed, judgeCompleteness, judgeFindings, judgeSummary, judgeEvaluatedRunId) to the makeGoalPayload helper in agent-commands.test.ts.
Write a unit test for goalGetState that mocks a full payload including judge fields and asserts they are stored in the GoalStoreState correctly.
Consider adding a unit test for validate_slug to explicitly check that 'judge' is reserved.

Potential Risks

Without migration, the application will crash on goal queries.
Paused or budget-limited goals could be inadvertently completed by the Judge.
Frontend may display errors or missing data for goal elapsed time.
Compilation failure from removed Rust functions if any call site remains outside the diff.
Runtime undefined errors in goal-time UI components if timeUsedSeconds is still read.
The migration SQL in the test may not match the actual migration file, causing the test to pass while the real migration is incomplete.
The judge verdict recording function is not part of this batch; if improperly implemented, it could be susceptible to SQL injection or data corruption.
If the frontend store update logic silently drops judge fields because of a missing selector or incorrect field name, users will never see the verified status or findings in the UI, even though the backend correctly persists them.

Test Suggestions

Write integration tests for the new judge execution flow (positive/negative).
Test goal_repo::record_judge_verdict with both passed=true and false.
Add an integration test that triggers a full judge cycle: work → agent_judge invocation → record_judge_verdict pass/fail.
Ensure cargo test --manifest-path src-tauri/Cargo.toml passes with the new goal_lifecycle tests.
Run npm run test:unit to confirm frontend helper changes do not break existing tests.
Add integration tests for the record_judge_verdict function with malicious inputs to verify safe data handling.
GoalStoreState update: test that when goalGetState returns a payload with judgePassed: true and judgeCompleteness: 100, the corresponding store slice reflects those values.
Subagent slug reservation: add a test in subagent.rs that validate_slug("judge") returns an error (already part of the reserved list).
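The slug-reservation check suggested above could look roughly like this. The real validate_slug signature and reserved list in subagent.rs are not shown on this page, so the `Result<(), String>` return type and the three-entry list (the built-in explore, review, and judge profiles) are assumptions.

```rust
/// Assumed reserved list: the built-in subagent profile names.
const RESERVED_SLUGS: [&str; 3] = ["explore", "review", "judge"];

/// Hypothetical signature: rejects empty slugs and names that would
/// collide with a built-in subagent.
pub fn validate_slug(slug: &str) -> Result<(), String> {
    if slug.is_empty() {
        return Err("slug must not be empty".to_string());
    }
    if RESERVED_SLUGS.contains(&slug) {
        return Err(format!("'{slug}' is reserved for a built-in subagent"));
    }
    Ok(())
}
```

A unit test for the suggestion then reduces to asserting that `validate_slug("judge")` returns `Err` while an ordinary custom slug still passes.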

File-Level Coverage Notes

src-tauri/src/core/agent_session_execution.rs: Major new judge execution logic added, old goal_scored removed. The judge flow is well-structured but lacks an active-goal guard. Otherwise appears correct. (The judge tool guard in execute_helper_tool properly prevents subagent from calling it.)
src-tauri/src/core/agent_session.rs: Injection logic for agent_judge tool is correct for verified/unverified goals, but does not consider active status, which could be later refined.
src-tauri/src/core/agent_session_tools.rs: Removed old goal_scored tool definition and added mapping for Judge profile and model role. Safe changes.
src-tauri/src/core/agent_run_manager.rs: Removed planning-run time accounting and goal state event emission. The removal aligns with new judge-based system but may lose accumulated time for frontend. (The goal_repo import was removed as no longer needed.)
src-tauri/src/core/agent_run_event_handler.rs: All pause tracking calls removed, simplifying event handling. No longer needed with new goal model.
src-tauri/src/commands/agent.rs: Removed pause time accounting before pausing goal; consistent with new logic.
src-tauri/src/core/app_state.rs: Removed pause tracking state and tests. Clean removal.
src-tauri/src/core/goal_manager.rs: Adapted prompt templates and evaluation logic to use agent_judge. Removed old complete verdict and accounting. Challenge prompt now includes continuation. Good. (The warning for Complete !judge_passed is a nice fallback.)
src-tauri/src/persistence/repo/goal_repo.rs: Schema updated to include judge columns and drop time_used_seconds. Transactional verdict recording is safe. However, missing migration is critical.
src-tauri/src/model/goal.rs: Model updated to new fields; removal of time_used_seconds may break frontend.
src-tauri/src/core/subagent/runtime_orchestration.rs: Added Judge variant with proper read-only tools, delegation constraints, and tests. Bumped builtin max depth to 5. Well-implemented. (The depth increase is global; confirm it doesn't enable unintended deep chains.)
src-tauri/src/core/subagent/orchestrator.rs: Added prohibition of Judge as a delegation target and mapping for PromptSurface. Minor test update.
src-tauri/src/core/subagent/judge_contract.rs: New file with robust parsing, normalization, and tests. Good defensive design. (Excellent fallback when JSON cannot be parsed – never returns passed without evidence.)
src-tauri/src/core/subagent/mod.rs: Added judge_contract module.
src-tauri/src/gateway/gateway_runner.rs: Updated kickoff prompt to reference agent_judge. No risk.
src-tauri/src/ipc/frontend_channels.rs: Comment update only.
src-tauri/src/core/prompt/sources/custom_subagent_body.rs: No tests exist for this file; the new Judge body sourcing logic is not independently tested, which is acceptable for template‐based code.
src-tauri/src/core/prompt/sources/subagent_output_contract.rs: No tests for the new Judge output contract template; low risk due to simple template inclusion.
src-tauri/src/core/prompt/surface.rs: No tests in this file; the addition of SubagentJudge variant is covered by pattern-matching tests in surface_extensions.rs.
src-tauri/src/core/prompt/surface_extensions.rs: Tests were updated to include the Judge variant in the surface extensions, covering pattern matching and is_subagent checks.
... and 11 more file-level entries.

Inline Downgraded Items (processed but not inline)

None

Coverage Status

Target files: 31
Covered files: 31
Uncovered files: 0
No-patch/binary covered as file-level: 0
Findings with unknown confidence (N/A): 0

Uncovered list:

None

No-patch covered list:

None

Runtime/Budget

Rounds used: 2/4
Planned batches: 2
Executed batches: 2
Sub-agent runs: 5
Planner calls: 2
Reviewer calls: 6
Model calls: 8/64
Structured-output summary-only degradation: NO

github-actions

Automated PR review completed.

Findings kept: 4
Findings with unknown confidence: 0
Inline comments attempted: 4
Target files: 24
Covered files: 24
Uncovered files: 0
See the summary comment for detailed analysis and coverage details.

 export type GoalEvaluateResult = {
  goal: GoalPayload;
-  verdict: "continue" | "challenge_evidence" | "complete" | "paused" | "budget_limited";
+  verdict: "continue" | "challenge_evidence" | "paused" | "budget_limited" | "skipped";


      case "paused": return "goal.status.paused";
      case "budget_limited": return "goal.status.budgetLimited";
-      case "complete": return "goal.status.complete";
+      case "complete": return goal.judgePassed ? "goal.status.verified" : "goal.status.complete";


  pauseDetail?: string | null;
  evidence?: string | null;
  lastEvaluatedRunId?: string | null;
+  judgePassed?: boolean;


  pauseDetail?: string | null;
  evidence?: string | null;
  lastEvaluatedRunId?: string | null;
+  judgePassed?: boolean;


Update the feature descriptions and reorder the bullet points in both README.md and README_zh.md to better reflect the current product capabilities and improve readability.

Changes include:
- Reordering features to highlight persistent goal management, real-time streaming, and extensibility earlier in the list
- Updating descriptions for several features to be more accurate
- Maintaining consistency between English and Chinese versions
- Keeping the overall structure while improving flow

These are documentation-only changes that do not affect functionality.

- Extract inline status key resolution into a pure exported function so the complete→verified (judgePassed) branch can be unit-tested without mounting the component
- Add unit tests covering all status mappings and judgePassed variants
- Add test for skipped verdict passthrough in goalEvaluate

github-actions

Automated PR review completed.

Findings kept: 4
Findings with unknown confidence: 0
Inline comments attempted: 4
Target files: 26
Covered files: 26
Uncovered files: 0
See the summary comment for detailed analysis and coverage details.

                    },
                }))
            }
+            Some(SubagentProfile::Judge) => {


                (EXPLORE_TEMPLATE_REL_PATH, EXPLORE_TEMPLATE_EMBEDDED)
            }
            Some(SubagentProfile::Review) => (REVIEW_TEMPLATE_REL_PATH, REVIEW_TEMPLATE_EMBEDDED),
+            Some(SubagentProfile::Judge) => (JUDGE_TEMPLATE_REL_PATH, JUDGE_TEMPLATE_EMBEDDED),


            RuntimeOrchestrationTool::Parallel => {
                return Err("agent_parallel cannot be used as an individual helper".to_string());
            }
+            RuntimeOrchestrationTool::Judge => {


            return Err("agent_parallel cannot be used as an individual helper".to_string());
        }

+        if tool == RuntimeOrchestrationTool::Judge {


Raise `BUILTIN_DEFAULT_MAX_DELEGATION_DEPTH` from 3 to 5 to match the existing `GLOBAL_MAX_DELEGATION_DEPTH`, allowing built-in subagents (explore/review) to be delegated to the same depth as custom profiles. Update delegation validation tests to reflect the new depth limits.

github-actions

Automated PR review completed.

Findings kept: 5
Findings with unknown confidence: 0
Inline comments attempted: 5
Target files: 26
Covered files: 26
Uncovered files: 0
See the summary comment for detailed analysis and coverage details.

+    /// Run the main-agent-only `agent_judge` acceptance flow: build a Judge task
+    /// with the current goal injected, run the Judge helper, parse its structured
+    /// verdict, persist it, and (on pass) flip the goal to verified/complete.
+    async fn execute_judge_tool(


 export type GoalEvaluateResult = {
  goal: GoalPayload;
-  verdict: "continue" | "challenge_evidence" | "complete" | "paused" | "budget_limited";
+  verdict: "continue" | "challenge_evidence" | "paused" | "budget_limited" | "skipped";


+    /// Run the main-agent-only `agent_judge` acceptance flow: build a Judge task
+    /// with the current goal injected, run the Judge helper, parse its structured
+    /// verdict, persist it, and (on pass) flip the goal to verified/complete.
+    async fn execute_judge_tool(


+        custom_subagent_tools,
+    );
+
+    // Inject the main-agent-only `agent_judge` acceptance tool on demand: only


                    },
                }))
            }
+            Some(SubagentProfile::Judge) => {


github-actions

Automated PR review completed.

Findings kept: 0
Findings with unknown confidence: 0
Inline comments attempted: 1
Target files: 26
Covered files: 26
Uncovered files: 0
See the summary comment for detailed analysis and coverage details.

github-actions · 2026-06-07T07:09:08Z

@@ -601,6 +601,34 @@ pub async fn build_session_spec(
        .await


Automated review completed for this PR diff. No concrete inline issue was selected after aggregation.

github-actions

Automated PR review completed.

Findings kept: 2
Findings with unknown confidence: 0
Inline comments attempted: 2
Target files: 31
Covered files: 31
Uncovered files: 0
See the summary comment for detailed analysis and coverage details.

github-actions · 2026-06-07T10:01:20Z

 const SELECT_COLUMNS: &str = "id, thread_id, objective, status, token_budget, tokens_used, \
-    time_used_seconds, turns_used, max_turns, pause_reason, pause_detail, evidence, \
-    last_evaluated_run_id, created_at, updated_at";
+    turns_used, max_turns, pause_reason, pause_detail, evidence, \


[CRITICAL] Missing migration for new judge columns and removed time_used_seconds

The GoalRepo select and insert statements now reference columns that do not exist unless a database migration has been applied. No migration file is present in this batch; running against an existing database will cause sqlx errors.

Suggestion: Add an sqlx migration that drops time_used_seconds and adds the five judge columns with appropriate defaults (judge_passed=0, others NULL). Ensure it is applied before the new code runs.

Risk: Application will fail to start or crash during goal reads/writes if the migration is missing.

Confidence: 0.90

[From SubAgent: general]

github-actions · 2026-06-07T10:01:20Z

-                    .ok();
-                    return AgentToolResult::text(result_text);
-                }
+        if goal.status == crate::model::goal::GoalStatus::Complete && goal.judge_passed {


[HIGH] Judge tool does not require goal to be Active

A paused, budget-limited, or otherwise non-Active goal can be accepted by the Judge and marked complete via agent_judge, bypassing the lifecycle rules that previously prevented completion when not active.

Suggestion: Add a check at the start of execute_judge_tool that rejects the call unless goal.status is Active (or, more leniently, Active or Paused with a valid reason).

Risk: Goals that were paused or limited may be prematurely completed, leading to incorrect state transitions and confusing user experience.

Confidence: 0.95

[From SubAgent: general]
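A minimal sketch of the guard this finding asks for, assuming the stricter reading (only Active goals may be judged). The `ensure_judgeable` name and the error text are hypothetical. The `GoalStatus` variants mirror the statuses mentioned in this review, not necessarily the full enum in model/goal.rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GoalStatus {
    Active,
    Paused,
    BudgetLimited,
    Complete,
}

/// Hypothetical pre-check for the start of `execute_judge_tool`:
/// refuse to run the Judge unless the goal is currently Active, so a
/// paused or budget-limited goal cannot be flipped to complete.
pub fn ensure_judgeable(status: GoalStatus) -> Result<(), String> {
    match status {
        GoalStatus::Active => Ok(()),
        other => Err(format!(
            "agent_judge requires an Active goal, but the goal is {other:?}"
        )),
    }
}
```

Returning the rejection as a tool error, rather than silently skipping the Judge, would let the main agent see why acceptance was refused.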

jorben added 2 commits June 7, 2026 11:54

github-actions Bot reviewed Jun 7, 2026

jorben added 2 commits June 7, 2026 12:47

github-actions Bot reviewed Jun 7, 2026

jorben added 2 commits June 7, 2026 14:59

docs: 📝 remove obsolete design document

c15e885

github-actions Bot reviewed Jun 7, 2026

jorben added 2 commits June 7, 2026 17:03

docs(judge): 📝 add size-first verification strategy and delegation gu…

d60daec

…idelines

refactor(goal): ♻️ remove goal-level time_used_seconds in favor of ru…

dc8fca0

…n-level elapsed tracking

github-actions Bot reviewed Jun 7, 2026

jorben merged commit 6c0f5aa into master Jun 7, 2026
4 checks passed

jorben deleted the refact/goal-llm-judugement branch June 7, 2026 12:11
