Context

Today, every call to Project.analyze_sources() re-analyzes the entire repo from scratch. For a 5,000-file codebase this is the difference between an interactive workflow and a coffee break — and after the MCP server lands, agents will be calling index_repo repeatedly during a session as they edit code.
This ticket adds file-hash-based incremental indexing: track per-file content hashes in Redis, diff against the current state on each index call, and only re-analyze changed files.
Builds on T17's per-branch storage (each branch tracks its own hash map).
Scope (in)
Hash storage — track per-file content hashes in Redis under {repo}:{branch}_files (a Redis hash, field per file path → SHA256). Persisted at the end of every full or incremental index.
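The storage layout above can be sketched with redis-py. A minimal sketch: only the {repo}:{branch}_files key layout comes from this ticket; the helper names and the 64 KiB chunk size are assumptions.

```python
import hashlib

def file_hash_key(repo: str, branch: str) -> str:
    # Key layout from the ticket: one Redis hash per (repo, branch),
    # field = file path, value = SHA-256 hex digest of the file's bytes.
    return f"{repo}:{branch}_files"

def sha256_of_file(path: str) -> str:
    # Hash the file in chunks so large files aren't loaded into memory at once.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def store_file_hashes(r, repo: str, branch: str, hashes: dict[str, str]) -> None:
    # `r` is a redis-py client; HSET with a mapping writes every field
    # in a single round trip.
    r.hset(file_hash_key(repo, branch), mapping=hashes)

def load_file_hashes(r, repo: str, branch: str) -> dict[str, str]:
    raw = r.hgetall(file_hash_key(repo, branch))  # {} if the key doesn't exist
    return {k.decode(): v.decode() for k, v in raw.items()}
```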
Project.analyze_sources(incremental=True) — walk the file tree, compute current hashes, diff against stored hashes:
Unchanged files → skip the analyzer entirely
Modified files → call existing delete_files() to remove old graph entities for this file, then re-run the analyzer (first pass) on just these files
Deleted files → call delete_files() only
New files → analyze normally
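The four-way classification above is a pure dict diff over the stored and current hash maps. A minimal sketch (names hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class HashDiff:
    added: list[str] = field(default_factory=list)      # new files → analyze normally
    modified: list[str] = field(default_factory=list)   # delete_files() then re-analyze
    deleted: list[str] = field(default_factory=list)    # delete_files() only
    unchanged: list[str] = field(default_factory=list)  # skip the analyzer entirely

def diff_hashes(stored: dict[str, str], current: dict[str, str]) -> HashDiff:
    d = HashDiff()
    for path, digest in current.items():
        if path not in stored:
            d.added.append(path)
        elif stored[path] != digest:
            d.modified.append(path)
        else:
            d.unchanged.append(path)
    # Paths present before but absent now were deleted (or renamed,
    # which this ticket treats as delete + add).
    d.deleted = [p for p in stored if p not in current]
    return d
```

Keeping the classification in one pure function also satisfies the note below that the hash diff should be the only place deciding what to re-analyze.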
Second pass (LSP symbol resolution) — for v1, safe correctness wins: if any file changed, run the second pass over the entire branch graph. Per-file second-pass optimization is deferred.
Persist the new hash map to Redis at the end (atomic — old map stays until new one is fully written).
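One way to get the "old map stays until the new one is fully written" guarantee is to stage the new map under a temporary key and RENAME it over the live key; RENAME is atomic in Redis. A sketch assuming a redis-py client (the staging-key suffix is an assumption):

```python
def persist_hashes_atomically(r, key: str, hashes: dict[str, str]) -> None:
    # Stage the full map under a temp key, then RENAME it over the live key.
    # Readers see either the old complete map or the new complete map,
    # never a half-written one.
    if not hashes:
        r.delete(key)  # nothing to stage; RENAME from a missing key would error
        return
    tmp = f"{key}:staging"
    pipe = r.pipeline(transaction=True)
    pipe.delete(tmp)               # clear any leftover staging key from an aborted run
    pipe.hset(tmp, mapping=hashes)
    pipe.rename(tmp, key)          # atomic swap
    pipe.execute()
```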
Project API — expose was_incremental: bool and files_changed: list[str] for callers.
CLI — cgraph index . defaults to incremental when a graph already exists for (project, branch); new --full flag forces a full re-index.
MCP tool — index_repo(..., incremental=True) is the default (consumed by [MCP T4] index_repo MCP tool #652); response includes mode: "full"|"incremental" and files_changed: list[str].
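Taken together, the CLI and MCP defaults reduce to a single mode decision. A sketch of that rule (the function name is hypothetical):

```python
def resolve_index_mode(graph_exists: bool, force_full: bool = False,
                       incremental: bool = True) -> str:
    # --full (or incremental=False at the API level) always forces a full
    # re-index; otherwise incremental applies only when a graph already
    # exists for (project, branch).
    if force_full or not incremental or not graph_exists:
        return "full"
    return "incremental"
```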
Edge cases handled
First-time indexing of a branch → falls back to full
Hash store missing or corrupted → falls back to full with a warning logged to stderr
File renames → treated as delete + add (rename detection deferred to Phase 2)
Aborted previous run leaving stale hashes → next full run overwrites
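The first two fallbacks can live in one defensive loader that returns None to mean "do a full index". A sketch, assuming a raw-bytes redis-py client (function name hypothetical):

```python
import sys

def load_hashes_or_none(r, key: str):
    """Return the stored hash map, or None to signal 'do a full index'."""
    try:
        raw = r.hgetall(key)
    except Exception as exc:  # store unreadable
        print(f"warning: file-hash store {key!r} unreadable ({exc}); "
              f"falling back to full re-index", file=sys.stderr)
        return None
    if not raw:
        return None  # first-time indexing of this branch: no hash store yet
    try:
        return {k.decode(): v.decode() for k, v in raw.items()}
    except (AttributeError, UnicodeDecodeError) as exc:  # corrupted entries
        print(f"warning: file-hash store {key!r} corrupted ({exc}); "
              f"falling back to full re-index", file=sys.stderr)
        return None
```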
Acceptance criteria

Index fixture → re-index with no changes → second run reports mode=incremental, files_changed=[] and is significantly faster (assert via analyzer-call-count, not wall clock).
Modify one file → re-index → only that file's entities are deleted+re-added; other entities untouched (verify by node-id snapshot diff).
Delete a file → re-index → its entities are removed from the graph.
Add a new file → re-index → its entities appear.
First run on a fresh branch automatically falls back to full (no hash store yet).
--full CLI flag forces full re-index even when graph exists.
Corrupted hash store → falls back to full with a warning logged.
MCP index_repo integration test exercises an unchanged → modified → deleted → added sequence end-to-end.
Existing full-index tests still pass (incremental is opt-in at the API level, even if CLI defaults to it).
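For the "significantly faster" criterion, counting analyzer invocations keeps the test deterministic where wall-clock timing would flake. The pattern, shown against a stub (StubProject and its attributes mirror the ticket's Project API but are stand-ins, not the real class):

```python
from unittest.mock import MagicMock

class StubProject:
    # Stand-in for the real Project, just to demonstrate the assertion
    # pattern: count analyzer calls instead of timing the run.
    def __init__(self, changed_files):
        self._changed = list(changed_files)

    def analyze_sources(self, analyzer, incremental=True):
        for path in self._changed:   # only changed files reach the analyzer
            analyzer(path)
        self.was_incremental = incremental
        self.files_changed = list(self._changed)

analyzer_spy = MagicMock()
project = StubProject(changed_files=[])
project.analyze_sources(analyzer_spy)

assert project.was_incremental is True
assert project.files_changed == []
assert analyzer_spy.call_count == 0  # no changes → analyzer never invoked
```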
Notes for the implementer

Use SHA256 over file bytes (not mtime) — mtime is unreliable across git checkouts and CI environments.
The hash diff should be the only place that decides what to re-analyze. Don't sprinkle incremental logic deep into individual analyzers; orchestrate it in source_analyzer.py.
Be careful with delete_files() — it must remove all graph entities tied to a file (Functions, Classes, edges) without leaving orphans. Verify with a node-count assertion in the test.
The second-pass-over-everything decision is intentional for v1. Don't try to be clever here; the goal is correctness, and the first pass is where most of the win is.
When the hash store is missing/corrupted, log clearly to stderr so users notice and aren't surprised by a slow "incremental" run.
Scope (out)

Rename detection — renames are treated as delete + add for now (deferred to Phase 2).
Per-file second-pass optimization — v1 re-runs the LSP second pass over the entire branch graph.

Files

api/project.py (new incremental flag, hash diff orchestration, was_incremental / files_changed attributes)
api/info.py (new file-hash get/set helpers under {repo}:{branch}_files)
api/analyzers/source_analyzer.py (incremental orchestration over the changed-file set)
api/cli.py (--full flag on index and index-repo)
api/mcp/tools/structural.py (consume incremental flag; report mode + files_changed)
tests/test_incremental_indexing.py
Dependencies

Builds on T17's per-branch storage (each branch tracks its own hash map).