MCP T18 — Incremental indexing (file-hash based) #665

@DvirDukhan

Context

Today, every call to Project.analyze_sources() re-analyzes the entire repo from scratch. For a 5,000-file codebase this is the difference between an interactive workflow and a coffee break — and after the MCP server lands, agents will be calling index_repo repeatedly during a session as they edit code.

This ticket adds file-hash-based incremental indexing: track per-file content hashes in Redis, diff against the current state on each index call, and only re-analyze changed files.

Builds on T17's per-branch storage (each branch tracks its own hash map).

Scope (in)

  1. Hash storage — track per-file content hashes in Redis under {repo}:{branch}_files (a Redis hash, field per file path → SHA256). Persisted at the end of every full or incremental index.
  2. Project.analyze_sources(incremental=True) — walk the file tree, compute current hashes, diff against stored hashes:
    • Unchanged files → skip the analyzer entirely
    • Modified files → call existing delete_files() to remove old graph entities for this file, then re-run the analyzer (first pass) on just these files
    • Deleted files → call delete_files() only
    • New files → analyze normally
  3. Second pass (LSP symbol resolution) — for v1, safe correctness wins: if any file changed, run the second pass over the entire branch graph. Per-file second-pass optimization is deferred.
  4. Persist the new hash map to Redis at the end (atomic — old map stays until new one is fully written).
  5. Project API — expose was_incremental: bool and files_changed: list[str] for callers.
  6. CLI: cgraph index . defaults to incremental when a graph already exists for (project, branch); new --full flag forces a full re-index.
  7. MCP tool: index_repo(..., incremental=True) is the default (consumed by [MCP T4] index_repo MCP tool #652); response includes mode: "full"|"incremental" and files_changed: list[str].
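
The hash diff in item 2 can be sketched as a pure function over two path → SHA256 maps. This is a minimal illustration, not the project's implementation: hash_file, HashDiff, and diff_hashes are hypothetical names, and walking the file tree is left to the caller.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class HashDiff:
    """Hypothetical container for the four buckets described in item 2."""
    new: list[str] = field(default_factory=list)
    modified: list[str] = field(default_factory=list)
    deleted: list[str] = field(default_factory=list)
    unchanged: list[str] = field(default_factory=list)

    @property
    def files_changed(self) -> list[str]:
        # Everything except the unchanged bucket counts as a change.
        return sorted(self.new + self.modified + self.deleted)


def diff_hashes(stored: dict[str, str], current: dict[str, str]) -> HashDiff:
    """Classify every path by comparing stored hashes with current ones."""
    diff = HashDiff()
    for path, digest in sorted(current.items()):
        if path not in stored:
            diff.new.append(path)
        elif stored[path] != digest:
            diff.modified.append(path)
        else:
            diff.unchanged.append(path)
    # Paths that were stored but no longer exist on disk.
    diff.deleted = sorted(set(stored) - set(current))
    return diff
```

With this shape, a renamed file shows up once under deleted and once under new, which matches the delete + add treatment of renames in v1.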

Edge cases handled

  • First-time indexing of a branch → falls back to full
  • Hash store missing or corrupted → falls back to full with a warning logged to stderr
  • File renames → treated as delete + add (rename detection deferred to Phase 2)
  • Aborted previous run leaving stale hashes → next full run overwrites
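
One way to keep an aborted run from leaving a half-written map (scope item 4) is to write the new map to a staging key and swap it in with RENAME, which Redis executes as a single atomic command. The sketch below assumes a redis-py-style client exposing delete, hset(name, mapping=...), and rename; persist_hashes and the ":staging" suffix are hypothetical names chosen for illustration.

```python
from typing import Any


def persist_hashes(client: Any, key: str, hashes: dict[str, str]) -> None:
    """Replace the hash map at `key` so readers never see a partial map."""
    staging = f"{key}:staging"  # hypothetical staging-key name
    client.delete(staging)  # clear leftovers from an earlier aborted run
    if not hashes:
        # HSET needs at least one field, so an empty repo just drops the key.
        client.delete(key)
        return
    client.hset(staging, mapping=hashes)
    # Until this line runs, the old map under `key` is still intact.
    client.rename(staging, key)
```

If the process dies before the rename, the old map stays under the original key and the next run simply clears the staging key and starts over.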

Scope (out)

  • Per-file second-pass / LSP optimization (Phase 2).
  • Rename detection (Phase 2).
  • Cross-branch incremental (each branch has its own hash store).
  • Watching the filesystem for changes (this is pull-based; user/agent calls index_repo).

Files

  • modified api/project.py (new incremental flag, hash diff orchestration, was_incremental / files_changed attributes)
  • modified api/info.py (new file-hash get/set helpers under {repo}:{branch}_files)
  • modified api/analyzers/source_analyzer.py (incremental orchestration over the changed-file set)
  • modified api/cli.py (--full flag on index and index-repo)
  • modified api/mcp/tools/structural.py (consume incremental flag; report mode + files_changed)
  • new tests/test_incremental_indexing.py

Acceptance criteria

  • Index fixture → re-index with no changes → second run reports mode=incremental, files_changed=[] and is significantly faster (assert via analyzer-call-count, not wall clock).
  • Modify one file → re-index → only that file's entities are deleted+re-added; other entities untouched (verify by node-id snapshot diff).
  • Delete a file → re-index → its entities are removed from the graph.
  • Add a new file → re-index → its entities appear.
  • First run on a fresh branch automatically falls back to full (no hash store yet).
  • --full CLI flag forces full re-index even when graph exists.
  • Corrupted hash store → falls back to full with a warning logged.
  • MCP index_repo integration test exercises an unchanged → modified → deleted → added sequence end-to-end.
  • Existing full-index tests still pass (incremental is opt-in at the API level, even if CLI defaults to it).
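
The call-count assertions in the first two criteria might look roughly like the sketch below. It is an assumption-laden illustration: run_first_pass is a hypothetical stand-in for the real orchestration, and the analyze and delete_files callables are injected fakes rather than the project's actual signatures. The second pass is not modeled.

```python
from typing import Callable


def run_first_pass(
    stored: dict[str, str],
    current: dict[str, str],
    analyze: Callable[[str], None],
    delete_files: Callable[[list[str]], None],
) -> list[str]:
    """Re-analyze only new/modified files; return the sorted changed paths."""
    modified = [p for p in current if p in stored and stored[p] != current[p]]
    new = [p for p in current if p not in stored]
    deleted = [p for p in stored if p not in current]
    stale = sorted(modified + deleted)
    if stale:
        delete_files(stale)  # drop old graph entities before re-analysis
    for path in sorted(modified + new):
        analyze(path)
    return sorted(modified + new + deleted)


def test_unchanged_reindex_skips_analyzer() -> None:
    calls: list[str] = []
    hashes = {"a.py": "h1", "b.py": "h2"}
    changed = run_first_pass(hashes, dict(hashes), calls.append, lambda paths: None)
    assert changed == []
    assert calls == []  # the analyzer never ran


def test_single_edit_reanalyzes_one_file() -> None:
    calls: list[str] = []
    removed: list[str] = []
    stored = {"a.py": "h1", "b.py": "h2"}
    current = {"a.py": "h1", "b.py": "h2-edited"}
    changed = run_first_pass(stored, current, calls.append, removed.extend)
    assert changed == ["b.py"]
    assert calls == ["b.py"]    # only the edited file was analyzed
    assert removed == ["b.py"]  # and only its old entities were deleted
```

Counting calls on an injected fake keeps the test deterministic, which is the reason the criterion rules out wall-clock timing.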

Dependencies

Notes for the implementer

  • Use SHA256 over file bytes (not mtime) — mtime is unreliable across git checkouts and CI environments.
  • The hash diff should be the only place that decides what to re-analyze. Don't sprinkle incremental logic deep into individual analyzers; orchestrate it in source_analyzer.py.
  • Be careful with delete_files() — it must remove all graph entities tied to a file (Functions, Classes, edges) without leaving orphans. Verify with a node-count assertion in the test.
  • The second-pass-over-everything decision is intentional for v1. Don't try to be clever here; the goal is correctness, and the first pass is where most of the win is.
  • When the hash store is missing/corrupted, log clearly to stderr so users notice and aren't surprised by a slow "incremental" run.
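
The fallback in the last note could be sketched as a validation step over the raw hash map read from Redis (for example the result of HGETALL). The function name load_hashes_or_none and the warning wording are hypothetical, and the sketch does not distinguish a first-time run from a lost store: both return None and warn.

```python
import re
import sys
from typing import Mapping, Optional

_SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def _as_text(value: object) -> Optional[str]:
    """Decode bytes (redis-py's default return type) or pass str through."""
    if isinstance(value, bytes):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            return None
    return value if isinstance(value, str) else None


def load_hashes_or_none(raw: Mapping, key: str) -> Optional[dict[str, str]]:
    """Validate a raw hash map; None means 'fall back to a full index'."""
    if not raw:
        print(f"warning: no file hashes under {key}; running a full index",
              file=sys.stderr)
        return None
    hashes: dict[str, str] = {}
    for raw_path, raw_digest in raw.items():
        path, digest = _as_text(raw_path), _as_text(raw_digest)
        if path is None or digest is None or not _SHA256_HEX.fullmatch(digest):
            print(f"warning: corrupted file hashes under {key}; "
                  "running a full index", file=sys.stderr)
            return None
        hashes[path] = digest
    return hashes
```

Treating any single malformed entry as corruption of the whole map is a deliberately conservative choice here: a full index is slow but always correct.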
