MCP T15 — Tree-sitter analyzer base class refactor #663

@DvirDukhan

Context

Today, PythonAnalyzer, JavaScriptAnalyzer, and KotlinAnalyzer all use tree-sitter, but each is hand-rolled with duplicated parser setup, query helpers, and traversal logic. Adding a new tree-sitter language (T16: Go, Rust, TypeScript, Ruby, C++) means copying and editing ~300 lines per language. This refactor extracts the shared scaffolding into a base class so each new language is a small subclass declaring only what's actually language-specific.

This is a strictly non-functional refactor — existing analyzer behavior and graph outputs must be byte-identical.

Scope (in)

  1. New api/analyzers/tree_sitter_base.py with a TreeSitterAnalyzer(AbstractAnalyzer) base class exposing hooks each subclass fills in:
    • language: tree_sitter.Language
    • node_type_to_label: dict[str, str] — tree-sitter node type → entity label
    • query_find_calls: str, query_find_classes: str, query_find_imports: str — tree-sitter query templates
    • extract_docstring(node) -> str | None
  2. Migrate the 3 existing analyzers onto the base class:
    • api/analyzers/python/analyzer.py
    • api/analyzers/javascript/analyzer.py
    • api/analyzers/kotlin/analyzer.py
  3. Documentation — base class docstring describes the contract subclasses must implement.
  4. Regression guard — new test that indexes a tiny multi-language project and asserts each analyzer produces the same node/edge counts as a recorded baseline.
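To make the subclass contract in item 1 concrete, here is a minimal sketch of how the hooks could be declared and checked. It is not the actual implementation: `AbstractAnalyzer` is a stand-in (its real interface is not shown in this issue), `language` is typed loosely so the sketch runs without tree-sitter installed, and the `__init_subclass__` check, `label_for`, `ToyPythonAnalyzer`, the label names, and the query strings are all illustrative assumptions.

```python
from abc import ABC, abstractmethod
from typing import Any, ClassVar, Optional


class AbstractAnalyzer(ABC):
    """Stand-in for the project's AbstractAnalyzer (real interface not shown here)."""


class TreeSitterAnalyzer(AbstractAnalyzer):
    """Sketch of the base class: subclasses declare only language-specific data."""

    # Hooks named in the issue. Annotated only, so a subclass must assign them.
    language: ClassVar[Any]  # a tree_sitter.Language in the real class
    node_type_to_label: ClassVar[dict[str, str]]
    query_find_calls: ClassVar[str]
    query_find_classes: ClassVar[str]
    query_find_imports: ClassVar[str]

    # Assumption: fail at class-definition time when a hook is missing.
    _REQUIRED = (
        "language",
        "node_type_to_label",
        "query_find_calls",
        "query_find_classes",
        "query_find_imports",
    )

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        missing = [name for name in cls._REQUIRED if not hasattr(cls, name)]
        if missing:
            raise TypeError(f"{cls.__name__} is missing hooks: {', '.join(missing)}")

    @abstractmethod
    def extract_docstring(self, node) -> Optional[str]:
        """Return the docstring attached to `node`, or None."""

    def label_for(self, node_type: str) -> Optional[str]:
        # Hypothetical shared helper: map a tree-sitter node type to an entity label.
        return self.node_type_to_label.get(node_type)


class ToyPythonAnalyzer(TreeSitterAnalyzer):
    """Illustrative subclass; values are placeholders, not the real analyzer's."""

    language = object()  # placeholder for the Python tree_sitter.Language
    node_type_to_label = {"class_definition": "Class", "function_definition": "Function"}
    query_find_calls = "(call) @call"
    query_find_classes = "(class_definition) @class"
    query_find_imports = "(import_statement) @import"

    def extract_docstring(self, node) -> Optional[str]:
        return None  # real logic would inspect the node's first child statement
```

The class-definition check is one option among several; a plain docstring contract plus the existing tests may be enough, and anything beyond the duplicated patterns should be left out.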

Scope (out)

  • New languages (T16).
  • Re-enabling C analyzer (T16).
  • Changing graph schema or analyzer outputs.
  • Performance optimization.

Files

  • new api/analyzers/tree_sitter_base.py
  • modified api/analyzers/python/analyzer.py
  • modified api/analyzers/javascript/analyzer.py
  • modified api/analyzers/kotlin/analyzer.py
  • new tests/analyzers/test_tree_sitter_base.py

Acceptance criteria

  • All existing analyzer tests in tests/ pass unchanged.
  • Each migrated analyzer file is shorter than before and contains no parser-setup boilerplate.
  • New base class is documented with a clear docstring describing the subclass contract.
  • Regression test indexes a tiny multi-language fixture and asserts node/edge counts match the recorded baseline (catches any silent behavior change).
  • make lint and make test clean.
  • No changes to graph schema (labels, relations) — verifiable by diffing fixture-graph snapshots before/after.
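The count and schema checks above could share one comparison helper. The sketch below assumes a snapshot is a plain dict keyed by analyzer name; the snapshot shape, the function name `snapshot_mismatches`, and the sample numbers are hypothetical, and the step that actually indexes the fixture is omitted because the project's indexing API is not shown in this issue.

```python
def snapshot_mismatches(baseline: dict, current: dict) -> list[str]:
    """Return one message per analyzer whose snapshot differs from the baseline."""
    problems = []
    # Walk the union of keys so a missing or extra analyzer is also reported.
    for analyzer in sorted(set(baseline) | set(current)):
        expected = baseline.get(analyzer)
        actual = current.get(analyzer)
        if expected != actual:
            problems.append(f"{analyzer}: expected {expected}, got {actual}")
    return problems


# Placeholder data: counts and labels are made up for illustration.
baseline = {
    "python": {"nodes": 12, "edges": 20, "labels": ["Class", "Function"]},
    "kotlin": {"nodes": 7, "edges": 9, "labels": ["Class", "Function"]},
}
unchanged = {name: dict(snapshot) for name, snapshot in baseline.items()}
drifted = {**unchanged, "kotlin": {"nodes": 7, "edges": 8, "labels": ["Class", "Function"]}}
```

In a test, `assert snapshot_mismatches(baseline, current) == []` gives a failure message that names the analyzer that drifted.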

Dependencies

Notes for the implementer

  • Start by reading the three existing analyzers side-by-side and listing the duplicated patterns. The base class should absorb exactly those patterns and nothing more — no speculative hooks.
  • Run make test after each analyzer migration, not just at the end. If Python migrates cleanly but JS doesn't, you want to know which step broke it.
  • Snapshot the node/edge counts of the test fixtures before starting the refactor; that's your regression baseline.
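One possible way to record the baseline before touching any analyzer is to write the counts to a JSON file with sorted keys, so later diffs stay stable. The helper names, the file layout, and the idea of storing the baseline as JSON are assumptions; how the counts are obtained from the graph is left out.

```python
import json
from pathlib import Path


def record_baseline(counts: dict, path) -> None:
    """Write per-analyzer counts to `path` as deterministic JSON."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys keeps the file byte-stable across runs, which makes diffs readable.
    path.write_text(json.dumps(counts, indent=2, sort_keys=True) + "\n")


def load_baseline(path) -> dict:
    """Read a baseline previously written by record_baseline."""
    return json.loads(Path(path).read_text())
```

Committing the recorded file alongside the fixture lets the regression test load it instead of hard-coding numbers.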

Labels: enhancement, mcp