Feat/hybrid search#16

Open
dorodb-web22 wants to merge 14 commits into AOSSIE-Org:main from dorodb-web22:feat/hybrid-search

Conversation


@dorodb-web22 dorodb-web22 commented Mar 7, 2026

Description

This PR introduces Hybrid Search to the AI module, building on top of the recent embedding pipeline by integrating keyword search and combining the results using Reciprocal Rank Fusion (RRF). This ensures that searches handle exact keyword matches reliably while still benefiting from semantic understanding.

Key additions:

  • Added a pure-TypeScript BM25 KeywordSearchEngine with standard English stop-word filtering and no external dependencies.
  • Created HybridSearchService that orchestrates semantic and keyword searches in parallel.
  • Implemented standard Reciprocal Rank Fusion (k=60) for mathematically sound, weightless ranking.
  • Added 26 behavior-driven unit tests covering both the BM25 logic and RRF integration.

All 77 AI-related tests are passing locally and TypeScript compilation is clean. I'm open to any feedback on the implementation or tuning parameters!
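The fusion step described above can be sketched as follows. This is an illustrative sketch of standard Reciprocal Rank Fusion with k = 60, not the PR's actual code; the `RankedItem` shape and result id format are assumptions. A result's fused score is the sum of 1 / (k + rank) over every ranked list in which it appears, which is why no per-engine weights are needed.

```typescript
// Illustrative RRF sketch (k = 60); types and names are assumptions, not the PR's API.
interface RankedItem {
  id: string;    // e.g. "noteId:chunkIndex"
  score: number; // engine-specific score (cosine similarity or BM25)
}

function reciprocalRankFusion(
  lists: RankedItem[][],
  k = 60,
): { id: string; fusionScore: number }[] {
  const fused = new Map<string, number>();
  for (const list of lists) {
    list.forEach((item, rank) => {
      // rank is 0-based here, so 1 / (k + rank + 1) matches the standard 1 / (k + rank)
      fused.set(item.id, (fused.get(item.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...fused.entries()]
    .map(([id, fusionScore]) => ({ id, fusionScore }))
    .sort((a, b) => b.fusionScore - a.fusionScore);
}
```

An item ranked highly by both engines outscores an item ranked first by only one, without ever comparing raw cosine and BM25 scores directly.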

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Hybrid search: BM25 keyword ranking + semantic ranking with reciprocal rank fusion
    • On-device embeddings with configurable model/limits and chunking for note indexing
    • In-memory vector store with cosine-similarity search, persistence, and related-note discovery
    • Centralized AI/search exports and shared types/config defaults
  • Tests

    • Extensive unit & integration tests for embeddings, vector store, semantic, keyword, and hybrid search
  • Chores

    • Project init: package manifest, TypeScript build configs, test runner config, and .gitignore entries


coderabbitai bot commented Mar 7, 2026

Walkthrough

Adds a TypeScript AI search stack: embedding service, in-memory persisted vector store, semantic and BM25 keyword search engines, a hybrid search coordinator with reciprocal rank fusion, shared types and barrel exports, project configs (TS, build, test), and extensive unit tests.

Changes

Cohort / File(s) Summary
Project config
/.gitignore, package.json, tsconfig.json, tsconfig.build.json, vitest.config.ts
Adds VCS ignores, package manifest, TypeScript configs for build, and Vitest test configuration and coverage settings.
Shared types & barrel
src/ai/types.ts, src/ai/index.ts
Introduces core data types, default configuration constants, and a central re-export barrel for AI modules.
Embeddings
src/ai/embeddings.ts
New EmbeddingService and EmbedFn type: lazy transformers import, injectable embed function, truncation heuristic, single/chunk embedding helpers, and initialization lifecycle.
Vector store
src/ai/vectorStore.ts
New in-memory VectorStore with cosine similarity, JSON persistence (version/model checks), search/findRelated, note-scoped operations, and utility APIs.
Semantic search
src/ai/semanticSearch.ts
New SemanticSearchService with ChunkFn/defaultChunker, embedding integration, indexing/replacement semantics, search/findRelated, persistence delegation, and note management.
Keyword search
src/ai/keywordSearch.ts
New BM25-based tokenize util and KeywordSearchEngine: inverted index, term frequencies, BM25 scoring, index/remove APIs, and stop-word filtering.
Hybrid search & fusion
src/ai/hybridSearch.ts
Adds reciprocalRankFusion and HybridSearchService combining semantic and keyword results with configurable weights, note filtering, and topK fusion behavior.
Tests
src/ai/__tests__/*.test.ts
Adds extensive unit tests covering embeddings, vector store, semantic search, keyword search, and hybrid search (mocks, persistence, edge cases, ranking and fusion).
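The keyword-search cohort above centers on BM25 scoring over a stop-word-filtered token stream. A minimal sketch of that idea follows; the function names, stop-word list, and parameter defaults (k1 = 1.2, b = 0.75) are illustrative assumptions, not the PR's actual KeywordSearchEngine API.

```typescript
// Minimal BM25 sketch; names and the stop-word list are assumptions for illustration.
const STOP_WORDS = new Set(["the", "a", "an", "and", "or", "is", "to", "of"]);

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0 && !STOP_WORDS.has(t));
}

function bm25Score(
  tf: number,        // term frequency in this document
  df: number,        // number of documents containing the term
  docLen: number,    // token count of this document
  avgDocLen: number, // mean token count across the corpus
  totalDocs: number,
  k1 = 1.2,
  b = 0.75,
): number {
  // Rare terms get a larger inverse-document-frequency boost
  const idf = Math.log(1 + (totalDocs - df + 0.5) / (df + 0.5));
  // Term-frequency saturation with length normalization
  const norm = tf * (k1 + 1);
  const denom = tf + k1 * (1 - b + b * (docLen / avgDocLen));
  return idf * (norm / denom);
}
```

A chunk's score for a query is the sum of `bm25Score` over the query's tokens, which is what the inverted index and term-frequency maps in the cohort summary exist to compute quickly.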

Sequence Diagram

sequenceDiagram
    actor User
    participant Hybrid as HybridSearchService
    participant Semantic as SemanticSearchService
    participant Keyword as KeywordSearchEngine
    participant Embedding as EmbeddingService
    participant Store as VectorStore
    participant RRF as RRF

    User->>Hybrid: search(query, options)
    par semantic path
        Hybrid->>Semantic: search(query, options)
        Semantic->>Embedding: embed(query)
        Embedding-->>Semantic: queryVector
        Semantic->>Store: search(queryVector, topK, minScore)
        Store-->>Semantic: semanticResults
        Semantic-->>Hybrid: semanticResults
    and keyword path
        Hybrid->>Keyword: search(query, topK)
        Keyword-->>Hybrid: keywordResults
    end
    Hybrid->>RRF: reciprocalRankFusion([semanticResults, keywordResults], k)
    RRF-->>Hybrid: fusedResults
    Hybrid-->>User: topK fusedResults
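The parallel fan-out shown in the diagram can be sketched as a `Promise.all` over both engines followed by fusion. The interfaces below are illustrative stand-ins for the PR's services, and the inlined RRF (k = 60) is an assumption based on the description above.

```typescript
// Sketch of the diagram's fan-out; interfaces are illustrative, not the PR's API.
interface Ranked { id: string; score: number }

interface SemanticSearcher {
  search(query: string, topK: number): Promise<Ranked[]>;
}
interface KeywordSearcher {
  search(query: string, topK: number): Ranked[];
}

async function hybridSearch(
  semantic: SemanticSearcher,
  keyword: KeywordSearcher,
  query: string,
  topK: number,
): Promise<string[]> {
  // Run both engines concurrently; the keyword engine is synchronous but is
  // wrapped so a future async implementation would not change the call site.
  const [semanticResults, keywordResults] = await Promise.all([
    semantic.search(query, topK),
    Promise.resolve(keyword.search(query, topK)),
  ]);
  // Fuse by reciprocal rank and keep the top-k ids.
  const fused = new Map<string, number>();
  for (const list of [semanticResults, keywordResults]) {
    list.forEach((item, rank) => {
      fused.set(item.id, (fused.get(item.id) ?? 0) + 1 / (60 + rank + 1));
    });
  }
  return [...fused.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id]) => id);
}
```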

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Typescript Lang

Poem

🐰 I nibble tokens, hop through streams of text,
Vectors hum and headings knit the rest,
Keywords and meaning tumble, blend, and play,
Fusion hops in, ranking bright as day,
Hooray — search gardens bloom and dance away! 🌿

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title "Feat/hybrid search" directly relates to the main changeset, which introduces HybridSearchService combining semantic and keyword search with RRF fusion.



@github-actions github-actions bot added size/XL and removed size/XL labels Mar 7, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai/__tests__/hybridSearch.test.ts`:
- Around line 42-51: The test helper buildService creates a VectorStore without
a modelId, so index compatibility won't be checked; update buildService to pass
a modelId (e.g., use the EmbeddingService's modelId) into the VectorStore config
when constructing it: after creating embeddingService (EmbeddingService),
include modelId: embeddingService.modelId in the new VectorStore({...}) options
so VectorStore performs the same model compatibility/invalidation behavior as in
production before returning the HybridSearchService.

In `@src/ai/__tests__/vectorStore.test.ts`:
- Around line 183-195: The test "skips write when nothing changed" currently
only checks file contents and can pass even if the file was rewritten; change it
to assert that the file was not rewritten by importing stat from
node:fs/promises and using stat(join(tempDir, "test-index.json")) to capture
file metadata (e.g., mtimeMs or size) before the second await store.save(), call
store.save() again, then stat the file again and expect the metadata to be
unchanged; alternatively, spy on the write path used by the VectorStore
persistence (the function that writes the file) to assert it was not called on
the second save. Also add the stat import to the top-level imports.

In `@src/ai/embeddings.ts`:
- Around line 88-91: The embedSingle method currently returns results[0] without
verifying the array is non-empty; update embedSingle to defensively validate the
output of embed (call to this.embed([text])) — check that results is an array
with at least one element and that results[0] is defined, and if not throw a
clear error (or handle per project error conventions) so callers won't receive
undefined; modify the embedSingle function to perform this guard and surface a
descriptive error referencing embedSingle/embed.

In `@src/ai/hybridSearch.ts`:
- Around line 7-29: reciprocalRankFusion is currently writing RRF values into
SearchResult.score which breaks the documented semantic (cosine) score; update
the implementation to preserve the original cosine similarity and expose the
fusion ranking in a separate field or type (e.g., create a new
HybridSearchResult type with fields { chunk: TextChunk; score: number;
fusionScore: number } or add fusionScore to the returned objects), set
fusionScore to the computed rrfScore while keeping score as the original
item.score, and do the same change in HybridSearchService.search (the block
referenced by lines 65-94) so callers receive the original similarity in score
and the fusion metric in fusionScore or the new hybrid result type.
- Around line 37-56: Currently indexNote calls semanticSearch.indexNote(noteId,
content) which re-chunks internally while the hybrid layer separately calls
this.chunkFn(noteId, content) for keyword indexing; instead generate chunks once
(const chunks = this.chunkFn(noteId, content)) and pass that same chunks array
to both engines (e.g., change semanticSearch.indexNote to accept chunks or add a
new method like semanticSearch.indexChunks(noteId, chunks)) then call
semanticSearch.indexChunks(noteId, chunks) and keywordEngine.indexChunks(chunks)
after removing by noteId so both engines index identical chunk objects.

In `@src/ai/semanticSearch.ts`:
- Around line 42-46: The initialize() sequence can leave embeddingService
initialized if vectorStore.load() throws; wrap the vectorStore.load() call in a
try/catch and on failure perform a deterministic rollback of the embedding
service (call the appropriate cleanup/shutdown/dispose method on
embeddingService, e.g., embeddingService.shutdown() or
embeddingService.dispose()) before rethrowing the error so state remains
consistent; keep the calls inside the existing initialize() method and preserve
the original error propagation.
- Around line 84-108: The hardcoded minScore 0.4 inside findRelatedNotes (used
in vectorStore.findRelated) should be extracted and documented: add a
configurable parameter or class-level constant (e.g., a new optional parameter
minScore: number = 0.4 on findRelatedNotes or a private readonly
DEFAULT_MIN_SCORE = 0.4 on the containing class) and replace the literal 0.4
with that identifier; update the method signature and call sites accordingly and
add a short comment explaining the threshold purpose and default.

In `@src/ai/vectorStore.ts`:
- Around line 123-124: Wrap the JSON.parse call that converts the file content
into a SerializedVectorStore in a try/catch to handle SyntaxError from
malformed/corrupted index files: around the lines using readFile(filePath,
"utf-8") and JSON.parse(raw) catch parsing errors, log or include the original
error details and throw a new, user-friendly error indicating the index file at
filePath is invalid and should be rebuilt (preserve the original error as cause
or append it to the message) so callers of the vector store load routine can
surface a clear rebuild instruction.
- Around line 151-157: The loaded index JSON needs structural validation before
populating this.entries: verify that the top-level data is an object and
data.entries is an array, then iterate and check each entry has entry.chunk &&
entry.chunk.id (string/number) and entry.vector (array/typed values) before
calling this.entries.set; for any invalid/missing fields either skip the entry
with a descriptive processLogger.warn or throw a clear Error that includes which
entry/index failed, so the loader (the method that currently clears this.entries
and loops over data.entries) fails with an actionable message instead of a
cryptic runtime error.

In `@vitest.config.ts`:
- Around line 3-14: Add explicit Vitest/Vite ESM handling for the
`@huggingface/transformers` package so tests that exercise real initialize() paths
(e.g., in embeddings.ts where initialize() may dynamic-import the library or
when embedFn is not provided) don't break under Node ESM resolution; update the
vitest.config.ts test config to include the package in server.deps.inline or
ssr.noExternal (or both) to force inlining/noExternal for
"@huggingface/transformers" so the module resolves correctly during tests and
runtime.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 399f4a0c-1dda-4f24-aa00-d754cf51318e

📥 Commits

Reviewing files that changed from the base of the PR and between a3ccb2b and 3b0db18.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (17)
  • .gitignore
  • package.json
  • src/ai/__tests__/embeddings.test.ts
  • src/ai/__tests__/hybridSearch.test.ts
  • src/ai/__tests__/keywordSearch.test.ts
  • src/ai/__tests__/semanticSearch.test.ts
  • src/ai/__tests__/vectorStore.test.ts
  • src/ai/embeddings.ts
  • src/ai/hybridSearch.ts
  • src/ai/index.ts
  • src/ai/keywordSearch.ts
  • src/ai/semanticSearch.ts
  • src/ai/types.ts
  • src/ai/vectorStore.ts
  • tsconfig.build.json
  • tsconfig.json
  • vitest.config.ts

Comment on lines +42 to +51
function buildService(embedFn?: EmbedFn) {
const embeddingService = new EmbeddingService({}, embedFn ?? createMockEmbedFn());
const vectorStore = new VectorStore({
persistDir: "/tmp/hybrid-test",
indexFilename: "test-index.json",
});
const semantic = new SemanticSearchService(embeddingService, vectorStore);
const keyword = new KeywordSearchEngine();
return new HybridSearchService(semantic, keyword);
}

🧹 Nitpick | 🔵 Trivial

Test buildService omits modelId from VectorStore config.

The VectorStore is created without modelId, which means it won't perform model compatibility checks during load. While acceptable for isolated tests, this deviates from production behavior where modelId enables index invalidation on model changes.

Consider adding modelId to align test configuration with production:

     const vectorStore = new VectorStore({
         persistDir: "/tmp/hybrid-test",
         indexFilename: "test-index.json",
+        modelId: "mock-model",
     });

Comment on lines +183 to +195
it("skips write when nothing changed", async () => {
store.add([makeChunk("n1", 0, [1, 0, 0])]);
await store.save();
await store.save(); // second save should be no-op

const content = await readFile(
join(tempDir, "test-index.json"),
"utf-8",
);
const data = JSON.parse(content);
expect(data.version).toBe(1);
expect(data.entries).toHaveLength(1);
});

⚠️ Potential issue | 🟡 Minor

This test does not prove the second save() was a no-op.

It still passes if the implementation rewrites the same JSON on the second call. Assert file metadata stability or spy on the write path so the test actually covers the optimization named here. Also add stat to the node:fs/promises import at Line 2.

Proposed fix
 it("skips write when nothing changed", async () => {
     store.add([makeChunk("n1", 0, [1, 0, 0])]);
     await store.save();
+    const before = await stat(join(tempDir, "test-index.json"));
     await store.save(); // second save should be no-op
+    const after = await stat(join(tempDir, "test-index.json"));

     const content = await readFile(
         join(tempDir, "test-index.json"),
         "utf-8",
     );
     const data = JSON.parse(content);
+    expect(after.mtimeMs).toBe(before.mtimeMs);
     expect(data.version).toBe(1);
     expect(data.entries).toHaveLength(1);
 });

As per coding guidelines, **/*.test.{ts,tsx,js,jsx}: The tests are not tautological.


Comment on lines +88 to +91
async embedSingle(text: string): Promise<number[]> {
const results = await this.embed([text]);
return results[0];
}

🧹 Nitpick | 🔵 Trivial

embedSingle assumes non-empty result from embed.

Line 90 returns results[0] without checking if results is non-empty. While embed guards against empty input (line 77), if called with a single text, results will have one element. However, a defensive check could prevent undefined behavior if internal invariants change.

🛡️ Optional defensive check
     async embedSingle(text: string): Promise<number[]> {
         const results = await this.embed([text]);
+        if (results.length === 0) {
+            throw new Error("Embedding returned no results");
+        }
         return results[0];
     }

Comment on lines +37 to +56
constructor(
semanticSearch: SemanticSearchService,
keywordEngine: KeywordSearchEngine,
chunkFn: ChunkFn = defaultChunker,
) {
this.semanticSearch = semanticSearch;
this.keywordEngine = keywordEngine;
this.chunkFn = chunkFn;
}

async initialize(): Promise<void> {
await this.semanticSearch.initialize();
}

async indexNote(noteId: string, content: string): Promise<number> {
const chunkCount = await this.semanticSearch.indexNote(noteId, content);
const chunks = this.chunkFn(noteId, content);
this.keywordEngine.removeByNoteId(noteId);
this.keywordEngine.indexChunks(chunks);
return chunkCount;

⚠️ Potential issue | 🟠 Major

Index both engines from the same chunk list.

semanticSearch.indexNote() chunks the note internally, then the hybrid layer re-chunks the raw content for BM25. If those chunkers ever diverge, the same noteId:index can refer to different text in each engine and RRF will fuse unrelated chunks. Generate chunks once and feed the exact same list to both indexes.
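One way to realize this suggestion is sketched below. The `indexChunks` method on the semantic service is hypothetical (the PR would need to add it); the keyword engine's `removeByNoteId`/`indexChunks` mirror the methods already shown in the snippet above, and `TextChunk` is an assumed shape.

```typescript
// Sketch of the suggested fix: chunk once, index the same array in both engines.
// `SemanticIndexer.indexChunks` is hypothetical; other names mirror the PR snippet.
interface TextChunk { id: string; noteId: string; text: string }

interface SemanticIndexer {
  indexChunks(noteId: string, chunks: TextChunk[]): Promise<void>;
}
interface KeywordIndexer {
  removeByNoteId(noteId: string): void;
  indexChunks(chunks: TextChunk[]): void;
}

async function indexNoteOnce(
  semantic: SemanticIndexer,
  keyword: KeywordIndexer,
  chunkFn: (noteId: string, content: string) => TextChunk[],
  noteId: string,
  content: string,
): Promise<number> {
  // Chunk exactly once so both engines index identical chunk objects and ids,
  // guaranteeing RRF fuses results that refer to the same text.
  const chunks = chunkFn(noteId, content);
  keyword.removeByNoteId(noteId);
  keyword.indexChunks(chunks);
  await semantic.indexChunks(noteId, chunks);
  return chunks.length;
}
```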


Comment on lines +42 to +46
/** Load embedding model + persisted index. Call before anything else. */
async initialize(): Promise<void> {
await this.embeddingService.initialize();
await this.vectorStore.load();
}

⚠️ Potential issue | 🟡 Minor

Potential inconsistent state if vectorStore.load() throws.

If vectorStore.load() throws (e.g., on model mismatch per VectorStore.load() at lines 145-148), embeddingService remains initialized but the vector store is not loaded. This creates asymmetric state and the error isn't caught for recovery.

Consider wrapping the initialization sequence or documenting the expected caller behavior:

🛡️ Proposed fix to handle vectorStore.load() errors
     async initialize(): Promise<void> {
         await this.embeddingService.initialize();
-        await this.vectorStore.load();
+        try {
+            await this.vectorStore.load();
+        } catch (error) {
+            // Log and rethrow - caller should handle index rebuild
+            console.error("VectorStore load failed:", error);
+            throw error;
+        }
     }

Comment on lines +84 to +108
/** Find notes semantically related to a given note, deduplicated by noteId. */
async findRelatedNotes(
noteId: string,
topK: number = 5,
): Promise<SearchResult[]> {
// Map keeps the highest-scoring chunk per note — the old Set approach
// kept the first hit, which could discard a better score from a later
// source chunk iteration.
const bestByNoteId = new Map<string, SearchResult>();

for (const entry of this.getAllChunksForNote(noteId)) {
const related = this.vectorStore.findRelated(entry.id, topK * 2, 0.4);
for (const result of related) {
if (result.chunk.noteId === noteId) continue;
const existing = bestByNoteId.get(result.chunk.noteId);
if (!existing || result.score > existing.score) {
bestByNoteId.set(result.chunk.noteId, result);
}
}
}

return Array.from(bestByNoteId.values())
.sort((a, b) => b.score - a.score)
.slice(0, topK);
}

🧹 Nitpick | 🔵 Trivial

Magic number for minScore in findRelatedNotes.

The hardcoded 0.4 minScore (line 95) lacks documentation explaining why this threshold was chosen. Consider extracting it as a configurable parameter or adding a comment.

♻️ Suggested refactor
     async findRelatedNotes(
         noteId: string,
         topK: number = 5,
+        minScore: number = 0.4,
     ): Promise<SearchResult[]> {
-        // ...
         for (const entry of this.getAllChunksForNote(noteId)) {
-            const related = this.vectorStore.findRelated(entry.id, topK * 2, 0.4);
+            const related = this.vectorStore.findRelated(entry.id, topK * 2, minScore);

Comment on lines +151 to +157
this.entries.clear();
for (const entry of data.entries) {
this.entries.set(entry.chunk.id, {
chunk: entry.chunk,
vector: entry.vector,
});
}

🧹 Nitpick | 🔵 Trivial

Consider adding structural validation for loaded data.

If the index file contains valid JSON but an unexpected structure (e.g., entries is not an array, or entries lack required chunk.id properties), the code will fail with a cryptic error. A brief structural check would improve debuggability.

♻️ Optional defensive validation
+        if (!Array.isArray(data.entries)) {
+            throw new Error(
+                `VectorStore: Invalid index structure. Delete "${filePath}" to rebuild.`
+            );
+        }
+
         this.entries.clear();
         for (const entry of data.entries) {
             this.entries.set(entry.chunk.id, {

Comment on lines +3 to +14
export default defineConfig({
test: {
globals: true,
environment: "node",
include: ["src/**/__tests__/**/*.test.ts"],
coverage: {
provider: "v8",
include: ["src/ai/**/*.ts"],
exclude: ["src/ai/__tests__/**", "src/ai/index.ts"],
},
},
});

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

# Check package.json for `@huggingface/transformers` dependency
grep -i "huggingface\|transformers" package.json

Repository: AOSSIE-Org/SmartNotes

Length of output: 106


🏁 Script executed:

# Search for any imports of `@huggingface/transformers` in the codebase
rg "@huggingface/transformers|from.*transformers" --type ts --type tsx --type js --type json

Repository: AOSSIE-Org/SmartNotes

Length of output: 92


🏁 Script executed:

# Read the full vitest.config.ts to see current configuration
cat -n vitest.config.ts

Repository: AOSSIE-Org/SmartNotes

Length of output: 531


🏁 Script executed:

# Check if there are any embeddings-related files or tests
fd -e ts -e tsx | grep -i embed

Repository: AOSSIE-Org/SmartNotes

Length of output: 121


🏁 Script executed:

# Check for any existing ESM/SSR configuration in vitest config
rg "ssr|inline|external|deps" vitest.config.ts

Repository: AOSSIE-Org/SmartNotes

Length of output: 47


🏁 Script executed:

# Check embeddings.ts imports
cat -n src/ai/embeddings.ts

Repository: AOSSIE-Org/SmartNotes

Length of output: 4511


🏁 Script executed:

# Check embeddings.test.ts content
cat -n src/ai/__tests__/embeddings.test.ts

Repository: AOSSIE-Org/SmartNotes

Length of output: 6516


🏁 Script executed:

# Search for all references to `@huggingface` in the codebase
rg "@huggingface" --type ts --type js

Repository: AOSSIE-Org/SmartNotes

Length of output: 249


Consider adding explicit Vitest configuration for @huggingface/transformers ESM handling.

While @huggingface/transformers (^3.4.0) is a declared dependency and embeddings.ts lazy-loads it via dynamic import, the current Vitest config has no ESM/external-package handling. The suite passes today only because the tests inject mock implementations and never trigger the real import; any test that exercises the actual initialize() path without a custom embedFn may hit Node ESM resolution failures unless the package is listed in server.deps.inline or ssr.noExternal. This is purely precautionary.
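The suggested guard might look like the following sketch of vitest.config.ts; the `server.deps.inline` shape assumes a recent Vitest, and the exact option spelling should be checked against the installed version:

```typescript
// vitest.config.ts — sketch only; assumes Vitest's `server.deps.inline`
// option, which forces the listed packages to be processed by Vite
// instead of resolved through Node's ESM loader during tests.
import { defineConfig } from "vitest/config";

export default defineConfig({
    test: {
        globals: true,
        environment: "node",
        include: ["src/**/__tests__/**/*.test.ts"],
        server: {
            deps: {
                // Inline the ESM-only package so a real dynamic import in
                // initialize() resolves correctly under the test runner.
                inline: ["@huggingface/transformers"],
            },
        },
    },
});
```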

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vitest.config.ts` around lines 3 - 14, Add explicit Vitest/Vite ESM handling
for the `@huggingface/transformers` package so tests that exercise real
initialize() paths (e.g., in embeddings.ts where initialize() may dynamic-import
the library or when embedFn is not provided) don't break under Node ESM
resolution; update the vitest.config.ts test config to include the package in
server.deps.inline or ssr.noExternal (or both) to force inlining/noExternal for
"@huggingface/transformers" so the module resolves correctly during tests and
runtime.

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 7, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
src/ai/hybridSearch.ts (1)

55-60: ⚠️ Potential issue | 🟠 Major

Index both engines from the same chunk array.

SemanticSearchService.indexNote() already re-chunks internally (src/ai/semanticSearch.ts:42-62), and this method chunks the raw content again for BM25. If those chunkers ever diverge, the same chunk ID can point to different text in each engine and RRF will fuse unrelated hits. Generate the chunks once here and feed that exact array to both indexes.
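The one-pass indexing the comment asks for could look like this sketch; `chunkNote`, `HybridIndexSketch`, and the Map-backed indexes are hypothetical stand-ins for the project's real chunker and engines:

```typescript
interface Chunk {
    id: string;
    noteId: string;
    content: string;
}

// Stand-in chunker: fixed-size windows. The real project chunker differs;
// the point is that BOTH engines consume the output of one chunking pass.
function chunkNote(noteId: string, content: string, size = 40): Chunk[] {
    const chunks: Chunk[] = [];
    for (let i = 0; i * size < content.length; i++) {
        chunks.push({
            id: `${noteId}:${i}`,
            noteId,
            content: content.slice(i * size, (i + 1) * size),
        });
    }
    return chunks;
}

class HybridIndexSketch {
    readonly semanticIndex = new Map<string, Chunk>();
    readonly keywordIndex = new Map<string, Chunk>();

    indexNote(noteId: string, content: string): void {
        const chunks = chunkNote(noteId, content); // chunk exactly once
        for (const c of chunks) {
            // Both engines see the same array, so a chunk ID can never
            // point at different text in the two indexes.
            this.semanticIndex.set(c.id, c);
            this.keywordIndex.set(c.id, c);
        }
    }
}
```

Since RRF fuses by chunk ID, keeping both indexes fed from one array is what guarantees a fused hit refers to one piece of text.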

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/hybridSearch.ts` around lines 55 - 60, The code currently chunks
content twice causing mismatched chunk IDs; call this.chunkFn(noteId, content)
once to produce a single chunks array and feed that same array to both engines.
Update the call site in hybridSearch.indexNote to use the single chunks array
and change SemanticSearchService.indexNote (or add a new
semanticSearch.indexChunks) to accept the precomputed chunks array instead of
raw content so it does not re-chunk internally; then pass chunks to
keywordEngine.indexChunks(chunks) and to the semantic search method, and remove
the internal re-chunking in SemanticSearchService.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 05c80561-5fea-473f-a2bf-67aaf9d9ebff

📥 Commits

Reviewing files that changed from the base of the PR and between 3b0db18 and 3a26f06.

📒 Files selected for processing (4)
  • src/ai/__tests__/hybridSearch.test.ts
  • src/ai/hybridSearch.ts
  • src/ai/index.ts
  • src/ai/types.ts

Comment on lines +42 to +47
function buildService(embedFn?: EmbedFn) {
    const embeddingService = new EmbeddingService({}, embedFn ?? createMockEmbedFn());
    const vectorStore = new VectorStore({
        persistDir: "/tmp/hybrid-test",
        indexFilename: "test-index.json",
    });

⚠️ Potential issue | 🟡 Minor

Use a unique temp directory per test helper.

buildService() always points VectorStore at /tmp/hybrid-test/test-index.json, and beforeEach() calls initialize(), which can load whatever was left there by another run. That makes this suite non-hermetic. Use a per-test temp dir or clean the path in setup/teardown.
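A hermetic variant of the helper might allocate its directory like this; the `hybrid-test-` prefix and helper names are illustrative:

```typescript
// Per-test temp dirs: mkdtempSync appends six random characters to the
// prefix, so two invocations never collide, and teardown can remove the
// directory without touching any other run's state.
import { mkdtempSync, rmSync, existsSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function makeTestDir(): string {
    return mkdtempSync(join(tmpdir(), "hybrid-test-"));
}

function removeTestDir(dir: string): void {
    // force: true makes cleanup idempotent if a test already removed it.
    rmSync(dir, { recursive: true, force: true });
}
```

buildService() would then call makeTestDir() per invocation and hand the path to VectorStore, with removeTestDir() in afterEach().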

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/__tests__/hybridSearch.test.ts` around lines 42 - 47, buildService
currently hardcodes VectorStore to "/tmp/hybrid-test", making tests
non-hermetic; change buildService to create a unique temp directory per
invocation (e.g., using fs.mkdtempSync or a UUID appended to os.tmpdir()) and
pass that path into the new VectorStore({ persistDir, indexFilename }) so each
test gets its own directory, and ensure the test suite's
beforeEach()/afterEach() (or initialize()/teardown) removes or cleans that temp
dir after the test to avoid cross-test contamination.

Comment on lines +51 to +53
async initialize(): Promise<void> {
    await this.semanticSearch.initialize();
}

⚠️ Potential issue | 🟠 Major

Persist or rebuild the BM25 index on startup.

initialize() only restores the semantic store, and save() only persists that same store. After a restart, keywordEngine comes back empty, so hybrid search silently degrades to semantic-only until every note is reindexed. Rebuild BM25 from persisted chunks during initialize(), or persist keyword state alongside the vector store.

Also applies to: 100-101
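The rebuild-on-startup option could be sketched as follows; `indexChunks`, `persistedChunks`, and the sketch classes are hypothetical names, not the project's actual API:

```typescript
interface Chunk { id: string; noteId: string; content: string; }

// Minimal keyword-engine stand-in: real BM25 term statistics are elided.
class KeywordEngineSketch {
    private docs = new Map<string, Chunk>();
    indexChunks(chunks: Chunk[]): void {
        for (const c of chunks) this.docs.set(c.id, c);
    }
    size(): number { return this.docs.size; }
}

class HybridSearchSketch {
    constructor(
        // Supplier for the chunks the vector store just restored from
        // disk, e.g. a hypothetical vectorStore.allChunks().
        private persistedChunks: () => Chunk[],
        readonly keywordEngine = new KeywordEngineSketch(),
    ) {}

    async initialize(): Promise<void> {
        // 1. restore the semantic store from disk (elided here)
        // 2. replay the same persisted chunks into BM25, so the keyword
        //    side is never empty after a restart
        this.keywordEngine.indexChunks(this.persistedChunks());
    }
}
```

Rebuilding from the vector store's chunks avoids a second persistence format; serializing BM25 state directly trades startup CPU for a larger index file.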

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/hybridSearch.ts` around lines 51 - 53, The initialize() method
restores only the semantic store (semanticSearch) but does not rebuild or
restore the BM25 keyword index (keywordEngine), so keywordEngine is empty after
restart; update initialize() to load persisted chunks/documents (the same source
semanticSearch uses) and rebuild or re-index keywordEngine from those persisted
chunks, and update save() to persist either the keywordEngine state or the
chunks/documents used to rebuild it (so keywordEngine can be reconstructed on
startup); reference the initialize() and save() methods, keywordEngine,
semanticSearch and the persisted chunks/documents used by the vector store when
implementing this.

Comment on lines +76 to +87
const [semanticResults, keywordResults] = await Promise.all([
    this.semanticSearch.search(query, {
        topK: broadK,
        minScore: 0,
        noteId: opts.noteId,
    }),
    Promise.resolve(this.keywordEngine.search(query, broadK)),
]);

const filteredKeyword = opts.noteId
    ? keywordResults.filter((r) => r.chunk.noteId === opts.noteId)
    : keywordResults;

⚠️ Potential issue | 🟠 Major

Apply noteId filtering before keyword topK.

KeywordSearchEngine.search() truncates with slice(0, topK) before returning (src/ai/keywordSearch.ts:82-113). Filtering that reduced list afterward means note-scoped searches can miss valid keyword matches whenever other notes occupy the first broadK slots. Move noteId filtering into the keyword search path, or fetch all keyword matches before slicing.
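Moving the filter ahead of the truncation could look like this sketch; the `Hit` shape and the optional `noteId` parameter are illustrative, not the engine's actual signature:

```typescript
interface Hit { chunkId: string; noteId: string; score: number; }

// Scope filter runs BEFORE the slice, so a note's matches survive even
// when other notes would otherwise dominate the top-K positions.
function keywordSearchSketch(
    scored: Hit[],      // all BM25-scored hits, unsorted
    topK: number,
    noteId?: string,    // optional per-note scope
): Hit[] {
    return scored
        .filter((h) => noteId === undefined || h.noteId === noteId)
        .sort((a, b) => b.score - a.score)
        .slice(0, topK);
}
```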

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/hybridSearch.ts` around lines 76 - 87, The keyword results are being
truncated by KeywordSearchEngine.search before you apply the noteId filter,
causing note-scoped searches to miss matches; fix by applying noteId filtering
inside the keyword search path: either extend KeywordSearchEngine.search to
accept an optional noteId (and filter before slice) and call
keywordEngine.search(query, broadK, opts.noteId) here, or request the
full/un-truncated set (e.g., topK = Infinity or a large number) so you can
filter by opts.noteId in this module before slicing to broadK; update references
to keywordEngine.search and ensure filteredKeyword is produced from the
pre-sliced results.


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
src/ai/hybridSearch.ts (2)

72-83: ⚠️ Potential issue | 🟠 Major

Filter keyword hits before truncating to topK.

KeywordSearchEngine.search() slices before returning, so filtering that reduced list at Line 81 can drop valid note-scoped matches whenever other notes occupy the first broadK BM25 slots.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/hybridSearch.ts` around lines 72 - 83, Keyword search results are
pre-sliced by KeywordSearchEngine.search, so filtering by opts.noteId after that
(into filteredKeyword) can drop valid note-scoped hits; change the flow to fetch
untruncated keyword hits (e.g., expose/use a searchAll / unsliced variant or
call keywordEngine.search with a sufficiently large K) then apply the noteId
filter (opts.noteId) to that full set and only after filtering truncate to
broadK (used alongside semanticResults); update usages around
keywordEngine.search, filteredKeyword, broadK, and opts.noteId accordingly.

47-49: ⚠️ Potential issue | 🟠 Major

Persist or rebuild the BM25 index on startup.

initialize() restores only the semantic store, and save() persists only that same state. After a restart, keywordEngine is empty, so hybrid search silently degrades to semantic-only until every note is reindexed.

Also applies to: 96-98

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/hybridSearch.ts` around lines 47 - 49, initialize() currently only
restores the semantic store so keywordEngine/BM25 is empty after restart; update
initialize() (and the save() path referenced near lines 96-98) to persist and
restore the BM25/keywordEngine state or trigger a rebuild: add serialization of
the BM25 index when save() runs and load/deserializer logic in initialize() to
repopulate keywordEngine, and if no serialized snapshot exists call the existing
reindex routine (e.g., reindexAll or the method that iterates notes) to rebuild
the BM25 index on startup; ensure you reference and update keywordEngine,
initialize(), save(), and the BM25 serialization/rebuild helpers so hybridSearch
is fully restored after restart.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bb239b31-74cd-431f-891e-2c3a8539b8d7

📥 Commits

Reviewing files that changed from the base of the PR and between 3a26f06 and dbcab75.

📒 Files selected for processing (2)
  • src/ai/hybridSearch.ts
  • src/ai/semanticSearch.ts


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
src/ai/vectorStore.ts (1)

149-165: ⚠️ Potential issue | 🟠 Major

Validate every loaded row before rebuilding the map.

This path only proves that the JSON parses and that the first entry matches vectorDim. Valid JSON with the wrong shape can still fail with opaque TypeErrors at Line 149 or Line 161, and a later mixed-dimension row slips through until search time. Validate entries up front and check each row's chunk fields plus vector length before populating this.entries.

Proposed fix
+        if (!Array.isArray(data.entries)) {
+            throw new Error(
+                `VectorStore: invalid index structure in "${filePath}". ` +
+                `Delete the file to rebuild the index.`,
+            );
+        }
+
-        if (data.vectorDim !== undefined && data.entries.length > 0) {
-            const actualDim = data.entries[0].vector.length;
-            if (actualDim !== data.vectorDim) {
-                throw new Error(
-                    `VectorStore: index vectorDim mismatch ` +
-                    `(stored ${data.vectorDim}, found ${actualDim}). Index may be corrupted.`,
-                );
-            }
-        }
-
         this.entries.clear();
-        for (const entry of data.entries) {
+        for (const [index, entry] of data.entries.entries()) {
+            if (
+                !entry ||
+                !entry.chunk ||
+                typeof entry.chunk.id !== "string" ||
+                typeof entry.chunk.noteId !== "string" ||
+                typeof entry.chunk.content !== "string" ||
+                typeof entry.chunk.chunkIndex !== "number" ||
+                !Array.isArray(entry.vector)
+            ) {
+                throw new Error(
+                    `VectorStore: invalid entry at index ${index} in "${filePath}". ` +
+                    `Delete the file to rebuild the index.`,
+                );
+            }
+            if (
+                data.vectorDim !== undefined &&
+                entry.vector.length !== data.vectorDim
+            ) {
+                throw new Error(
+                    `VectorStore: invalid vector dimension at entry ${index} in "${filePath}".`,
+                );
+            }
             this.entries.set(entry.chunk.id, {
                 chunk: entry.chunk,
                 vector: entry.vector,
             });
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/vectorStore.ts` around lines 149 - 165, Validate the loaded payload
before rebuilding this.entries: confirm data.entries is an array and iterate
each entry to ensure it has a chunk object with required fields (at least
chunk.id) and a numeric vector array whose length matches data.vectorDim (when
data.vectorDim is defined) and all elements are numbers; throw a clear Error
identifying the offending entry if any check fails; only after all entries pass
validation, clear this.entries and populate it using entry.chunk.id -> { chunk:
entry.chunk, vector: entry.vector } so malformed rows never get inserted
(referencing data.entries, data.vectorDim, entry.chunk, entry.vector, and
this.entries).
src/ai/__tests__/vectorStore.test.ts (1)

188-199: ⚠️ Potential issue | 🟡 Minor

Assert the no-op, not just the final contents.

Line 191 says the second save() should be a no-op, but this test still passes if save() rewrites the same JSON. Capture file metadata before/after the second call, or spy on the write path, so the test proves the optimization instead of only re-reading the final file contents.

Proposed fix
-import { mkdtemp, rm, readFile, writeFile } from "node:fs/promises";
+import { mkdtemp, rm, readFile, writeFile, stat } from "node:fs/promises";
...
         it("skips write when nothing changed", async () => {
             store.add([makeChunk("n1", 0, [1, 0, 0])]);
             await store.save();
+            const before = await stat(join(tempDir, "test-index.json"));
             await store.save(); // second save should be no-op
+            const after = await stat(join(tempDir, "test-index.json"));

             const content = await readFile(
                 join(tempDir, "test-index.json"),
                 "utf-8",
             );
             const data = JSON.parse(content);
+            expect(after.mtimeMs).toBe(before.mtimeMs);
             expect(data.version).toBe(1);
             expect(data.entries).toHaveLength(1);
         });

As per the coding guideline for **/*.test.{ts,tsx,js,jsx}: tests must not be tautological.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/__tests__/vectorStore.test.ts` around lines 188 - 199, The test
currently only verifies final file contents but must assert the second
store.save() is a true no-op; modify the "skips write when nothing changed" test
to capture file metadata or spy the write path before the second save (e.g.,
stat the file mtime or spy on fs.writeFile/Store's internal write method used by
save()), call await store.save() the second time, then assert that no write
occurred (mtime unchanged or spy not called) in addition to the existing content
checks; locate the logic around store.save() and the file read (readFile/join
usage) to add the stat/spy and the assertion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f46da6c2-eb49-4565-8c70-ea5406ae89a6

📥 Commits

Reviewing files that changed from the base of the PR and between dbcab75 and 021baf9.

📒 Files selected for processing (2)
  • src/ai/__tests__/vectorStore.test.ts
  • src/ai/vectorStore.ts

Comment on lines +28 to +33
add(embeddedChunks: EmbeddedChunk[]): void {
    for (const ec of embeddedChunks) {
        this.entries.set(ec.chunk.id, ec);
    }
    this.dirty = true;
}

⚠️ Potential issue | 🟠 Major

Reject mixed-dimension vectors on insert.

search() and findRelated() assume every stored embedding has the same length. Right now one bad chunk can be added successfully, persisted, and then cause every later query to throw Vector dimension mismatch when iteration reaches it. Enforce a single vector dimension before mutating this.entries.

Proposed fix
     /** Add chunks. Overwrites existing entries with the same ID. */
     add(embeddedChunks: EmbeddedChunk[]): void {
+        const expectedDim =
+            this.entries.values().next().value?.vector.length ??
+            embeddedChunks[0]?.vector.length;
+
         for (const ec of embeddedChunks) {
+            if (expectedDim !== undefined && ec.vector.length !== expectedDim) {
+                throw new Error(
+                    `VectorStore: expected ${expectedDim}-dim vectors, got ` +
+                    `${ec.vector.length} for chunk "${ec.chunk.id}".`,
+                );
+            }
             this.entries.set(ec.chunk.id, ec);
         }
         this.dirty = true;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/vectorStore.ts` around lines 28 - 33, The add(embeddedChunks:
EmbeddedChunk[]) method currently mutates this.entries before validating
embedding sizes, allowing mixed-dimension vectors to be stored; update add to
first determine the expected vector length (use an existing this.dimension
property or set it from the first inserted ec.embedding.length if missing) and
validate every ec.embedding.length against that expected dimension, throwing an
error and refusing to mutate this.entries if any mismatch is found; ensure you
set this.dimension on the first successful insert and only mark this.dirty =
true after all entries have been validated and inserted.

@sharma-sugurthi

@dorodb-web22 really solid first implementation: BM25 + RRF + local embeddings in pure TypeScript with 77 tests is impressive work.

A few things I noticed:

Blocking concerns:

  • BM25 state is lost on restart (hybridSearch.ts initialize()). After app restart, keywordEngine comes back empty while semanticSearch restores from disk, so hybrid search silently falls back to semantic-only. Needs either BM25 persistence or an index rebuild from persisted chunk data on startup.

  • noteId filter is applied after topK truncation in keywordEngine.search(). If other notes dominate the top-K BM25 results, the per-note filter returns nothing even when matches exist. The filter needs to happen before the slice.

  • vector dimension not validated in VectorStore.add(). A mismatched vector silently enters the store and poisons all future queries. Worth a cheap dimension check on insert.

  • Minor: tests in hybridSearch.test.ts share a fixed /tmp/hybrid-test path, which makes the suite non-hermetic across runs. A per-test temp dir (via mkdtemp) would fix it.

The per-PR fixes you have been pushing (chunker reuse, error descriptiveness, fusionScore) are exactly the right things to do. The three blocking items above are the main ones I would want to see resolved before merge.

Happy to discuss any of them.
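For context on the fusion step this thread keeps referring to, Reciprocal Rank Fusion with the conventional k = 60 can be sketched in a few lines; this is the textbook formulation, not necessarily the PR's exact implementation:

```typescript
// RRF: each ranked list contributes 1 / (k + rank) per document
// (rank is 1-based), and the summed scores decide the fused order.
// k = 60 is the constant conventionally used with RRF.
function reciprocalRankFusion(
    rankedLists: string[][], // e.g. [semanticChunkIds, keywordChunkIds]
    k = 60,
): string[] {
    const scores = new Map<string, number>();
    for (const list of rankedLists) {
        list.forEach((id, i) => {
            scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
        });
    }
    return [...scores.entries()]
        .sort((a, b) => b[1] - a[1])
        .map(([id]) => id);
}
```

Because every contribution is 1/(k + rank), a document ranked well in either list floats up without any hand-tuned weights, which is the "weightless ranking" the PR description mentions.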
