Conversation
Walkthrough

Adds a TypeScript AI search stack: embedding service, in-memory persisted vector store, semantic and BM25 keyword search engines, a hybrid search coordinator with reciprocal rank fusion, shared types and barrel exports, project configs (TS, build, test), and extensive unit tests.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant Hybrid as HybridSearchService
    participant Semantic as SemanticSearchService
    participant Keyword as KeywordSearchEngine
    participant Embedding as EmbeddingService
    participant Store as VectorStore
    participant RRF as RRF
    User->>Hybrid: search(query, options)
    par semantic path
        Hybrid->>Semantic: search(query, options)
        Semantic->>Embedding: embed(query)
        Embedding-->>Semantic: queryVector
        Semantic->>Store: search(queryVector, topK, minScore)
        Store-->>Semantic: semanticResults
        Semantic-->>Hybrid: semanticResults
    and keyword path
        Hybrid->>Keyword: search(query, topK)
        Keyword-->>Hybrid: keywordResults
    end
    Hybrid->>RRF: reciprocalRankFusion([semanticResults, keywordResults], k)
    RRF-->>Hybrid: fusedResults
    Hybrid-->>User: topK fusedResults
```
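To make the RRF step in the diagram concrete, here is a minimal sketch of reciprocal rank fusion. The `RankedItem`/`FusedItem` types and the separate `fusionScore` field are illustrative assumptions, not the PR's actual `SearchResult` shape:

```typescript
// Minimal reciprocal rank fusion sketch. Types are illustrative;
// the PR's real results carry a TextChunk rather than a bare id.
interface RankedItem {
  id: string;
  score: number; // original engine score (cosine or BM25)
}

interface FusedItem extends RankedItem {
  fusionScore: number; // RRF metric, kept separate from the raw score
}

function reciprocalRankFusion(
  rankings: RankedItem[][],
  k = 60, // conventional RRF damping constant
): FusedItem[] {
  const fused = new Map<string, FusedItem>();
  for (const ranking of rankings) {
    ranking.forEach((item, rank) => {
      const contribution = 1 / (k + rank + 1);
      const prev = fused.get(item.id);
      if (prev) {
        prev.fusionScore += contribution; // item appears in multiple lists
      } else {
        fused.set(item.id, { ...item, fusionScore: contribution });
      }
    });
  }
  // Rank by the fusion metric while preserving each item's original score.
  return [...fused.values()].sort((a, b) => b.fusionScore - a.fusionScore);
}
```

An item that appears in both rankings accumulates two contributions, so it outranks items seen by only one engine even when its raw scores are middling.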
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 10
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai/__tests__/hybridSearch.test.ts`:
- Around line 42-51: The test helper buildService creates a VectorStore without
a modelId, so index compatibility won't be checked; update buildService to pass
a modelId (e.g., use the EmbeddingService's modelId) into the VectorStore config
when constructing it: after creating embeddingService (EmbeddingService),
include modelId: embeddingService.modelId in the new VectorStore({...}) options
so VectorStore performs the same model compatibility/invalidation behavior as in
production before returning the HybridSearchService.
In `@src/ai/__tests__/vectorStore.test.ts`:
- Around line 183-195: The test "skips write when nothing changed" currently
only checks file contents and can pass even if the file was rewritten; change it
to assert that the file was not rewritten by importing stat from
node:fs/promises and using stat(join(tempDir, "test-index.json")) to capture
file metadata (e.g., mtimeMs or size) before the second await store.save(), call
store.save() again, then stat the file again and expect the metadata to be
unchanged; alternatively, spy on the write path used by the VectorStore
persistence (the function that writes the file) to assert it was not called on
the second save. Also add the stat import to the top-level imports.
In `@src/ai/embeddings.ts`:
- Around line 88-91: The embedSingle method currently returns results[0] without
verifying the array is non-empty; update embedSingle to defensively validate the
output of embed (call to this.embed([text])) — check that results is an array
with at least one element and that results[0] is defined, and if not throw a
clear error (or handle per project error conventions) so callers won't receive
undefined; modify the embedSingle function to perform this guard and surface a
descriptive error referencing embedSingle/embed.
In `@src/ai/hybridSearch.ts`:
- Around line 7-29: reciprocalRankFusion is currently writing RRF values into
SearchResult.score which breaks the documented semantic (cosine) score; update
the implementation to preserve the original cosine similarity and expose the
fusion ranking in a separate field or type (e.g., create a new
HybridSearchResult type with fields { chunk: TextChunk; score: number;
fusionScore: number } or add fusionScore to the returned objects), set
fusionScore to the computed rrfScore while keeping score as the original
item.score, and do the same change in HybridSearchService.search (the block
referenced by lines 65-94) so callers receive the original similarity in score
and the fusion metric in fusionScore or the new hybrid result type.
- Around line 37-56: Currently indexNote calls semanticSearch.indexNote(noteId,
content) which re-chunks internally while the hybrid layer separately calls
this.chunkFn(noteId, content) for keyword indexing; instead generate chunks once
(const chunks = this.chunkFn(noteId, content)) and pass that same chunks array
to both engines (e.g., change semanticSearch.indexNote to accept chunks or add a
new method like semanticSearch.indexChunks(noteId, chunks)) then call
semanticSearch.indexChunks(noteId, chunks) and keywordEngine.indexChunks(chunks)
after removing by noteId so both engines index identical chunk objects.
In `@src/ai/semanticSearch.ts`:
- Around line 42-46: The initialize() sequence can leave embeddingService
initialized if vectorStore.load() throws; wrap the vectorStore.load() call in a
try/catch and on failure perform a deterministic rollback of the embedding
service (call the appropriate cleanup/shutdown/dispose method on
embeddingService, e.g., embeddingService.shutdown() or
embeddingService.dispose()) before rethrowing the error so state remains
consistent; keep the calls inside the existing initialize() method and preserve
the original error propagation.
- Around line 84-108: The hardcoded minScore 0.4 inside findRelatedNotes (used
in vectorStore.findRelated) should be extracted and documented: add a
configurable parameter or class-level constant (e.g., a new optional parameter
minScore: number = 0.4 on findRelatedNotes or a private readonly
DEFAULT_MIN_SCORE = 0.4 on the containing class) and replace the literal 0.4
with that identifier; update the method signature and call sites accordingly and
add a short comment explaining the threshold purpose and default.
In `@src/ai/vectorStore.ts`:
- Around line 123-124: Wrap the JSON.parse call that converts the file content
into a SerializedVectorStore in a try/catch to handle SyntaxError from
malformed/corrupted index files: around the lines using readFile(filePath,
"utf-8") and JSON.parse(raw) catch parsing errors, log or include the original
error details and throw a new, user-friendly error indicating the index file at
filePath is invalid and should be rebuilt (preserve the original error as cause
or append it to the message) so callers of the vector store load routine can
surface a clear rebuild instruction.
- Around line 151-157: The loaded index JSON needs structural validation before
populating this.entries: verify that the top-level data is an object and
data.entries is an array, then iterate and check each entry has entry.chunk &&
entry.chunk.id (string/number) and entry.vector (array/typed values) before
calling this.entries.set; for any invalid/missing fields either skip the entry
with a descriptive processLogger.warn or throw a clear Error that includes which
entry/index failed, so the loader (the method that currently clears this.entries
and loops over data.entries) fails with an actionable message instead of a
cryptic runtime error.
In `@vitest.config.ts`:
- Around line 3-14: Add explicit Vitest/Vite ESM handling for the
`@huggingface/transformers` package so tests that exercise real initialize() paths
(e.g., in embeddings.ts where initialize() may dynamic-import the library or
when embedFn is not provided) don't break under Node ESM resolution; update the
vitest.config.ts test config to include the package in server.deps.inline or
ssr.noExternal (or both) to force inlining/noExternal for
"@huggingface/transformers" so the module resolves correctly during tests and
runtime.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 399f4a0c-1dda-4f24-aa00-d754cf51318e
⛔ Files ignored due to path filters (1)
`package-lock.json` is excluded by `!**/package-lock.json`
📒 Files selected for processing (17)
.gitignore, package.json, src/ai/__tests__/embeddings.test.ts, src/ai/__tests__/hybridSearch.test.ts, src/ai/__tests__/keywordSearch.test.ts, src/ai/__tests__/semanticSearch.test.ts, src/ai/__tests__/vectorStore.test.ts, src/ai/embeddings.ts, src/ai/hybridSearch.ts, src/ai/index.ts, src/ai/keywordSearch.ts, src/ai/semanticSearch.ts, src/ai/types.ts, src/ai/vectorStore.ts, tsconfig.build.json, tsconfig.json, vitest.config.ts
```typescript
function buildService(embedFn?: EmbedFn) {
  const embeddingService = new EmbeddingService({}, embedFn ?? createMockEmbedFn());
  const vectorStore = new VectorStore({
    persistDir: "/tmp/hybrid-test",
    indexFilename: "test-index.json",
  });
  const semantic = new SemanticSearchService(embeddingService, vectorStore);
  const keyword = new KeywordSearchEngine();
  return new HybridSearchService(semantic, keyword);
}
```
🧹 Nitpick | 🔵 Trivial
Test buildService omits modelId from VectorStore config.
The VectorStore is created without modelId, which means it won't perform model compatibility checks during load. While acceptable for isolated tests, this deviates from production behavior where modelId enables index invalidation on model changes.
Consider adding modelId to align test configuration with production:
```diff
 const vectorStore = new VectorStore({
   persistDir: "/tmp/hybrid-test",
   indexFilename: "test-index.json",
+  modelId: "mock-model",
 });
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```typescript
function buildService(embedFn?: EmbedFn) {
  const embeddingService = new EmbeddingService({}, embedFn ?? createMockEmbedFn());
  const vectorStore = new VectorStore({
    persistDir: "/tmp/hybrid-test",
    indexFilename: "test-index.json",
    modelId: "mock-model",
  });
  const semantic = new SemanticSearchService(embeddingService, vectorStore);
  const keyword = new KeywordSearchEngine();
  return new HybridSearchService(semantic, keyword);
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/__tests__/hybridSearch.test.ts` around lines 42 - 51, The test helper
buildService creates a VectorStore without a modelId, so index compatibility
won't be checked; update buildService to pass a modelId (e.g., use the
EmbeddingService's modelId) into the VectorStore config when constructing it:
after creating embeddingService (EmbeddingService), include modelId:
embeddingService.modelId in the new VectorStore({...}) options so VectorStore
performs the same model compatibility/invalidation behavior as in production
before returning the HybridSearchService.
```typescript
it("skips write when nothing changed", async () => {
  store.add([makeChunk("n1", 0, [1, 0, 0])]);
  await store.save();
  await store.save(); // second save should be no-op

  const content = await readFile(
    join(tempDir, "test-index.json"),
    "utf-8",
  );
  const data = JSON.parse(content);
  expect(data.version).toBe(1);
  expect(data.entries).toHaveLength(1);
});
```
This test does not prove the second save() was a no-op.
It still passes if the implementation rewrites the same JSON on the second call. Assert file metadata stability or spy on the write path so the test actually covers the optimization named here. Also add stat to the node:fs/promises import at Line 2.
Proposed fix

```diff
 it("skips write when nothing changed", async () => {
   store.add([makeChunk("n1", 0, [1, 0, 0])]);
   await store.save();
+  const before = await stat(join(tempDir, "test-index.json"));
   await store.save(); // second save should be no-op
+  const after = await stat(join(tempDir, "test-index.json"));
   const content = await readFile(
     join(tempDir, "test-index.json"),
     "utf-8",
   );
   const data = JSON.parse(content);
+  expect(after.mtimeMs).toBe(before.mtimeMs);
   expect(data.version).toBe(1);
   expect(data.entries).toHaveLength(1);
 });
```

As per coding guidelines, `**/*.test.{ts,tsx,js,jsx}`: The tests are not tautological.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/__tests__/vectorStore.test.ts` around lines 183 - 195, The test "skips
write when nothing changed" currently only checks file contents and can pass
even if the file was rewritten; change it to assert that the file was not
rewritten by importing stat from node:fs/promises and using stat(join(tempDir,
"test-index.json")) to capture file metadata (e.g., mtimeMs or size) before the
second await store.save(), call store.save() again, then stat the file again and
expect the metadata to be unchanged; alternatively, spy on the write path used
by the VectorStore persistence (the function that writes the file) to assert it
was not called on the second save. Also add the stat import to the top-level
imports.
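For reference, the optimization this test targets can be sketched with a serialized-content fingerprint. `PersistedIndex`, its `writes` counter, and the fingerprint approach are hypothetical, not the PR's actual `VectorStore` internals:

```typescript
// Sketch of a save() that skips the disk write when the serialized
// content is unchanged. Names and mechanism are illustrative only.
import { writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

class PersistedIndex {
  private entries: Record<string, number[]> = {};
  private lastSaved: string | null = null;
  public writes = 0; // instrumentation so a test can assert no-op saves

  constructor(private filePath: string = join(tmpdir(), "sketch-index.json")) {}

  add(id: string, vector: number[]): void {
    this.entries[id] = vector;
  }

  save(): void {
    const serialized = JSON.stringify({ version: 1, entries: this.entries });
    if (serialized === this.lastSaved) return; // nothing changed: skip write
    writeFileSync(this.filePath, serialized, "utf-8");
    this.lastSaved = serialized;
    this.writes++;
  }
}
```

With an implementation like this, asserting on `mtimeMs` stability (or spying on the write path) is what actually distinguishes "skipped the write" from "rewrote identical bytes".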
```typescript
async embedSingle(text: string): Promise<number[]> {
  const results = await this.embed([text]);
  return results[0];
}
```
🧹 Nitpick | 🔵 Trivial
embedSingle assumes non-empty result from embed.
Line 90 returns results[0] without checking if results is non-empty. While embed guards against empty input (line 77), if called with a single text, results will have one element. However, a defensive check could prevent undefined behavior if internal invariants change.
🛡️ Optional defensive check

```diff
 async embedSingle(text: string): Promise<number[]> {
   const results = await this.embed([text]);
+  if (results.length === 0) {
+    throw new Error("Embedding returned no results");
+  }
   return results[0];
 }
```

📝 Committable suggestion

```typescript
async embedSingle(text: string): Promise<number[]> {
  const results = await this.embed([text]);
  if (results.length === 0) {
    throw new Error("Embedding returned no results");
  }
  return results[0];
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/embeddings.ts` around lines 88 - 91, The embedSingle method currently
returns results[0] without verifying the array is non-empty; update embedSingle
to defensively validate the output of embed (call to this.embed([text])) — check
that results is an array with at least one element and that results[0] is
defined, and if not throw a clear error (or handle per project error
conventions) so callers won't receive undefined; modify the embedSingle function
to perform this guard and surface a descriptive error referencing
embedSingle/embed.
```typescript
constructor(
  semanticSearch: SemanticSearchService,
  keywordEngine: KeywordSearchEngine,
  chunkFn: ChunkFn = defaultChunker,
) {
  this.semanticSearch = semanticSearch;
  this.keywordEngine = keywordEngine;
  this.chunkFn = chunkFn;
}

async initialize(): Promise<void> {
  await this.semanticSearch.initialize();
}

async indexNote(noteId: string, content: string): Promise<number> {
  const chunkCount = await this.semanticSearch.indexNote(noteId, content);
  const chunks = this.chunkFn(noteId, content);
  this.keywordEngine.removeByNoteId(noteId);
  this.keywordEngine.indexChunks(chunks);
  return chunkCount;
}
```
Index both engines from the same chunk list.
semanticSearch.indexNote() chunks the note internally, then the hybrid layer re-chunks the raw content for BM25. If those chunkers ever diverge, the same noteId:index can refer to different text in each engine and RRF will fuse unrelated chunks. Generate chunks once and feed the exact same list to both indexes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/hybridSearch.ts` around lines 37 - 56, Currently indexNote calls
semanticSearch.indexNote(noteId, content) which re-chunks internally while the
hybrid layer separately calls this.chunkFn(noteId, content) for keyword
indexing; instead generate chunks once (const chunks = this.chunkFn(noteId,
content)) and pass that same chunks array to both engines (e.g., change
semanticSearch.indexNote to accept chunks or add a new method like
semanticSearch.indexChunks(noteId, chunks)) then call
semanticSearch.indexChunks(noteId, chunks) and keywordEngine.indexChunks(chunks)
after removing by noteId so both engines index identical chunk objects.
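The chunk-once approach the prompt describes can be sketched as follows; the `TextChunk`, `ChunkIndexer`, and `indexChunks` interfaces are illustrative stand-ins for the PR's types, not its real API:

```typescript
// Sketch: chunk once, feed the identical array to both engines so the
// same chunk id always refers to the same text in semantic and BM25.
interface TextChunk {
  id: string;
  noteId: string;
  text: string;
}

interface ChunkIndexer {
  removeByNoteId(noteId: string): void;
  indexChunks(chunks: TextChunk[]): void | Promise<void>;
}

type ChunkFn = (noteId: string, content: string) => TextChunk[];

async function indexNote(
  noteId: string,
  content: string,
  chunkFn: ChunkFn,
  semantic: ChunkIndexer,
  keyword: ChunkIndexer,
): Promise<number> {
  const chunks = chunkFn(noteId, content); // single source of truth
  semantic.removeByNoteId(noteId);
  keyword.removeByNoteId(noteId);
  await semantic.indexChunks(chunks); // same objects, same ids ...
  keyword.indexChunks(chunks);        // ... so RRF fuses matching text
  return chunks.length;
}
```

Because both engines receive the same array, a later change to the chunker cannot desynchronize the two indexes.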
```typescript
/** Load embedding model + persisted index. Call before anything else. */
async initialize(): Promise<void> {
  await this.embeddingService.initialize();
  await this.vectorStore.load();
}
```
Potential inconsistent state if vectorStore.load() throws.
If vectorStore.load() throws (e.g., on model mismatch per VectorStore.load() at lines 145-148), embeddingService remains initialized but the vector store is not loaded. This creates asymmetric state and the error isn't caught for recovery.
Consider wrapping the initialization sequence or documenting the expected caller behavior:
🛡️ Proposed fix to handle vectorStore.load() errors

```diff
 async initialize(): Promise<void> {
   await this.embeddingService.initialize();
-  await this.vectorStore.load();
+  try {
+    await this.vectorStore.load();
+  } catch (error) {
+    // Log and rethrow - caller should handle index rebuild
+    console.error("VectorStore load failed:", error);
+    throw error;
+  }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/semanticSearch.ts` around lines 42 - 46, The initialize() sequence can
leave embeddingService initialized if vectorStore.load() throws; wrap the
vectorStore.load() call in a try/catch and on failure perform a deterministic
rollback of the embedding service (call the appropriate cleanup/shutdown/dispose
method on embeddingService, e.g., embeddingService.shutdown() or
embeddingService.dispose()) before rethrowing the error so state remains
consistent; keep the calls inside the existing initialize() method and preserve
the original error propagation.
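The rollback variant the prompt asks for might look like the sketch below. It assumes the embedding service exposes a `dispose()` cleanup hook, which is an assumption; the PR's class may not have one yet:

```typescript
// Sketch of initialize() with deterministic rollback on load failure.
// The dispose() hook is hypothetical, not the PR's confirmed API.
interface Initializable {
  initialize(): Promise<void>;
  dispose(): Promise<void>; // hypothetical cleanup hook
}

interface Loadable {
  load(): Promise<void>;
}

async function initialize(
  embeddingService: Initializable,
  vectorStore: Loadable,
): Promise<void> {
  await embeddingService.initialize();
  try {
    await vectorStore.load();
  } catch (error) {
    // Roll the embedding service back so no half-initialized state leaks.
    await embeddingService.dispose();
    throw error; // preserve original error propagation
  }
}
```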
```typescript
/** Find notes semantically related to a given note, deduplicated by noteId. */
async findRelatedNotes(
  noteId: string,
  topK: number = 5,
): Promise<SearchResult[]> {
  // Map keeps the highest-scoring chunk per note — the old Set approach
  // kept the first hit, which could discard a better score from a later
  // source chunk iteration.
  const bestByNoteId = new Map<string, SearchResult>();

  for (const entry of this.getAllChunksForNote(noteId)) {
    const related = this.vectorStore.findRelated(entry.id, topK * 2, 0.4);
    for (const result of related) {
      if (result.chunk.noteId === noteId) continue;
      const existing = bestByNoteId.get(result.chunk.noteId);
      if (!existing || result.score > existing.score) {
        bestByNoteId.set(result.chunk.noteId, result);
      }
    }
  }

  return Array.from(bestByNoteId.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```
🧹 Nitpick | 🔵 Trivial
Magic number for minScore in findRelatedNotes.
The hardcoded 0.4 minScore (line 95) lacks documentation explaining why this threshold was chosen. Consider extracting it as a configurable parameter or adding a comment.
♻️ Suggested refactor

```diff
 async findRelatedNotes(
   noteId: string,
   topK: number = 5,
+  minScore: number = 0.4,
 ): Promise<SearchResult[]> {
   // ...
   for (const entry of this.getAllChunksForNote(noteId)) {
-    const related = this.vectorStore.findRelated(entry.id, topK * 2, 0.4);
+    const related = this.vectorStore.findRelated(entry.id, topK * 2, minScore);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/semanticSearch.ts` around lines 84 - 108, The hardcoded minScore 0.4
inside findRelatedNotes (used in vectorStore.findRelated) should be extracted
and documented: add a configurable parameter or class-level constant (e.g., a
new optional parameter minScore: number = 0.4 on findRelatedNotes or a private
readonly DEFAULT_MIN_SCORE = 0.4 on the containing class) and replace the
literal 0.4 with that identifier; update the method signature and call sites
accordingly and add a short comment explaining the threshold purpose and
default.
```typescript
this.entries.clear();
for (const entry of data.entries) {
  this.entries.set(entry.chunk.id, {
    chunk: entry.chunk,
    vector: entry.vector,
  });
}
```
🧹 Nitpick | 🔵 Trivial
Consider adding structural validation for loaded data.
If the index file contains valid JSON but an unexpected structure (e.g., entries is not an array, or entries lack required chunk.id properties), the code will fail with a cryptic error. A brief structural check would improve debuggability.
♻️ Optional defensive validation

```diff
+if (!Array.isArray(data.entries)) {
+  throw new Error(
+    `VectorStore: Invalid index structure. Delete "${filePath}" to rebuild.`
+  );
+}
+
 this.entries.clear();
 for (const entry of data.entries) {
   this.entries.set(entry.chunk.id, {
```

📝 Committable suggestion

```typescript
if (!Array.isArray(data.entries)) {
  throw new Error(
    `VectorStore: Invalid index structure. Delete "${filePath}" to rebuild.`
  );
}

this.entries.clear();
for (const entry of data.entries) {
  this.entries.set(entry.chunk.id, {
    chunk: entry.chunk,
    vector: entry.vector,
  });
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/vectorStore.ts` around lines 151 - 157, The loaded index JSON needs
structural validation before populating this.entries: verify that the top-level
data is an object and data.entries is an array, then iterate and check each
entry has entry.chunk && entry.chunk.id (string/number) and entry.vector
(array/typed values) before calling this.entries.set; for any invalid/missing
fields either skip the entry with a descriptive processLogger.warn or throw a
clear Error that includes which entry/index failed, so the loader (the method
that currently clears this.entries and loops over data.entries) fails with an
actionable message instead of a cryptic runtime error.
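Combining the JSON.parse guard and the structural checks from the two comments above, a defensive loader might look like this sketch; the `StoredEntry` shape and the `parseIndex` helper are hypothetical, not the PR's actual code:

```typescript
// Sketch of defensive index loading: guard JSON.parse, validate the
// top-level structure, then check each entry before populating the map.
interface StoredEntry {
  chunk: { id: string; noteId: string; text: string };
  vector: number[];
}

function parseIndex(raw: string, filePath: string): Map<string, StoredEntry> {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (cause) {
    // Surface a rebuild instruction instead of a bare SyntaxError.
    throw new Error(
      `VectorStore: index file "${filePath}" is not valid JSON (${String(cause)}); delete it to rebuild.`,
    );
  }
  const entries = (data as { entries?: unknown } | null)?.entries;
  if (!Array.isArray(entries)) {
    throw new Error(
      `VectorStore: invalid structure in "${filePath}"; delete it to rebuild.`,
    );
  }
  const map = new Map<string, StoredEntry>();
  entries.forEach((entry, i) => {
    const e = entry as StoredEntry;
    if (typeof e?.chunk?.id !== "string" || !Array.isArray(e?.vector)) {
      // Name the failing entry so the error is actionable.
      throw new Error(`VectorStore: entry ${i} in "${filePath}" is malformed.`);
    }
    map.set(e.chunk.id, e);
  });
  return map;
}
```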
```typescript
export default defineConfig({
  test: {
    globals: true,
    environment: "node",
    include: ["src/**/__tests__/**/*.test.ts"],
    coverage: {
      provider: "v8",
      include: ["src/ai/**/*.ts"],
      exclude: ["src/ai/__tests__/**", "src/ai/index.ts"],
    },
  },
});
```
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🏁 Scripts executed (repository: AOSSIE-Org/SmartNotes):

```shell
# Check package.json for @huggingface/transformers dependency
grep -i "huggingface\|transformers" package.json
```

Length of output: 106

```shell
# Search for any imports of @huggingface/transformers in the codebase
rg "@huggingface/transformers|from.*transformers" --type ts --type tsx --type js --type json
```

Length of output: 92

```shell
# Read the full vitest.config.ts to see current configuration
cat -n vitest.config.ts
```

Length of output: 531

```shell
# Check if there are any embeddings-related files or tests
fd -e ts -e tsx | grep -i embed
```

Length of output: 121

```shell
# Check for any existing ESM/SSR configuration in vitest config
rg "ssr|inline|external|deps" vitest.config.ts
```

Length of output: 47

```shell
# Check embeddings.ts imports
cat -n src/ai/embeddings.ts
```

Length of output: 4511

```shell
# Check embeddings.test.ts content
cat -n src/ai/__tests__/embeddings.test.ts
```

Length of output: 6516

```shell
# Search for all references to @huggingface in the codebase
rg "@huggingface" --type ts --type js
```

Length of output: 249
Consider adding explicit Vitest configuration for @huggingface/transformers ESM handling.
While @huggingface/transformers (^3.4.0) is a declared dependency and embeddings.ts uses dynamic import with lazy loading, the current Vitest config lacks ESM/external package handling. The test suite currently avoids triggering the real import by injecting mock implementations, so tests pass. However, if code paths exercise the actual initialize() method without a custom embedFn, the Node ESM environment may require explicit server.deps.inline or ssr.noExternal configuration to prevent module resolution issues. This is a precautionary measure to ensure compatibility when real-world code paths are exercised.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@vitest.config.ts` around lines 3 - 14, Add explicit Vitest/Vite ESM handling
for the `@huggingface/transformers` package so tests that exercise real
initialize() paths (e.g., in embeddings.ts where initialize() may dynamic-import
the library or when embedFn is not provided) don't break under Node ESM
resolution; update the vitest.config.ts test config to include the package in
server.deps.inline or ssr.noExternal (or both) to force inlining/noExternal for
"@huggingface/transformers" so the module resolves correctly during tests and
runtime.
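A config along the lines the prompt suggests might look like this sketch. `test.server.deps.inline` and `ssr.noExternal` are real Vitest/Vite option names, but verify them against the Vitest version the project pins, and keep the existing coverage block when merging:

```typescript
// vitest.config.ts — sketch of the ESM handling only; verify option
// names against the project's Vitest version before adopting.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    globals: true,
    environment: "node",
    include: ["src/**/__tests__/**/*.test.ts"],
    server: {
      deps: {
        // Force Vitest to process @huggingface/transformers itself
        // instead of externalizing it to native Node ESM resolution.
        inline: ["@huggingface/transformers"],
      },
    },
  },
  ssr: {
    // Equivalent knob on the Vite side for SSR transforms.
    noExternal: ["@huggingface/transformers"],
  },
});
```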
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/ai/hybridSearch.ts (1)
Lines 55-60: ⚠️ Potential issue | 🟠 Major

Index both engines from the same chunk array.

`SemanticSearchService.indexNote()` already re-chunks internally (src/ai/semanticSearch.ts:42-62), and this method chunks the raw content again for BM25. If those chunkers ever diverge, the same chunk ID can point to different text in each engine and RRF will fuse unrelated hits. Generate the chunks once here and feed that exact array to both indexes.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/ai/hybridSearch.ts` around lines 55 - 60: the code currently chunks content twice, causing mismatched chunk IDs; call this.chunkFn(noteId, content) once to produce a single chunks array and feed that same array to both engines. Update the call site in hybridSearch.indexNote to use the single chunks array and change SemanticSearchService.indexNote (or add a new semanticSearch.indexChunks) to accept the precomputed chunks array instead of raw content so it does not re-chunk internally; then pass chunks to keywordEngine.indexChunks(chunks) and to the semantic search method, and remove the internal re-chunking in SemanticSearchService.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai/__tests__/hybridSearch.test.ts`:
- Around line 42-47: buildService currently hardcodes VectorStore to
"/tmp/hybrid-test", making tests non-hermetic; change buildService to create a
unique temp directory per invocation (e.g., using fs.mkdtempSync or a UUID
appended to os.tmpdir()) and pass that path into the new VectorStore({
persistDir, indexFilename }) so each test gets its own directory, and ensure the
test suite's beforeEach()/afterEach() (or initialize()/teardown) removes or
cleans that temp dir after the test to avoid cross-test contamination.
In `@src/ai/hybridSearch.ts`:
- Around line 51-53: The initialize() method restores only the semantic store
(semanticSearch) but does not rebuild or restore the BM25 keyword index
(keywordEngine), so keywordEngine is empty after restart; update initialize() to
load persisted chunks/documents (the same source semanticSearch uses) and
rebuild or re-index keywordEngine from those persisted chunks, and update save()
to persist either the keywordEngine state or the chunks/documents used to
rebuild it (so keywordEngine can be reconstructed on startup); reference the
initialize() and save() methods, keywordEngine, semanticSearch and the persisted
chunks/documents used by the vector store when implementing this.
- Around line 76-87: The keyword results are being truncated by
KeywordSearchEngine.search before you apply the noteId filter, causing
note-scoped searches to miss matches; fix by applying noteId filtering inside
the keyword search path: either extend KeywordSearchEngine.search to accept an
optional noteId (and filter before slice) and call keywordEngine.search(query,
broadK, opts.noteId) here, or request the full/un-truncated set (e.g., topK =
Infinity or a large number) so you can filter by opts.noteId in this module
before slicing to broadK; update references to keywordEngine.search and ensure
filteredKeyword is produced from the pre-sliced results.
---
Duplicate comments:
In `@src/ai/hybridSearch.ts`:
- Around line 55-60: The code currently chunks content twice causing mismatched
chunk IDs; call this.chunkFn(noteId, content) once to produce a single chunks
array and feed that same array to both engines. Update the call site in
hybridSearch.indexNote to use the single chunks array and change
SemanticSearchService.indexNote (or add a new semanticSearch.indexChunks) to
accept the precomputed chunks array instead of raw content so it does not
re-chunk internally; then pass chunks to keywordEngine.indexChunks(chunks) and
to the semantic search method, and remove the internal re-chunking in
SemanticSearchService.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 05c80561-5fea-473f-a2bf-67aaf9d9ebff
📒 Files selected for processing (4)
src/ai/__tests__/hybridSearch.test.ts, src/ai/hybridSearch.ts, src/ai/index.ts, src/ai/types.ts
```typescript
function buildService(embedFn?: EmbedFn) {
  const embeddingService = new EmbeddingService({}, embedFn ?? createMockEmbedFn());
  const vectorStore = new VectorStore({
    persistDir: "/tmp/hybrid-test",
    indexFilename: "test-index.json",
  });
```
Use a unique temp directory per test helper.
buildService() always points VectorStore at /tmp/hybrid-test/test-index.json, and beforeEach() calls initialize(), which can load whatever was left there by another run. That makes this suite non-hermetic. Use a per-test temp dir or clean the path in setup/teardown.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/__tests__/hybridSearch.test.ts` around lines 42 - 47, buildService
currently hardcodes VectorStore to "/tmp/hybrid-test", making tests
non-hermetic; change buildService to create a unique temp directory per
invocation (e.g., using fs.mkdtempSync or a UUID appended to os.tmpdir()) and
pass that path into the new VectorStore({ persistDir, indexFilename }) so each
test gets its own directory, and ensure the test suite's
beforeEach()/afterEach() (or initialize()/teardown) removes or cleans that temp
dir after the test to avoid cross-test contamination.
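A minimal sketch of that setup, assuming a Node test runner; the `hybrid-test-` prefix is illustrative:

```typescript
// Hermetic per-test temp dirs: mkdtempSync appends six random characters,
// so every invocation yields a fresh directory under the OS temp root.
import { existsSync, mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function makeTempDir(): string {
  return mkdtempSync(join(tmpdir(), "hybrid-test-"));
}

function cleanupTempDir(dir: string): void {
  rmSync(dir, { recursive: true, force: true });
}
```

buildService() would then pass makeTempDir() as persistDir, and an afterEach hook would call cleanupTempDir(), so no run can read another run's index file.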
```ts
async initialize(): Promise<void> {
  await this.semanticSearch.initialize();
}
```
Persist or rebuild the BM25 index on startup.
initialize() only restores the semantic store, and save() only persists that same store. After a restart, keywordEngine comes back empty, so hybrid search silently degrades to semantic-only until every note is reindexed. Rebuild BM25 from persisted chunks during initialize(), or persist keyword state alongside the vector store.
Also applies to: 100-101
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/hybridSearch.ts` around lines 51 - 53, The initialize() method
restores only the semantic store (semanticSearch) but does not rebuild or
restore the BM25 keyword index (keywordEngine), so keywordEngine is empty after
restart; update initialize() to load persisted chunks/documents (the same source
semanticSearch uses) and rebuild or re-index keywordEngine from those persisted
chunks, and update save() to persist either the keywordEngine state or the
chunks/documents used to rebuild it (so keywordEngine can be reconstructed on
startup); reference the initialize() and save() methods, keywordEngine,
semanticSearch and the persisted chunks/documents used by the vector store when
implementing this.
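A sketch of the rebuild path under stated assumptions: `Chunk`, `indexChunks`, and the idea that the persisted chunks can be enumerated on startup are all hypothetical stand-ins for the real types in this PR.

```typescript
// Hypothetical sketch: on startup, reseed the BM25 engine from the same
// chunks the vector store persists, instead of leaving it empty.
interface Chunk {
  id: string;
  noteId: string;
  content: string;
}

class KeywordEngineStub {
  private docs = new Map<string, Chunk>();
  indexChunks(chunks: Chunk[]): void {
    for (const c of chunks) this.docs.set(c.id, c);
  }
  get size(): number {
    return this.docs.size;
  }
}

// Called from initialize() after the semantic store has loaded its index.
function rebuildKeywordIndex(
  persistedChunks: Chunk[],
  engine: KeywordEngineStub,
): void {
  engine.indexChunks(persistedChunks);
}
```

The advantage of rebuilding from the vector store's chunks (rather than serializing BM25 state separately) is that there is a single source of truth and no risk of the two indexes drifting apart.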
```ts
const [semanticResults, keywordResults] = await Promise.all([
  this.semanticSearch.search(query, {
    topK: broadK,
    minScore: 0,
    noteId: opts.noteId,
  }),
  Promise.resolve(this.keywordEngine.search(query, broadK)),
]);

const filteredKeyword = opts.noteId
  ? keywordResults.filter((r) => r.chunk.noteId === opts.noteId)
  : keywordResults;
```
Apply noteId filtering before keyword topK.
KeywordSearchEngine.search() truncates with slice(0, topK) before returning (src/ai/keywordSearch.ts:82-113). Filtering that reduced list afterward means note-scoped searches can miss valid keyword matches whenever other notes occupy the first broadK slots. Move noteId filtering into the keyword search path, or fetch all keyword matches before slicing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/hybridSearch.ts` around lines 76 - 87, The keyword results are being
truncated by KeywordSearchEngine.search before you apply the noteId filter,
causing note-scoped searches to miss matches; fix by applying noteId filtering
inside the keyword search path: either extend KeywordSearchEngine.search to
accept an optional noteId (and filter before slice) and call
keywordEngine.search(query, broadK, opts.noteId) here, or request the
full/un-truncated set (e.g., topK = Infinity or a large number) so you can
filter by opts.noteId in this module before slicing to broadK; update references
to keywordEngine.search and ensure filteredKeyword is produced from the
pre-sliced results.
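The second option from the comment can be sketched as a small helper; the `keep` predicate parameter is illustrative, not the engine's actual signature:

```typescript
// Sketch: apply any note-scope filter to the full ranked list, and only
// then truncate to topK, so scoped matches outside the first K survive.
function topKFiltered<T>(
  ranked: T[],
  topK: number,
  keep?: (item: T) => boolean,
): T[] {
  const filtered = keep ? ranked.filter(keep) : ranked;
  return filtered.slice(0, topK);
}
```

Slicing first and filtering second inverts these two steps, which is exactly the bug the review describes.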
♻️ Duplicate comments (2)
src/ai/hybridSearch.ts (2)
72-83: ⚠️ Potential issue | 🟠 Major
Filter keyword hits before truncating to topK.
`KeywordSearchEngine.search()` slices before returning, so filtering that reduced list at Line 81 can drop valid note-scoped matches whenever other notes occupy the first `broadK` BM25 slots.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai/hybridSearch.ts` around lines 72 - 83, Keyword search results are pre-sliced by KeywordSearchEngine.search, so filtering by opts.noteId after that (into filteredKeyword) can drop valid note-scoped hits; change the flow to fetch untruncated keyword hits (e.g., expose/use a searchAll / unsliced variant or call keywordEngine.search with a sufficiently large K) then apply the noteId filter (opts.noteId) to that full set and only after filtering truncate to broadK (used alongside semanticResults); update usages around keywordEngine.search, filteredKeyword, broadK, and opts.noteId accordingly.
47-49: ⚠️ Potential issue | 🟠 Major
Persist or rebuild the BM25 index on startup.
`initialize()` restores only the semantic store, and `save()` persists only that same state. After a restart, `keywordEngine` is empty, so hybrid search silently degrades to semantic-only until every note is reindexed.
Also applies to: 96-98
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai/hybridSearch.ts` around lines 47 - 49, initialize() currently only restores the semantic store so keywordEngine/BM25 is empty after restart; update initialize() (and the save() path referenced near lines 96-98) to persist and restore the BM25/keywordEngine state or trigger a rebuild: add serialization of the BM25 index when save() runs and load/deserializer logic in initialize() to repopulate keywordEngine, and if no serialized snapshot exists call the existing reindex routine (e.g., reindexAll or the method that iterates notes) to rebuild the BM25 index on startup; ensure you reference and update keywordEngine, initialize(), save(), and the BM25 serialization/rebuild helpers so hybridSearch is fully restored after restart.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/ai/hybridSearch.ts`:
- Around line 72-83: Keyword search results are pre-sliced by
KeywordSearchEngine.search, so filtering by opts.noteId after that (into
filteredKeyword) can drop valid note-scoped hits; change the flow to fetch
untruncated keyword hits (e.g., expose/use a searchAll / unsliced variant or
call keywordEngine.search with a sufficiently large K) then apply the noteId
filter (opts.noteId) to that full set and only after filtering truncate to
broadK (used alongside semanticResults); update usages around
keywordEngine.search, filteredKeyword, broadK, and opts.noteId accordingly.
- Around line 47-49: initialize() currently only restores the semantic store so
keywordEngine/BM25 is empty after restart; update initialize() (and the save()
path referenced near lines 96-98) to persist and restore the BM25/keywordEngine
state or trigger a rebuild: add serialization of the BM25 index when save() runs
and load/deserializer logic in initialize() to repopulate keywordEngine, and if
no serialized snapshot exists call the existing reindex routine (e.g.,
reindexAll or the method that iterates notes) to rebuild the BM25 index on
startup; ensure you reference and update keywordEngine, initialize(), save(),
and the BM25 serialization/rebuild helpers so hybridSearch is fully restored
after restart.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: bb239b31-74cd-431f-891e-2c3a8539b8d7
📒 Files selected for processing (2)
- src/ai/hybridSearch.ts
- src/ai/semanticSearch.ts
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/ai/vectorStore.ts (1)
149-165: ⚠️ Potential issue | 🟠 Major
Validate every loaded row before rebuilding the map.
This path only proves that the JSON parses and that the first entry matches `vectorDim`. Valid JSON with the wrong shape can still fail with opaque `TypeError`s at Line 149 or Line 161, and a later mixed-dimension row slips through until search time. Validate `entries` up front and check each row's `chunk` fields plus `vector` length before populating `this.entries`.
Proposed fix
```diff
+    if (!Array.isArray(data.entries)) {
+      throw new Error(
+        `VectorStore: invalid index structure in "${filePath}". ` +
+          `Delete the file to rebuild the index.`,
+      );
+    }
+
-    if (data.vectorDim !== undefined && data.entries.length > 0) {
-      const actualDim = data.entries[0].vector.length;
-      if (actualDim !== data.vectorDim) {
-        throw new Error(
-          `VectorStore: index vectorDim mismatch ` +
-            `(stored ${data.vectorDim}, found ${actualDim}). Index may be corrupted.`,
-        );
-      }
-    }
     this.entries.clear();
-    for (const entry of data.entries) {
+    for (const [index, entry] of data.entries.entries()) {
+      if (
+        !entry ||
+        !entry.chunk ||
+        typeof entry.chunk.id !== "string" ||
+        typeof entry.chunk.noteId !== "string" ||
+        typeof entry.chunk.content !== "string" ||
+        typeof entry.chunk.chunkIndex !== "number" ||
+        !Array.isArray(entry.vector)
+      ) {
+        throw new Error(
+          `VectorStore: invalid entry at index ${index} in "${filePath}". ` +
+            `Delete the file to rebuild the index.`,
+        );
+      }
+      if (
+        data.vectorDim !== undefined &&
+        entry.vector.length !== data.vectorDim
+      ) {
+        throw new Error(
+          `VectorStore: invalid vector dimension at entry ${index} in "${filePath}".`,
+        );
+      }
       this.entries.set(entry.chunk.id, {
         chunk: entry.chunk,
         vector: entry.vector,
       });
     }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai/vectorStore.ts` around lines 149 - 165, Validate the loaded payload before rebuilding this.entries: confirm data.entries is an array and iterate each entry to ensure it has a chunk object with required fields (at least chunk.id) and a numeric vector array whose length matches data.vectorDim (when data.vectorDim is defined) and all elements are numbers; throw a clear Error identifying the offending entry if any check fails; only after all entries pass validation, clear this.entries and populate it using entry.chunk.id -> { chunk: entry.chunk, vector: entry.vector } so malformed rows never get inserted (referencing data.entries, data.vectorDim, entry.chunk, entry.vector, and this.entries).

src/ai/__tests__/vectorStore.test.ts (1)

188-199: ⚠️ Potential issue | 🟡 Minor
Assert the no-op, not just the final contents.
Line 191 says the second `save()` should be a no-op, but this test still passes if `save()` rewrites the same JSON. Capture file metadata before/after the second call, or spy on the write path, so the test proves the optimization instead of only re-reading the final file contents.
Proposed fix
```diff
-import { mkdtemp, rm, readFile, writeFile } from "node:fs/promises";
+import { mkdtemp, rm, readFile, writeFile, stat } from "node:fs/promises";
 ...
   it("skips write when nothing changed", async () => {
     store.add([makeChunk("n1", 0, [1, 0, 0])]);
     await store.save();
+    const before = await stat(join(tempDir, "test-index.json"));
     await store.save(); // second save should be no-op
+    const after = await stat(join(tempDir, "test-index.json"));
     const content = await readFile(
       join(tempDir, "test-index.json"),
       "utf-8",
     );
     const data = JSON.parse(content);
+    expect(after.mtimeMs).toBe(before.mtimeMs);
     expect(data.version).toBe(1);
     expect(data.entries).toHaveLength(1);
   });
```
As per coding guidelines,
`**/*.test.{ts,tsx,js,jsx}`: The tests are not tautological.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai/__tests__/vectorStore.test.ts` around lines 188 - 199, The test currently only verifies final file contents but must assert the second store.save() is a true no-op; modify the "skips write when nothing changed" test to capture file metadata or spy the write path before the second save (e.g., stat the file mtime or spy on fs.writeFile/Store's internal write method used by save()), call await store.save() the second time, then assert that no write occurred (mtime unchanged or spy not called) in addition to the existing content checks; locate the logic around store.save() and the file read (readFile/join usage) to add the stat/spy and the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai/vectorStore.ts`:
- Around line 28-33: The add(embeddedChunks: EmbeddedChunk[]) method currently
mutates this.entries before validating embedding sizes, allowing mixed-dimension
vectors to be stored; update add to first determine the expected vector length
(use an existing this.dimension property or set it from the first inserted
ec.embedding.length if missing) and validate every ec.embedding.length against
that expected dimension, throwing an error and refusing to mutate this.entries
if any mismatch is found; ensure you set this.dimension on the first successful
insert and only mark this.dirty = true after all entries have been validated and
inserted.
---
Duplicate comments:
In `@src/ai/__tests__/vectorStore.test.ts`:
- Around line 188-199: The test currently only verifies final file contents but
must assert the second store.save() is a true no-op; modify the "skips write
when nothing changed" test to capture file metadata or spy the write path before
the second save (e.g., stat the file mtime or spy on fs.writeFile/Store's
internal write method used by save()), call await store.save() the second time,
then assert that no write occurred (mtime unchanged or spy not called) in
addition to the existing content checks; locate the logic around store.save()
and the file read (readFile/join usage) to add the stat/spy and the assertion.
In `@src/ai/vectorStore.ts`:
- Around line 149-165: Validate the loaded payload before rebuilding
this.entries: confirm data.entries is an array and iterate each entry to ensure
it has a chunk object with required fields (at least chunk.id) and a numeric
vector array whose length matches data.vectorDim (when data.vectorDim is
defined) and all elements are numbers; throw a clear Error identifying the
offending entry if any check fails; only after all entries pass validation,
clear this.entries and populate it using entry.chunk.id -> { chunk: entry.chunk,
vector: entry.vector } so malformed rows never get inserted (referencing
data.entries, data.vectorDim, entry.chunk, entry.vector, and this.entries).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f46da6c2-eb49-4565-8c70-ea5406ae89a6
📒 Files selected for processing (2)
- src/ai/__tests__/vectorStore.test.ts
- src/ai/vectorStore.ts
```ts
add(embeddedChunks: EmbeddedChunk[]): void {
  for (const ec of embeddedChunks) {
    this.entries.set(ec.chunk.id, ec);
  }
  this.dirty = true;
}
```
Reject mixed-dimension vectors on insert.
`search()` and `findRelated()` assume every stored embedding has the same length. Right now one bad chunk can be added successfully, persisted, and then cause every later query to throw `Vector dimension mismatch` when iteration reaches it. Enforce a single vector dimension before mutating `this.entries`.
Proposed fix
```diff
 /** Add chunks. Overwrites existing entries with the same ID. */
 add(embeddedChunks: EmbeddedChunk[]): void {
+  const expectedDim =
+    this.entries.values().next().value?.vector.length ??
+    embeddedChunks[0]?.vector.length;
+
   for (const ec of embeddedChunks) {
+    if (expectedDim !== undefined && ec.vector.length !== expectedDim) {
+      throw new Error(
+        `VectorStore: expected ${expectedDim}-dim vectors, got ` +
+          `${ec.vector.length} for chunk "${ec.chunk.id}".`,
+      );
+    }
     this.entries.set(ec.chunk.id, ec);
   }
   this.dirty = true;
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/vectorStore.ts` around lines 28 - 33, The add(embeddedChunks:
EmbeddedChunk[]) method currently mutates this.entries before validating
embedding sizes, allowing mixed-dimension vectors to be stored; update add to
first determine the expected vector length (use an existing this.dimension
property or set it from the first inserted ec.embedding.length if missing) and
validate every ec.embedding.length against that expected dimension, throwing an
error and refusing to mutate this.entries if any mismatch is found; ensure you
set this.dimension on the first successful insert and only mark this.dirty =
true after all entries have been validated and inserted.
@dorodb-web22 Really solid first implementation: BM25 + RRF + local embeddings in pure TypeScript with 77 tests is impressive work. A few things I noticed:
Blocking concerns:
The per-PR fixes you have been pushing (chunker reuse, error descriptiveness, fusionScore) are exactly the right things to do. The 3 blocking items above are the main things I would want to see resolved before merge. Happy to discuss any of them...
Description
This PR introduces Hybrid Search to the AI module, building on top of the recent embedding pipeline by integrating keyword search and combining the results using Reciprocal Rank Fusion (RRF). This ensures that searches are robust against exact-match keywords while still benefiting from semantic understanding.
Key additions:
All 77 AI-related tests are passing locally and TypeScript compilation is clean. I'm open to any feedback on the implementation or tuning parameters!
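For readers unfamiliar with RRF, here is a self-contained sketch of the fusion step. The `1 / (k + rank)` scoring with a `k = 60` default is the common formulation; the PR's actual `reciprocalRankFusion` signature and result shape may differ.

```typescript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank)
// for an item's 1-based rank, so items ranked well in either list rise.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      // i is 0-based, so the 1-based rank is i + 1.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

An item that appears in both the semantic and keyword lists accumulates two contributions, which is what makes the fused ranking robust to either signal missing a query.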
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
Tests
Chores