feat(ai): add local embedding pipeline and semantic search #13
dorodb-web22 wants to merge 10 commits into AOSSIE-Org:main
Conversation
Walkthrough
Adds a semantic search subsystem: typed data models, an EmbeddingService (pluggable, lazy-loaded), an in-memory VectorStore with persistence, a SemanticSearchService that orchestrates chunking/indexing/search, TypeScript build/test configs, a package manifest, .gitignore updates, and comprehensive unit tests.
Sequence Diagram(s)
sequenceDiagram
actor User
participant SemanticSearchService
participant EmbeddingService
participant VectorStore
participant TransformersLib as "Transformers (lazy)"
User->>SemanticSearchService: initialize()
activate SemanticSearchService
SemanticSearchService->>EmbeddingService: initialize()
activate EmbeddingService
EmbeddingService->>TransformersLib: lazy load pipeline
activate TransformersLib
TransformersLib-->>EmbeddingService: pipeline ready
deactivate TransformersLib
EmbeddingService-->>SemanticSearchService: ready
deactivate EmbeddingService
SemanticSearchService->>VectorStore: load()
activate VectorStore
VectorStore-->>SemanticSearchService: index loaded
deactivate VectorStore
SemanticSearchService-->>User: initialized
deactivate SemanticSearchService
User->>SemanticSearchService: search(query)
activate SemanticSearchService
SemanticSearchService->>EmbeddingService: embed([query])
activate EmbeddingService
EmbeddingService->>TransformersLib: extract features
TransformersLib-->>EmbeddingService: vector
deactivate EmbeddingService
SemanticSearchService->>VectorStore: search(queryVector, topK, minScore)
activate VectorStore
Note over VectorStore: compute cosine similarities<br/>filter & rank results
VectorStore-->>SemanticSearchService: SearchResult[]
deactivate VectorStore
SemanticSearchService-->>User: ranked results
deactivate SemanticSearchService
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai/__tests__/semanticSearch.test.ts`:
- Around line 167-175: The test currently branches around an empty result so it
would pass if findRelatedNotes() returns [], so change the assertions to first
assert that related has at least one element (e.g., related.length > 0) and then
assert that the top result or the set contains the expected note id; update the
test case that calls service.indexNote(...) and
service.findRelatedNotes("ml-intro", 5) to explicitly assert existence before
checking related[0].chunk.noteId === "ml-advanced" (or assert that some item in
related has chunk.noteId "ml-advanced") to prevent a false positive.
- Around line 142-147: The test's noteId filter assertion is vacuous because
results.every(...) passes for an empty array; update the test for the "filters
by noteId" case to first assert that the search returned at least one result
(e.g., expect(results.length).toBeGreaterThan(0) or
expect(results).not.toHaveLength(0)) before verifying each result's chunk.noteId
equals "cooking-note" when calling service.search("any query", { noteId:
"cooking-note", minScore: 0 }).
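Both findings above are instances of the same vacuous-truth trap: `Array.prototype.every` (and a guarded `related[0]` check) passes trivially when the result array is empty. A minimal standalone sketch of the problem and the fix, using plain assertions rather than the PR's vitest matchers:

```typescript
// `every` on an empty array is vacuously true, so a search that returns
// nothing still "passes" a per-result filter assertion.
type Result = { chunk: { noteId: string }; score: number };

const empty: Result[] = [];
const vacuous = empty.every((r) => r.chunk.noteId === "cooking-note"); // true!

// Fix: assert non-emptiness before asserting per-result properties.
function assertAllFromNote(results: Result[], noteId: string): void {
  if (results.length === 0) {
    throw new Error("expected at least one search result");
  }
  for (const r of results) {
    if (r.chunk.noteId !== noteId) {
      throw new Error(`unexpected noteId: ${r.chunk.noteId}`);
    }
  }
}
```

In the vitest tests this amounts to adding `expect(results.length).toBeGreaterThan(0)` before the `every`-style and `related[0]` assertions.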
In `@src/ai/embeddings.ts`:
- Around line 42-59: The current initialization assigns this.initPromise to an
async IIFE that, if it rejects, stays cached and prevents retries; update the
initialization so any error clears the cached this.initPromise before
rethrowing. Concretely, wrap the async IIFE body in a try/catch (or attach a
.catch handler) and on catch set this.initPromise = undefined (or null) and then
rethrow the error; keep the same pipeline/extractor logic and this.embedFn
assignment inside the successful path so subsequent calls to initialize() will
attempt a fresh init if a transient failure occurred.
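A sketch of retry-safe lazy initialization along those lines; the names mirror the comment above, and `loadPipeline` is a hypothetical stand-in for the transformers pipeline loader, not the PR's actual code:

```typescript
// If the cached promise rejects, clear it before rethrowing so the next
// initialize() call attempts a fresh load instead of replaying the
// stale rejection forever.
class LazyInitExample {
  private initPromise?: Promise<void>;
  private readonly loadPipeline: () => Promise<void>;

  constructor(loadPipeline: () => Promise<void>) {
    this.loadPipeline = loadPipeline;
  }

  initialize(): Promise<void> {
    if (!this.initPromise) {
      this.initPromise = (async () => {
        try {
          await this.loadPipeline();
        } catch (err) {
          this.initPromise = undefined; // allow retry after a transient failure
          throw err;
        }
      })();
    }
    return this.initPromise;
  }
}
```

Callers still share a single in-flight load; only a rejected load is evicted from the cache.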
In `@src/ai/semanticSearch.ts`:
- Around line 116-120: The current getAllChunksForNote uses a zero-vector +
search hack (getAllChunksForNote) which hardcodes a 384-dim vector and a 1000
result cap via VectorStore.search; add a dedicated API on the vector store
(e.g., VectorStore.getChunksByNoteId or an async iterator) that returns all
TextChunk entries for a given noteId without embedding-size assumptions or
result limits, then change getAllChunksForNote to call that new method (removing
the dummyVector and search call) so it reliably returns every chunk for the
note.
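A sketch of such a dedicated lookup; the `entries` map and `TextChunk` shape here are assumptions based on the comment, not the PR's actual internals:

```typescript
interface TextChunk {
  id: string;
  noteId: string;
  text: string;
}

class VectorStoreSketch {
  private entries = new Map<string, { chunk: TextChunk; vector: number[] }>();

  add(chunk: TextChunk, vector: number[]): void {
    this.entries.set(chunk.id, { chunk, vector });
  }

  // Direct scan: no dummy 384-dim vector, no similarity pass, no result cap.
  getChunksByNoteId(noteId: string): TextChunk[] {
    const out: TextChunk[] = [];
    for (const { chunk } of this.entries.values()) {
      if (chunk.noteId === noteId) out.push(chunk);
    }
    return out;
  }
}
```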
- Around line 86-101: The current dedup logic in the method that iterates
this.getAllChunksForNote(noteId) and calls this.vectorStore.findRelated uses a
Set (seen) and first-hit-wins, which can discard higher-scoring hits discovered
later; replace that with a map keyed by result.chunk.noteId (e.g., bestByNote:
Map<string, SearchResult>) and for each related result update the map only if
result.score is greater than the stored entry (or insert if missing), then
produce relatedChunks from the map values, sort by score (b.score - a.score) and
return the topK slice; keep using the same symbols (relatedChunks,
this.getAllChunksForNote, this.vectorStore.findRelated, result.chunk.noteId) to
locate and update the code.
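The keep-best-score dedup could look like the following; the `SearchResult` shape follows the comment, and exact field names in the PR may differ:

```typescript
interface SearchResultSketch {
  chunk: { noteId: string };
  score: number;
}

// First-hit-wins with a Set can discard a higher-scoring hit seen later;
// a Map keyed by noteId keeps the best score per note instead.
function dedupeBestPerNote(
  results: SearchResultSketch[],
  topK: number,
): SearchResultSketch[] {
  const bestByNote = new Map<string, SearchResultSketch>();
  for (const result of results) {
    const prev = bestByNote.get(result.chunk.noteId);
    if (!prev || result.score > prev.score) {
      bestByNote.set(result.chunk.noteId, result);
    }
  }
  return [...bestByNote.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```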
- Around line 49-58: The current indexNote implementation calls
this.vectorStore.removeByNoteId(noteId) before chunking/embedding, risking
permanent loss if chunkFn or embedChunks fail; change the flow in indexNote so
you first generate chunks via this.chunkFn(noteId, content) and call
this.embeddingService.embedChunks(chunks), and only after successful embedding
call this.vectorStore.removeByNoteId(noteId) (or atomically replace via an
upsert method if available) and then this.vectorStore.add(embeddedChunks);
ensure any save() that persists the vectorStore runs only after the add
completes successfully.
In `@src/ai/types.ts`:
- Around line 16-19: The JSDoc for SearchResult.score incorrectly states the
cosine similarity range as 0..1; update the comment for the SearchResult
interface (the score property) to reflect the true raw cosine similarity range
-1..1, or alternatively change the implementation that produces
SearchResult.score (where scores are computed/returned) to explicitly
clamp/normalize values to 0..1 and then update the comment to document that
normalization. Ensure you reference the SearchResult interface and its score
property so the doc and behavior remain consistent.
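If the implementation opts for normalization rather than a doc fix, the clamped mapping from raw cosine similarity to 0..1 might look like this (illustrative, not the PR's code):

```typescript
// Raw cosine similarity of two vectors lies in [-1, 1]. Map it to
// [0, 1]; the clamp guards against floating-point drift slightly
// outside the theoretical range.
function normalizeCosine(similarity: number): number {
  const shifted = (similarity + 1) / 2;
  return Math.min(1, Math.max(0, shifted));
}
```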
In `@src/ai/vectorStore.ts`:
- Around line 97-103: The serialized payload only includes version and raw
vectors (see SerializedVectorStore/CURRENT_VERSION and the entries mapping), so
reloading can hide incompatible embeddings (e.g., different EmbeddingService
modelId or vector length) and later blow up with "Vector dimension mismatch";
update the serializer to persist embedding-compatibility metadata (at minimum
embedding modelId and vector dimensionality/length alongside version) when
saving, and update the corresponding load() logic to validate those fields
against the current EmbeddingService/config and throw a hard error (fail closed)
if they differ instead of issuing a warning (replace the warning block currently
around load() checks with an exception). Ensure references to
SerializedVectorStore, load(), and any EmbeddingService.config/modelId and
vector size checks are updated so index rebuild is forced when incompatible.
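A sketch of the fail-closed compatibility check; the field names `modelId` and `dimensions` are illustrative, and the PR's `SerializedVectorStore` may use different ones:

```typescript
interface SerializedIndexSketch {
  version: number;
  modelId: string;
  dimensions: number;
}

function validateIndexCompatibility(
  saved: SerializedIndexSketch,
  current: { modelId: string; dimensions: number },
): void {
  if (
    saved.modelId !== current.modelId ||
    saved.dimensions !== current.dimensions
  ) {
    // Fail closed: a hard error here forces an index rebuild instead of
    // a confusing "Vector dimension mismatch" later at search time.
    throw new Error(
      `Incompatible index (saved ${saved.modelId}/${saved.dimensions}, ` +
        `current ${current.modelId}/${current.dimensions}); rebuild required`,
    );
  }
}
```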
In `@tsconfig.json`:
- Around line 20-27: Your tsconfig currently uses the exclude pattern
"**/__tests__/**", which prevents test files (e.g., src/ai/__tests__/*) from
being type-checked because both build and typecheck share the same config; fix
this by creating a separate build-only tsconfig (name it e.g.,
tsconfig.build.json) that extends the base tsconfig.json and moves the
"exclude": ["node_modules","dist","**/__tests__/**"] into that build config
while leaving the main tsconfig.json without the test exclusion so npm run
typecheck uses the base config and npm run build uses tsconfig.build.json.
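A minimal split along those lines. The contents below are assumed, not the repo's actual configs; compiler options live in the base file. tsconfig.build.json:

```json
{
  "extends": "./tsconfig.json",
  "exclude": ["node_modules", "dist", "**/__tests__/**"]
}
```

With this in place, `npm run typecheck` can stay on `tsc --noEmit` (base config, tests included) while `npm run build` runs `tsc -p tsconfig.build.json`.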
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c2347694-9da4-430b-95c6-66663160ee5a
⛔ Files ignored due to path filters (1)
package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (12)
.gitignore, package.json, src/ai/__tests__/embeddings.test.ts, src/ai/__tests__/semanticSearch.test.ts, src/ai/__tests__/vectorStore.test.ts, src/ai/embeddings.ts, src/ai/index.ts, src/ai/semanticSearch.ts, src/ai/types.ts, src/ai/vectorStore.ts, tsconfig.json, vitest.config.ts
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tsconfig.json`:
- Around line 20-22: The current tsconfig.json "include" only covers
"src/**/*.ts", so root files like vitest.config.ts are excluded from the tsc
--noEmit typecheck run defined in package.json; update your tooling by either
extending or creating a separate small tsconfig (e.g., tsconfig.typecheck.json)
that includes both "src/**/*.ts" and the root config file(s) (e.g.,
"vitest.config.ts") or expand the existing "include" to also match root-level TS
files, then point the package.json typecheck script to that config so
vitest.config.ts is covered by typechecking.
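One possible shape for the expanded include (assumed; the repo may instead prefer a dedicated tsconfig.typecheck.json):

```json
{
  "include": ["src/**/*.ts", "vitest.config.ts"]
}
```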
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2ed553a8-602f-451d-861c-8f9e35411f99
📒 Files selected for processing (3)
package.json, tsconfig.build.json, tsconfig.json
♻️ Duplicate comments (1)
src/ai/semanticSearch.ts (1)
48-59: ⚠️ Potential issue | 🟠 Major: Don't drop the old note index until re-embedding succeeds.
Line 50 removes existing chunks before chunkFn() and embedChunks() complete. If embedding fails (e.g., model error, initialization issue), the note is left unindexed. This loss becomes permanent if save() runs afterward.
Suggested fix: remove old chunks only after successful embedding

```diff
 async indexNote(noteId: string, content: string): Promise<number> {
-  this.vectorStore.removeByNoteId(noteId);
   const chunks = this.chunkFn(noteId, content);
-  if (chunks.length === 0) return 0;
+  if (chunks.length === 0) {
+    this.vectorStore.removeByNoteId(noteId);
+    return 0;
+  }
   const embeddedChunks = await this.embeddingService.embedChunks(chunks);
+  this.vectorStore.removeByNoteId(noteId);
   this.vectorStore.add(embeddedChunks);
   return chunks.length;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai/semanticSearch.ts` around lines 48 - 59, the code currently calls vectorStore.removeByNoteId(noteId) before chunking/embedding which can permanently drop the index if embedding fails; change indexNote (the method) to first produce chunks via chunkFn(noteId, content) and await embeddingService.embedChunks(chunks) and only after successful embedding call vectorStore.removeByNoteId(noteId) followed by vectorStore.add(embeddedChunks); ensure any thrown errors from embedding propagate so removal never happens on failure (or use a tmp variable for embeddedChunks and only call removeByNoteId and add when embeddedChunks is ready).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/ai/semanticSearch.ts`:
- Around line 48-59: The code currently calls vectorStore.removeByNoteId(noteId)
before chunking/embedding which can permanently drop the index if embedding
fails; change indexNote (the method) to first produce chunks via chunkFn(noteId,
content) and await embeddingService.embedChunks(chunks) and only after
successful embedding call vectorStore.removeByNoteId(noteId) followed by
vectorStore.add(embeddedChunks); ensure any thrown errors from embedding
propagate so removal never happens on failure (or use a tmp variable for
embeddedChunks and only call removeByNoteId and add when embeddedChunks is
ready).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 20e85da5-50cf-4d03-9fcb-084054afcf11
📒 Files selected for processing (2)
src/ai/semanticSearch.tssrc/ai/vectorStore.ts
3b0db18 to c21392e
Summary
Adds src/ai/ with three modules:
- EmbeddingService — wraps @xenova/transformers (all-MiniLM-L6-v2) for fully offline embeddings, no API calls. Accepts a pluggable embedFn so tests run without downloading real models.
- VectorStore — cosine similarity search with JSON persistence to disk. Pure TypeScript, no native bindings — works in Electron without platform-specific builds.
- SemanticSearchService — full pipeline: chunk → embed → index → search. Also exposes findRelatedNotes() for the related-notes sidebar feature.

The ChunkFn interface is drop-in compatible with the chunking utilities from #12, and the re-indexing flow integrates naturally with the storage layer from #5.
Screenshots / Recordings
N/A — backend modules only, no UI changes.
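For reference, the metric the VectorStore ranks by is cosine similarity; a minimal self-contained version (a sketch, not the PR's actual code):

```typescript
// Cosine similarity: dot product of the vectors divided by the product
// of their magnitudes. Result is in [-1, 1]; 1 means same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```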
Additional Notes
npm install && npm test
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
Chores
Tests