
feat(ai): add local embedding pipeline and semantic search#13

Open
dorodb-web22 wants to merge 10 commits into AOSSIE-Org:main from dorodb-web22:feat/ai-embedding-pipeline

Conversation

@dorodb-web22 dorodb-web22 commented Mar 6, 2026

Summary

Adds src/ai/ with three modules:

  • EmbeddingService — wraps @xenova/transformers (all-MiniLM-L6-v2) for fully offline embeddings, no API calls. Accepts a pluggable embedFn so tests run without downloading real models.
  • VectorStore — cosine similarity search with JSON persistence to disk. Pure TypeScript, no native bindings — works in Electron without platform-specific builds.
  • SemanticSearchService — full pipeline: chunk → embed → index → search. Also exposes findRelatedNotes() for the related-notes sidebar feature.

The ChunkFn interface is drop-in compatible with the chunking utilities from #12, and the re-indexing flow integrates naturally with the storage layer from #5.
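
The chunk → embed → index → search flow can be sketched end-to-end with a deterministic mock embedFn, much as the tests avoid downloading a real model. The names and shapes below are illustrative, not the module's exact API:

```typescript
type Vec = number[];

// Toy deterministic embedFn: character-code histogram over 4 buckets,
// L2-normalized. A stand-in for the all-MiniLM-L6-v2 pipeline.
const embedFn = async (texts: string[]): Promise<Vec[]> =>
  texts.map((t) => {
    const v = [0, 0, 0, 0];
    for (const ch of t) v[ch.charCodeAt(0) % 4] += 1;
    const norm = Math.hypot(...v) || 1;
    return v.map((x) => x / norm);
  });

// Vectors are pre-normalized, so cosine similarity is just the dot product.
function cosine(a: Vec, b: Vec): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

async function search(query: string, docs: { id: string; text: string }[]) {
  const [qv, ...dvs] = await embedFn([query, ...docs.map((d) => d.text)]);
  return docs
    .map((d, i) => ({ id: d.id, score: cosine(qv, dvs[i]) }))
    .sort((a, b) => b.score - a.score);
}
```

Swapping the toy embedFn for the real transformers-backed one leaves the search logic unchanged, which is what makes the pluggable-embedFn design testable offline.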

Screenshots / Recordings

N/A — backend modules only, no UI changes.

Additional Notes

  • 51 tests passing, 0 TypeScript errors
  • Run locally: npm install && npm test

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Semantic search: index notes, relevance-ranked search, related-note discovery, pluggable chunking and search options
    • On-device embeddings with a pluggable backend and embedding utilities
    • Persistent vector index with similarity ranking, filtering, and note-scoped queries
    • Centralized AI API surface and shared AI types/defaults
  • Chores

    • Project initialized with Node.js/TypeScript toolchain, build/test configs, and updated ignore patterns
  • Tests

    • Extensive tests for embeddings, chunking, vector store, persistence, and search behaviors

coderabbitai bot commented Mar 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a semantic search subsystem: typed data models, an EmbeddingService (pluggable, lazy-loaded), an in-memory VectorStore with persistence, a SemanticSearchService that orchestrates chunking/indexing/search, TypeScript build/test configs, package manifest, .gitignore updates, and comprehensive unit tests.

Changes

Cohort / File(s): Summary

  • Project config & metadata (.gitignore, package.json, tsconfig.json, tsconfig.build.json, vitest.config.ts): Added ignore patterns, a Node/TypeScript project manifest, TypeScript build/test configs, and Vitest configuration.
  • Public API aggregate & types (src/ai/index.ts, src/ai/types.ts): Introduced shared AI types, default configs, and a central re-export module exposing embeddings, the vector store, semantic search, and related types/constants.
  • Embedding service (src/ai/embeddings.ts): New EmbeddingService and EmbedFn type, supporting pluggable embed functions, lazy dynamic import of transformers, input truncation, embed/embedSingle/embedChunks, and a modelId getter.
  • Vector store & similarity util (src/ai/vectorStore.ts): New VectorStore and cosineSimilarity, providing an in-memory index of EmbeddedChunks with add/remove, search (topK/minScore/note scoping), findRelated, versioned JSON persistence, and utility getters.
  • Semantic search service & chunking (src/ai/semanticSearch.ts): New SemanticSearchService, ChunkFn, and defaultChunker, integrating embeddings and the vector store; supports initialize, indexNote, removeNote, search, findRelatedNotes, save, indexSize, and isNoteIndexed.
  • Unit tests (src/ai/__tests__/embeddings.test.ts, src/ai/__tests__/vectorStore.test.ts, src/ai/__tests__/semanticSearch.test.ts): Comprehensive tests covering embeddings (including a mock embedFn), cosine similarity and vector store behaviors (persistence, search, removal), and end-to-end semantic search scenarios with deterministic mock embeddings and isolated temp directories.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant SemanticSearchService as "SemanticSearchService\n(rgba(52,152,219,0.5))"
    participant EmbeddingService as "EmbeddingService\n(rgba(46,204,113,0.5))"
    participant VectorStore as "VectorStore\n(rgba(155,89,182,0.5))"
    participant TransformersLib as "Transformers (lazy)\n(rgba(241,196,15,0.5))"

    User->>SemanticSearchService: initialize()
    activate SemanticSearchService
    SemanticSearchService->>EmbeddingService: initialize()
    activate EmbeddingService
    EmbeddingService->>TransformersLib: lazy load pipeline
    activate TransformersLib
    TransformersLib-->>EmbeddingService: pipeline ready
    deactivate TransformersLib
    EmbeddingService-->>SemanticSearchService: ready
    deactivate EmbeddingService
    SemanticSearchService->>VectorStore: load()
    activate VectorStore
    VectorStore-->>SemanticSearchService: index loaded
    deactivate VectorStore
    SemanticSearchService-->>User: initialized
    deactivate SemanticSearchService

    User->>SemanticSearchService: search(query)
    activate SemanticSearchService
    SemanticSearchService->>EmbeddingService: embed([query])
    activate EmbeddingService
    EmbeddingService->>TransformersLib: extract features
    TransformersLib-->>EmbeddingService: vector
    deactivate EmbeddingService
    SemanticSearchService->>VectorStore: search(queryVector, topK, minScore)
    activate VectorStore
    Note over VectorStore: compute cosine similarities<br/>filter & rank results
    VectorStore-->>SemanticSearchService: SearchResult[]
    deactivate VectorStore
    SemanticSearchService-->>User: ranked results
    deactivate SemanticSearchService

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested labels

Typescript Lang

Poem

🐇 I hop through tokens, soft and spry,
Chunks in burrows, vectors sigh,
Embeddings hum, the index gleams,
Notes find kin in midnight dreams,
A rabbit cheers — search leaps high!

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title 'feat(ai): add local embedding pipeline and semantic search' accurately and concisely summarizes the main changes—adding offline embedding and semantic search capabilities to the AI module.
  • Description check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.


@github-actions github-actions bot added size/XL and removed size/XL labels Mar 6, 2026
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai/__tests__/semanticSearch.test.ts`:
- Around line 167-175: The test currently branches around an empty result so it
would pass if findRelatedNotes() returns [], so change the assertions to first
assert that related has at least one element (e.g., related.length > 0) and then
assert that the top result or the set contains the expected note id; update the
test case that calls service.indexNote(...) and
service.findRelatedNotes("ml-intro", 5) to explicitly assert existence before
checking related[0].chunk.noteId === "ml-advanced" (or assert that some item in
related has chunk.noteId "ml-advanced") to prevent a false positive.
- Around line 142-147: The test's noteId filter assertion is vacuous because
results.every(...) passes for an empty array; update the test for the "filters
by noteId" case to first assert that the search returned at least one result
(e.g., expect(results.length).toBeGreaterThan(0) or
expect(results).not.toHaveLength(0)) before verifying each result's chunk.noteId
equals "cooking-note" when calling service.search("any query", { noteId:
"cooking-note", minScore: 0 }).

In `@src/ai/embeddings.ts`:
- Around line 42-59: The current initialization assigns this.initPromise to an
async IIFE that, if it rejects, stays cached and prevents retries; update the
initialization so any error clears the cached this.initPromise before
rethrowing. Concretely, wrap the async IIFE body in a try/catch (or attach a
.catch handler) and on catch set this.initPromise = undefined (or null) and then
rethrow the error; keep the same pipeline/extractor logic and this.embedFn
assignment inside the successful path so subsequent calls to initialize() will
attempt a fresh init if a transient failure occurred.
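
The retry-safe promise caching this comment asks for can be sketched as follows (the class name is hypothetical; the pattern is what matters):

```typescript
// Cache the in-flight init promise so concurrent callers share one attempt,
// but clear the cache on failure so a later call can retry.
class LazyInit {
  private initPromise?: Promise<void>;

  constructor(private loader: () => Promise<void>) {}

  initialize(): Promise<void> {
    if (!this.initPromise) {
      this.initPromise = this.loader().catch((err) => {
        this.initPromise = undefined; // allow a fresh attempt on the next call
        throw err;
      });
    }
    return this.initPromise;
  }
}
```

Without the `.catch` that clears the cache, a transient model-download failure would poison every subsequent `initialize()` call with the same cached rejection.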

In `@src/ai/semanticSearch.ts`:
- Around line 116-120: The current getAllChunksForNote uses a zero-vector +
search hack (getAllChunksForNote) which hardcodes a 384-dim vector and a 1000
result cap via VectorStore.search; add a dedicated API on the vector store
(e.g., VectorStore.getChunksByNoteId or an async iterator) that returns all
TextChunk entries for a given noteId without embedding-size assumptions or
result limits, then change getAllChunksForNote to call that new method (removing
the dummyVector and search call) so it reliably returns every chunk for the
note.
- Around line 86-101: The current dedup logic in the method that iterates
this.getAllChunksForNote(noteId) and calls this.vectorStore.findRelated uses a
Set (seen) and first-hit-wins, which can discard higher-scoring hits discovered
later; replace that with a map keyed by result.chunk.noteId (e.g., bestByNote:
Map<string, SearchResult>) and for each related result update the map only if
result.score is greater than the stored entry (or insert if missing), then
produce relatedChunks from the map values, sort by score (b.score - a.score) and
return the topK slice; keep using the same symbols (relatedChunks,
this.getAllChunksForNote, this.vectorStore.findRelated, result.chunk.noteId) to
locate and update the code.
- Around line 49-58: The current indexNote implementation calls
this.vectorStore.removeByNoteId(noteId) before chunking/embedding, risking
permanent loss if chunkFn or embedChunks fail; change the flow in indexNote so
you first generate chunks via this.chunkFn(noteId, content) and call
this.embeddingService.embedChunks(chunks), and only after successful embedding
call this.vectorStore.removeByNoteId(noteId) (or atomically replace via an
upsert method if available) and then this.vectorStore.add(embeddedChunks);
ensure any save() that persists the vectorStore runs only after the add
completes successfully.
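
The per-note best-score dedup suggested for findRelatedNotes above can be sketched in isolation (the SearchResult shape is simplified here):

```typescript
interface SearchResult { chunk: { noteId: string }; score: number }

// Keep only the best-scoring hit per note, then rank. A first-hit-wins Set
// would instead discard higher-scoring hits discovered later in the loop.
function dedupeByBestScore(results: SearchResult[], topK: number): SearchResult[] {
  const bestByNote = new Map<string, SearchResult>();
  for (const result of results) {
    const prev = bestByNote.get(result.chunk.noteId);
    if (!prev || result.score > prev.score) {
      bestByNote.set(result.chunk.noteId, result);
    }
  }
  return Array.from(bestByNote.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```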

In `@src/ai/types.ts`:
- Around line 16-19: The JSDoc for SearchResult.score incorrectly states the
cosine similarity range as 0..1; update the comment for the SearchResult
interface (the score property) to reflect the true raw cosine similarity range
-1..1, or alternatively change the implementation that produces
SearchResult.score (where scores are computed/returned) to explicitly
clamp/normalize values to 0..1 and then update the comment to document that
normalization. Ensure you reference the SearchResult interface and its score
property so the doc and behavior remain consistent.
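
The raw cosine range this comment refers to is easy to confirm with a minimal implementation (mirroring the cosineSimilarity utility the walkthrough describes; the exact signature is assumed):

```typescript
// Raw cosine similarity ranges over -1..1: opposed vectors score -1,
// orthogonal vectors 0, identical directions 1.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

If a 0..1 score is preferred for SearchResult.score, `(1 + cosine) / 2` is one common normalization; either way the JSDoc and the implementation should agree.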

In `@src/ai/vectorStore.ts`:
- Around line 97-103: The serialized payload only includes version and raw
vectors (see SerializedVectorStore/CURRENT_VERSION and the entries mapping), so
reloading can hide incompatible embeddings (e.g., different EmbeddingService
modelId or vector length) and later blow up with "Vector dimension mismatch";
update the serializer to persist embedding-compatibility metadata (at minimum
embedding modelId and vector dimensionality/length alongside version) when
saving, and update the corresponding load() logic to validate those fields
against the current EmbeddingService/config and throw a hard error (fail closed)
if they differ instead of issuing a warning (replace the warning block currently
around load() checks with an exception). Ensure references to
SerializedVectorStore, load(), and any EmbeddingService.config/modelId and
vector size checks are updated so index rebuild is forced when incompatible.
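
A sketch of the fail-closed compatibility check (the fields beyond version are assumptions, not the PR's actual SerializedVectorStore shape):

```typescript
interface SerializedVectorStore {
  version: number;
  modelId: string;   // assumed new field: embedding model that produced the vectors
  dimension: number; // assumed new field: vector length at save time
  entries: { id: string; vector: number[] }[];
}

// Fail closed on load: an index built by a different model or dimension
// must force a rebuild instead of surfacing later as a dimension mismatch.
function assertCompatible(
  data: SerializedVectorStore,
  modelId: string,
  dimension: number
): void {
  if (data.modelId !== modelId || data.dimension !== dimension) {
    throw new Error(
      `Incompatible vector index (saved: ${data.modelId}/${data.dimension}, ` +
      `current: ${modelId}/${dimension}); rebuild required`
    );
  }
}
```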

In `@tsconfig.json`:
- Around line 20-27: Your tsconfig currently uses the exclude pattern
"**/__tests__/**", which prevents test files (e.g., src/ai/__tests__/*) from
being type-checked because both build and typecheck share the same config; fix
this by creating a separate build-only tsconfig (name it e.g.,
tsconfig.build.json) that extends the base tsconfig.json and moves the
"exclude": ["node_modules","dist","**/__tests__/**"] into that build config
while leaving the main tsconfig.json without the test exclusion so npm run
typecheck uses the base config and npm run build uses tsconfig.build.json.
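
A minimal sketch of the split described above (field values assumed from the comment, not copied from the repo):

```jsonc
// tsconfig.build.json: build-only config. The base tsconfig.json keeps
// type-checking test files by omitting the __tests__ exclusion.
{
  "extends": "./tsconfig.json",
  "exclude": ["node_modules", "dist", "**/__tests__/**"]
}
```

With this split, `npm run typecheck` (pointing at tsconfig.json) covers test files, while `npm run build` (pointing at tsconfig.build.json) leaves them out of the emitted output.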

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c2347694-9da4-430b-95c6-66663160ee5a

📥 Commits

Reviewing files that changed from the base of the PR and between a3ccb2b and e0077c3.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (12)
  • .gitignore
  • package.json
  • src/ai/__tests__/embeddings.test.ts
  • src/ai/__tests__/semanticSearch.test.ts
  • src/ai/__tests__/vectorStore.test.ts
  • src/ai/embeddings.ts
  • src/ai/index.ts
  • src/ai/semanticSearch.ts
  • src/ai/types.ts
  • src/ai/vectorStore.ts
  • tsconfig.json
  • vitest.config.ts

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tsconfig.json`:
- Around line 20-22: The current tsconfig.json "include" only covers
"src/**/*.ts", so root files like vitest.config.ts are excluded from the tsc
--noEmit typecheck run defined in package.json; update your tooling by either
extending or creating a separate small tsconfig (e.g., tsconfig.typecheck.json)
that includes both "src/**/*.ts" and the root config file(s) (e.g.,
"vitest.config.ts") or expand the existing "include" to also match root-level TS
files, then point the package.json typecheck script to that config so
vitest.config.ts is covered by typechecking.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2ed553a8-602f-451d-861c-8f9e35411f99

📥 Commits

Reviewing files that changed from the base of the PR and between 8ee1020 and 83db71d.

📒 Files selected for processing (3)
  • package.json
  • tsconfig.build.json
  • tsconfig.json

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
src/ai/semanticSearch.ts (1)

48-59: ⚠️ Potential issue | 🟠 Major

Don't drop the old note index until re-embedding succeeds.

Line 50 removes existing chunks before chunkFn() and embedChunks() complete. If embedding fails (e.g., model error, initialization issue), the note is left unindexed. This loss becomes permanent if save() runs afterward.

Suggested fix: remove old chunks only after successful embedding
 async indexNote(noteId: string, content: string): Promise<number> {
-    this.vectorStore.removeByNoteId(noteId);
-
     const chunks = this.chunkFn(noteId, content);
-    if (chunks.length === 0) return 0;
+    if (chunks.length === 0) {
+        this.vectorStore.removeByNoteId(noteId);
+        return 0;
+    }
 
     const embeddedChunks = await this.embeddingService.embedChunks(chunks);
+    this.vectorStore.removeByNoteId(noteId);
     this.vectorStore.add(embeddedChunks);
 
     return chunks.length;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/semanticSearch.ts` around lines 48 - 59, The code currently calls
vectorStore.removeByNoteId(noteId) before chunking/embedding which can
permanently drop the index if embedding fails; change indexNote (the method) to
first produce chunks via chunkFn(noteId, content) and await
embeddingService.embedChunks(chunks) and only after successful embedding call
vectorStore.removeByNoteId(noteId) followed by vectorStore.add(embeddedChunks);
ensure any thrown errors from embedding propagate so removal never happens on
failure (or use a tmp variable for embeddedChunks and only call removeByNoteId
and add when embeddedChunks is ready).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 20e85da5-50cf-4d03-9fcb-084054afcf11

📥 Commits

Reviewing files that changed from the base of the PR and between 9a87bee and 9bdabe3.

📒 Files selected for processing (2)
  • src/ai/semanticSearch.ts
  • src/ai/vectorStore.ts

@dorodb-web22 dorodb-web22 force-pushed the feat/ai-embedding-pipeline branch from 3b0db18 to c21392e on March 7, 2026 at 05:47
