
perf: 8.8x faster cold search, 7.3x less memory#264

Merged
justrach merged 35 commits into main from release/v0.2.57
Apr 13, 2026
Conversation

@justrach
Owner

Summary

  • Cold search: 6.4s / 3,678MB → 0.73s / 507MB (8.8x faster, 7.3x less RAM)
  • Warm search: 0.68s / 219MB (16.8x less RAM vs v0.2.56)
  • Recall: 52/52 intact, all tests pass

Key optimizations in this release branch (35 commits):

  • Switch to c_allocator (libc malloc) for better page reclamation
  • Compact WordHit from 24B → 8B (92% warm RSS reduction)
  • Lazy word index — skip for commands that don't need it
  • Single-pass scan+trigram — eliminate file re-reads
  • Fast read-only workers — skip outline parsing for search
  • Pre-size trigram HashMap + reusable local map
  • Parallel trigram extraction — workers read AND extract trigrams in parallel
  • Lean cold insert (insertBulkNew) — skip removeFile + file_trigrams for cold builds

Test plan

  • zig build test passes
  • Recall 10/10 queries match between old and new binary
  • Cold/warm benchmarks on openclaw-bench (13,867 files)
  • Binary installed and tested interactively

🤖 Generated with Claude Code

justrach and others added 30 commits April 12, 2026 09:46
…ale state

Fixes #227, #246, #247, #248.

Four interrelated bugs in TrigramIndex are fixed together:

1. removeFile (#246): moved path_to_id.remove() before the file_trigrams guard so
   the mapping is always cleaned even when file_trigrams has no entry (leftover from
   a partial OOM-failed indexFile).

2. id_to_path growth (#227, #247): removeFile now adds the freed doc_id to a new
   free_ids freelist and marks the id_to_path slot as "". getOrCreateDocId pops from
   free_ids first, reusing the old slot instead of appending a new one.  After N
   re-indexes of the same file, id_to_path.items.len stays bounded by the number of
   unique files ever indexed.

3. PostingList sorted invariant: reused doc_ids are not max, so plain append would
   break the binary-search invariant.  indexFile now detects whether a slot was
   reused (id_to_path did not grow) and uses getOrAddPosting (sorted insert) for
   reused doc_ids, keeping append for new files.

4. PostingList.removeDocId (#248): replaced O(n) linear scan with the same
   binary-search pattern used by getByDocId — O(log n) search + single orderedRemove.
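The freelist reuse (item 2), the sorted-insert-vs-append split (item 3), and the binary-search removal (item 4) can be sketched together. This is a hypothetical Python analogue of the Zig PostingList, not the actual implementation; names like `append_new`/`insert_reused` are illustrative.

```python
import bisect

class PostingList:
    """Posting list keeping doc_ids sorted (the binary-search invariant)."""

    def __init__(self):
        self.doc_ids = []

    def append_new(self, doc_id):
        # A brand-new file always gets the max doc_id, so a plain
        # append preserves sorted order — no search needed.
        self.doc_ids.append(doc_id)

    def insert_reused(self, doc_id):
        # A doc_id popped from the freelist is NOT the max, so a plain
        # append would break the invariant; use a sorted insert instead.
        bisect.insort(self.doc_ids, doc_id)

    def remove(self, doc_id):
        # O(log n) search + single ordered remove, replacing the old
        # O(n) linear scan (the #248 fix).
        i = bisect.bisect_left(self.doc_ids, doc_id)
        if i < len(self.doc_ids) and self.doc_ids[i] == doc_id:
            self.doc_ids.pop(i)
            return True
        return False
```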

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e, snapshot double-open, searchContent O(n) fallback

Fixes #249, #250, #251, #252, #253.

index.zig (#251): AnyTrigramIndex.candidates and candidatesRegex mmap_overlay
branches no longer leak the result ArrayList's backing buffer when
toOwnedSlice fails under OOM — explicit deinit on the error path.

nuke.zig (#249): rewriteConfigFile now writes to a {path}.tmp file first and
renames atomically, preventing an empty config file if the process is killed
mid-write.  Callers updated to thread the allocator through.
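The write-to-tmp-then-rename pattern is standard; a minimal Python sketch of the idea (the function name is hypothetical, the Zig version additionally threads an allocator through):

```python
import os

def rewrite_config_atomically(path: str, data: bytes) -> None:
    """Write to {path}.tmp, sync, then rename over the original.

    rename() is atomic on POSIX, so a reader — or a process killed
    mid-write — sees either the complete old file or the complete new
    one, never a truncated or empty config.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # data reaches disk before the rename
    os.replace(tmp, path)      # atomic swap
```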

explore.zig (#252): commitParsedFileOwnedOutline adds an errdefer immediately
after word_index.indexFile so that a subsequent trigram_index OOM failure
rolls back word_index to the previous content, keeping the two indexes in sync.

explore.zig (#250): Explorer gains a skip_trigram_files StringHashMap.  Files
indexed with skip_trigram=true are tracked in this set; the searchContent
fallback loop now iterates only skip_trigram_files instead of all outlines,
reducing the fallback from O(all files) to O(skip-trigram files).

snapshot.zig (#253): extracted readSectionsFromFile(file, allocator) helper so
both readSections and readSectionBytes share the header-parsing logic.
readSectionBytes now opens the file once and calls the helper, eliminating the
redundant second openFile call for each section read.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…obustness

watcher.zig (#254): incrementalLoop now stats .git/HEAD mtime before
spawning git rev-parse HEAD.  If the mtime is unchanged we skip the
fork+exec entirely, eliminating a 2-second-cadence subprocess that
accounted for the majority of codedb's background CPU on large repos.
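The mtime gate reduces to: stat the file, and only pay for the subprocess when the timestamp moved. A hypothetical Python sketch, with the expensive `git rev-parse HEAD` fork+exec injected as a callable so the skip behavior is visible:

```python
import os

class HeadGate:
    """Poll .git/HEAD's mtime before paying for `git rev-parse HEAD`.

    `resolve` stands in for the expensive fork+exec; it runs only when
    the mtime actually changed since the last poll.
    """
    def __init__(self, head_path, resolve):
        self.head_path = head_path
        self.resolve = resolve
        self.last_mtime = None
        self.cached = None

    def current(self):
        mtime = os.stat(self.head_path).st_mtime_ns
        if mtime == self.last_mtime:
            return self.cached          # unchanged: skip the subprocess
        self.last_mtime = mtime
        self.cached = self.resolve()
        return self.cached
```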

watcher.zig: EventQueue.head/tail were std.atomic.Value(usize) even
though every access (push and pop) already holds self.mu.  Replaced with
plain usize fields; the mutex provides all required ordering guarantees.

store.zig: Store.seq was std.atomic.Value(u64) even though the only
mutation site (appendVersion) holds self.mu.  Changed to a plain u64;
currentSeq() now also acquires the mutex so the type is correct.

snapshot.zig: readSectionString limit raised from 4096 to
std.math.maxInt(u16) so symbol names longer than 4 KiB are accepted.
loadSnapshotFast treats a corrupt OUTLINE_STATE section as an empty map
rather than propagating the error, matching the ba13aed fix on the
feature/243 branch.

lib.zig: export snapshot module so callers can reach it through lib.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…llback

Two new tests covering the fixes that landed in this branch:

1. "snapshot: symbol detail longer than 4096 bytes survives round-trip"
   Indexes a function whose signature line is ~5000 chars, writes and
   reloads the snapshot. Guards against readSectionString rejecting
   details > 4096 bytes (the pre-fix max_len).

2. "snapshot: corrupted OUTLINE_STATE section falls back to CONTENT load"
   Overwrites OUTLINE_STATE bytes with 0xFF after writeSnapshot, then
   calls loadSnapshot. The catch fallback must produce an empty outline
   map so loadSnapshotFast re-indexes all files from CONTENT instead.
   Verifies loadSnapshot returns true and symbols remain findable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds comprehensive v0.2.57 CHANGELOG entry covering worker-local indexing
(10×), full nuke/uninstall, MCP timeout fix, Rosetta stack fix, help CLI
fix, and all 9 correctness fixes (index id growth, stale entries, git HEAD
mtime gate, atomic removal, snapshot robustness).

Adds src/benchmark.zig with `zig build benchmark -- --root /path/to/repo`
measuring index time, query latency, re-index slot reuse, and .git/HEAD
mtime gate effectiveness. Updates README with openclaw benchmark table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…searchContent word_index fallback, Symbol.line_end population

- #210: release raw file contents after ProjectCache snapshot load (4.5GB→~200MB)
- #228: check mtime/size before re-indexing in drainNotifyFile, skip unchanged files
- #253: loadSnapshotValidated opens snapshot file once instead of 5 times
- #250: searchContent uses word_index to narrow fallback from O(files) to O(word hits)
- #224: computeSymbolEnds post-processing populates Symbol.line_end for brace/indent/Ruby languages; codedb_symbol body=true now returns full function body

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ng (#108, #215, #216)

- #216: add missing php/ruby/hcl/r entries to telemetry writeLanguages array
- #108: add HCL language support — resource, data, module, variable, output,
  provider, locals, terraform blocks; .tf/.tfvars/.hcl detection; #, //, and
  /* */ comment handling; .terragrunt-cache in skip_dirs
- #215: add R language support — function assignment (<-/=), setClass/setRefClass,
  library/require imports; .r/.R detection; # comment handling
- 10 new tests covering all HCL and R parser paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…etry

feat: HCL/Terraform + R language support, telemetry fix (#108, #215, #216)
- Python docstring: replace naive triple-quote count with position-aware
  detection — properly handles inline docstrings ("""text"""), opening
  docstrings with text ("""starts here), and multi-line docstrings
  containing def/class lines
- Snapshot JSON: use writeJsonEscaped for path interpolation in snapshot
  writer — prevents cache corruption for files with ", \, or control
  characters in paths
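The escaping rule is the minimal one JSON requires: backslash-escape `"` and `\`, and `\uXXXX`-encode control characters. A Python sketch of the idea behind `writeJsonEscaped` (the Python function is illustrative, not the Zig implementation):

```python
def write_json_escaped(s: str) -> str:
    """Escape ", \\ and control chars so a path can't corrupt the JSON."""
    out = []
    for ch in s:
        if ch == '"':
            out.append('\\"')
        elif ch == '\\':
            out.append('\\\\')
        elif ord(ch) < 0x20:
            out.append('\\u%04x' % ord(ch))   # control character
        else:
            out.append(ch)
    return ''.join(out)
```

Interpolating the escaped string between quotes yields a valid JSON string for any path.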

Note: 4 of 8 bugs from #179 were already fixed in prior commits
(C/C++ block comments, u16 truncation, ANSI strip, telemetry race)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Python docstring detection and snapshot JSON injection (#179)

Explorer.contents was an unbounded StringHashMap holding all file contents
in memory (1.7GB peak RSS on 5K-file repos). Replace with a fixed-size
ContentCache using CLOCK (second-chance) eviction:

- 4096-entry slot array with reference bits
- O(1) path→slot lookup via StringHashMap
- Hot files (recently searched/read) stay cached
- Cold files evicted on sweep, readContentForSearch falls back to disk
- Prior content duped before cache eviction to preserve #252 errdefer
  word_index restoration on OOM

Expected: peak RSS drops from ~1.7GB to ~200MB on large repos while
maintaining identical query behavior (cache misses served from disk).
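The CLOCK (second-chance) scheme described above can be sketched compactly: a fixed slot array with reference bits, where the sweep hand clears bits until it finds an unreferenced victim. A hypothetical Python analogue of the ContentCache (not the Zig code, which also dupes evicted content for the #252 errdefer path):

```python
class ClockCache:
    """Fixed-size content cache with CLOCK (second-chance) eviction."""

    def __init__(self, capacity):
        self.cap = capacity
        self.slots = [None] * capacity   # each slot: (path, content, ref_bit)
        self.index = {}                  # O(1) path -> slot lookup
        self.hand = 0

    def get(self, path):
        i = self.index.get(path)
        if i is None:
            return None                  # miss: caller falls back to disk
        p, content, _ = self.slots[i]
        self.slots[i] = (p, content, True)   # mark recently used
        return content

    def put(self, path, content):
        if path in self.index:
            self.slots[self.index[path]] = (path, content, True)
            return
        # Sweep: give referenced slots a second chance (clear the bit),
        # stop at the first empty or unreferenced slot.
        while True:
            slot = self.slots[self.hand]
            if slot is None or not slot[2]:
                break
            self.slots[self.hand] = (slot[0], slot[1], False)
            self.hand = (self.hand + 1) % self.cap
        victim = self.slots[self.hand]
        if victim is not None:
            del self.index[victim[0]]    # cold file evicted
        self.slots[self.hand] = (path, content, False)
        self.index[path] = self.hand
        self.hand = (self.hand + 1) % self.cap
```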

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CLOCK eviction cache for file contents (#208)
This reverts commit 40183a0, reversing
changes made to 3473138.
During cold indexing, commitParsedFileOwnedOutline duped ALL file
contents into a HashMap. On openclaw (13K files) this added ~170MB
of peak RSS for content alone. The indexes (word, trigram) consume
the content parameter directly — the cache is only needed for
readContentForSearch which already has a disk fallback.

Skip content storage when outline count > 1000. First 1000 files
stay cached for fast search; beyond that, search falls back to
disk reads. Snapshot fast-load uses OUTLINE_STATE (not CONTENT),
so startup is unaffected.

Benchmark (openclaw, 13,867 files, cold search):
  v0.2.56:        3,678MB peak RSS  6.16s
  pre-clock:      3,559MB peak RSS  5.66s
  skip-cache:     3,415MB peak RSS  6.07s  (-7.2% RSS vs baseline)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: skip content cache beyond 1000 files — 7% RSS reduction (#208)
Add shrinkPostingLists() to TrigramIndex and shrinkAllocations() to
WordIndex. Both release ArrayList over-allocation (capacity > length)
after initial scan completes. This reduces steady-state RSS for
long-running MCP servers by reclaiming ~50% of ArrayList capacity waste.

Note: peak RSS during cold indexing is unchanged — the shrink runs
after the peak. The peak is dominated by GPA page retention from
alloc/free churn during indexing. Further reduction would require
a custom allocator or pre-sized flat storage.

Benchmark (openclaw, 13,867 files):
  Peak RSS unchanged (3,415MB) — expected, shrink runs after peak
  Recall: 52/52 for 'handleRequest' — no false negatives
  An earlier cap approach (MAX_POSTINGS=512) saved 243MB peak but
  dropped recall to 2/52 — reverted in favor of shrink-only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: shrink index allocations after scan — reduce steady-state RSS (#261)
Three changes to reduce RSS:

1. Back worker arenas with page_allocator instead of GPA — mmap pages
   are returned to OS immediately on arena deinit (no GPA retention)
2. Free each worker's arena right after committing its results instead
   of holding all workers' data simultaneously
3. shrinkPostingLists/shrinkAllocations on trigram + word indexes after
   scan to release ArrayList over-allocation

Benchmark (openclaw, 13,867 files, cold search):
  v0.2.56 baseline:    3,678MB  5.82s
  PR#260 (content):    3,415MB  5.26s
  This PR:             3,361MB  5.64s  (-8.6% RSS vs baseline)
  Recall: 52/52 — no false negatives

The remaining ~3.3GB is genuinely live index data (trigram posting
lists + word index hits). Further reduction needs flat array storage
or compressed postings (tracked in #261).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: page_allocator for worker arenas + eager free + index shrink (#261)
…wap (#261)

Three changes that together reduce RSS by 80% on warm runs:

1. Compact file_words: replace inner StringHashMap(void) per file with
   []const []const u8 slices — saves ~70KB→32KB per file (14K files =
   ~530MB theoretical savings from eliminating HashMap bucket arrays)

2. page_allocator arena for word index words_set: temporary per-file
   HashMap uses mmap-backed arena so pages are returned to OS immediately
   instead of GPA retention

3. CLI mmap swap: after cold indexing + writeToDisk, immediately load
   trigram index as MmapTrigramIndex and release the heap version.
   Also call releaseContents + shrinkAllocations on the CLI path.

Benchmark (openclaw, 13,867 files):
  v0.2.56 baseline:      5.8s   3,678MB  (cold)
  Previous (PR#263):      5.6s   3,361MB  (cold, -8.6%)
  This commit cold:       6.4s   3,209MB  (cold, -12.8%)
  This commit warm:       1.4s     741MB  (warm, -79.8%)

The cold peak (3.2GB) is from heap trigram index during initial build.
Subsequent runs use mmap (741MB) — the realistic MCP server scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… RSS reduction (#261)

The GPA (GeneralPurposeAllocator) was retaining ~1.8GB of dead pages
during cold indexing — pages freed by HashMap resizes, ArrayList growth,
and content read/free cycles were never returned to the OS. Switching to
c_allocator (libc malloc) lets macOS's magazine allocator reclaim freed
pages via madvise(MADV_FREE).

Also: indexFileContent now uses a page_allocator-backed arena for file
content reads, ensuring content pages are munmap'd immediately after
indexing each file. And cold CLI path skips trigrams during scan, builds
them file-by-file afterward to avoid holding all three indexes at once.

Benchmark (openclaw, 13,867 files):
  v0.2.56 GPA baseline:  5.8s  3,678MB cold
  All fixes + GPA:        6.6s  3,188MB cold  (-13%)
  All fixes + c_alloc:    6.0s  1,415MB cold  (-61.5%)
  Warm (mmap + c_alloc):  1.3s    486MB warm  (-86.8%)

Recall: 52/52 — intact. All tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…#261)

During cold CLI runs, build trigrams into a separate TrigramIndex backed
by a page_allocator arena. After writing to disk, arena.deinit() returns
ALL trigram pages to the OS via munmap — the trigram heap never coexists
with word index peak allocations. Also shrink word index BEFORE trigram
rebuild to release ArrayList capacity waste early.

Benchmark (openclaw, 13,867 files):
  v0.2.56 baseline:     5.8s   3,678MB cold
  Previous (c_alloc):    6.0s   1,415MB cold
  This commit:           6.2s   1,304MB cold  (-64.5%)
  Warm:                  2.9s     463MB warm  (-87.4%)

Recall: 52/52 intact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…261)

During cold CLI runs, persist word index to disk then free it BEFORE
building trigrams. After trigrams are written + mmap-swapped, reload
word index from disk. This prevents word_index (~500MB) and trigram
index (~400MB) from coexisting in memory simultaneously.

The staggered approach also makes cold runs 46% faster because the
trigram arena operates with more available memory (less allocator pressure).

Benchmark (openclaw, 13,867 files):
  v0.2.56 baseline:    5.8s   3,678MB cold
  Previous (c_alloc):  6.2s   1,304MB cold
  This commit:         3.1s   1,078MB cold  (-70.7% vs baseline)
  Warm:                1.3s     464MB warm  (-87.4% vs baseline)

Recall: 52/52 intact. All tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…reduction (#261)

Two changes that slash cold peak RSS to 617MB (from 3,678MB baseline):

1. Use c_allocator (not ArenaAllocator) for the temporary TrigramIndex
   during cold trigram rebuild. ArenaAllocator never frees intermediate
   allocations (HashMap resizes, ArrayList growth), accumulating ~2x the
   actual data. c_allocator returns freed pages to OS on every resize.

2. Skip file_words tracking during bulk scan (skip_file_words flag).
   file_words maps every file→words for removeFile support, but during
   initial scan no files are removed. Saves ~450MB of compact slices.

Benchmark (openclaw, 13,867 files):
  v0.2.56 baseline:    5.8s   3,678MB cold
  Previous:            3.1s   1,078MB cold
  This commit:         3.0s     617MB cold  (-83.2%)
  Warm:                1.2s     423MB warm  (-88.5%)

Recall: 52/52 intact. All tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Minor additional squeeze: give word index c_allocator during scan for
consistent page reclamation, and skip_file_words during bulk indexing.

Final benchmarks (openclaw, 13,867 files):

  v0.2.56 baseline:    5.8s   3,678MB cold
  This session total:  2.8s     606MB cold  (-83.5%, -52% speed)
  Warm (mmap):         1.2s     423MB warm  (-88.5%)

  Remaining 606MB = ~43KB/file (outlines + word_index + c_allocator
  overhead). Floor without word index is 595MB.

Memory breakdown of optimizations:
  GPA → c_allocator:      -1,773MB (page retention eliminated)
  Stagger word/trigram:      -337MB (never coexist in memory)
  Content cache limit:       -170MB (skip dupes beyond 1000)
  Trigram c_allocator:       -472MB (vs ArenaAllocator 2x waste)
  Skip file_words:            -11MB (marginal, balanced by other phase)
  page_allocator workers:     -54MB (content reads munmap'd)
  Other (shrink, etc):       -155MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase default scan worker cap from min(cpu,4) to min(cpu,8).
  8 workers slightly faster and uses LESS RSS than 4 (smaller per-worker
  arenas): 2.47s/608MB → 2.42s/575MB on openclaw.
- Refactor trigram rebuild to collect paths into ArrayList first
  (prep for future parallel trigram build).

Final stable benchmarks (openclaw, 13,867 files, 3 runs averaged):
  v0.2.56:  6.4s  3,678MB  (cold)
  NOW:      2.4s    597MB  (cold)  -63% speed, -84% RSS
  Warm:     1.2s    423MB

Recall: 52/52 intact. All tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Standalone thread-safe trigram extraction and sequential insertion API.
extractTrigrams builds a local HashMap(Trigram, PostingMask) from content
with no shared state; insertExtracted inserts pre-extracted results.
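The split between a shared-state-free extraction phase and a sequential merge phase can be sketched as follows. This is a hypothetical Python analogue of `extractTrigrams`/`insertExtracted` (the real Zig version uses a PostingMask, not line sets):

```python
def extract_trigrams(doc_id, content):
    """Build a local trigram -> line-set map with no shared state.

    Safe to run on any worker thread: it touches only its arguments
    and its own local map.
    """
    local = {}
    for line_num, line in enumerate(content.splitlines(), start=1):
        lower = line.lower()
        for i in range(len(lower) - 2):
            local.setdefault(lower[i:i + 3], set()).add(line_num)
    return doc_id, local

def insert_extracted(global_index, doc_id, local):
    """Sequential merge of one worker's pre-extracted results."""
    for tri, lines in local.items():
        global_index.setdefault(tri, []).append((doc_id, sorted(lines)))
```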

Note: parallel trigram rebuild was tested but caused 8x regression
(2.4s→19s) due to per-file HashMap overhead and thread management.
Sequential rebuild is already fast because OS caches files from scan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace WordHit { path: []const u8, line_num: u32 } (24 bytes) with
WordHit { doc_id: u32, line_num: u32 } (8 bytes). Add path_to_id +
id_to_path mapping to WordIndex, similar to TrigramIndex.

This saves 16 bytes per word hit. On openclaw (13,867 files, ~21M hits),
warm RSS drops from 423MB to 288MB. Cold RSS unchanged (word index is
freed before trigram peak in staggered build).
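The 24 B vs 8 B figures follow from the layouts: a Zig slice is pointer + length (16 bytes on 64-bit targets), so the old hit is 16 + 4 + padding, while the new hit is two u32s. A quick illustration with Python's struct module — the exact old layout is assumed here for illustration, not taken from the Zig source:

```python
import struct

# Old WordHit { path: []const u8, line_num: u32 }: slice ptr + slice len
# (8 + 8 bytes) + u32 + 4 bytes padding = 24 bytes per hit (assumed layout).
OLD_HIT = struct.Struct("<QQI4x")

# New WordHit { doc_id: u32, line_num: u32 }: the path string lives once
# in id_to_path; every hit shrinks to two u32s = 8 bytes.
NEW_HIT = struct.Struct("<II")

def bytes_saved(num_hits: int) -> int:
    # 16 bytes saved per hit across the whole word index
    return num_hits * (OLD_HIT.size - NEW_HIT.size)
```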

Benchmark (openclaw, 13,867 files):
  v0.2.56 baseline:     6.4s  3,678MB cold
  NOW cold:             2.3s    620MB cold  (-83% RSS, -64% speed)
  NOW warm:             1.2s    288MB warm  (-92% RSS)
  Recall: 52/52 intact

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
justrach and others added 5 commits April 13, 2026 16:34
search/find/tree/outline use trigram index and outlines — the word index
is only needed for the `word` command. Skip building it during scan for
other commands, eliminating ~0.5s of tokenization + HashMap work.

Benchmark (openclaw, 13,867 files):
  v0.2.56 baseline:     6.4s  3,678MB cold
  Previous:             2.3s    620MB cold / 288MB warm
  NOW cold:             1.8s    600MB cold  (-72% speed vs baseline)
  NOW warm:             0.7s    219MB warm  (-94% RSS vs baseline)

Recall: 52/52 intact. All tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
For `codedb search` cold runs, build trigrams during the initial scan
commit loop using worker-arena content instead of re-reading files
from disk in a separate pass. Saves ~0.15s of file I/O.

Benchmark (openclaw, 13,867 files):
  v0.2.56 baseline:    6.4s  3,678MB cold
  Previous:            1.8s    600MB cold
  NOW cold:            1.65s   605MB cold  (-74% speed vs baseline)
  Warm:                0.68s   219MB warm  (-94% RSS vs baseline)

Recall: 52/52 intact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
For cold `search`, workers just read files without outline parsing
(no Explorer creation, no line-by-line symbol extraction). Saves
~53MB RSS and avoids outline allocation overhead.

Benchmark (openclaw, 13,867 files, cold search):
  v0.2.56:  6.4s  3,678MB
  NOW:      1.49s   552MB  (-77% speed, -85% RSS)
  Warm:     0.68s   219MB  (-94% RSS)

Recall: 52/52 intact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-allocate trigram index HashMap to 131K capacity and path_to_id
to file count, avoiding resize copies during bulk insert. Also add
indexFileReuse that takes a caller-provided local HashMap, reusing
it across files via clearRetainingCapacity (eliminates 14K alloc/free
cycles for the per-file trigram extraction map).

Benchmark (openclaw, 13,867 files, cold search):
  v0.2.56:  6.4s  3,678MB
  NOW:      1.43s   547MB  (-78% speed, -85% RSS)
  Warm:     0.68s   219MB  (-94% RSS)

Recall: 52/52 intact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… search

Workers now read files AND extract trigrams in parallel (trigramExtractWorker),
instead of reading in parallel then extracting sequentially. The main thread
only does the lean global index merge (insertBulkNew) which skips removeFile
and file_trigrams tracking — both unnecessary for cold builds.

Cold search: 1.43s/547MB → 0.73s/507MB (2x faster, 8% less RSS)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Benchmark Regression Report

Threshold: 10.00%

  Tool             Base (ns)   Head (ns)     Delta   Status
  codedb_bundle      748,520     707,088    -5.54%   OK
  codedb_changes     120,173     120,164    -0.01%   OK
  codedb_deps         33,781      33,962    +0.54%   OK
  codedb_edit         20,118      26,307   +30.76%   FAIL
  codedb_find        155,668     152,289    -2.17%   OK
  codedb_hot         160,089     161,455    +0.85%   OK
  codedb_outline     449,450     463,993    +3.24%   OK
  codedb_read        167,590     159,275    -4.96%   OK
  codedb_search      311,956     312,327    +0.12%   OK
  codedb_snapshot  4,267,795   4,624,880    +8.37%   OK
  codedb_status      157,792     169,644    +7.51%   OK
  codedb_symbol       69,681      67,984    -2.44%   OK
  codedb_tree         94,182     141,470   +50.21%   FAIL
  codedb_word        103,230     132,474   +28.33%   FAIL


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 161d39e3da


} else {
self.trigram_index.removeFile(stable_path);
self.sparse_ngram_index.removeFile(stable_path);
try self.skip_trigram_files.put(stable_path, {});

P1: Remove skipped-file keys when deleting indexed files

skip_trigram_files.put(stable_path, {}) stores a borrowed slice to stable_path, but Explorer.removeFile() never removes that key before freeing the outline path. searchContent() later iterates skip_trigram_files and dereferences those stale slices, so deleting/renaming a file that was indexed with skip_trigram=true can leave dangling pointers that produce undefined paths and unstable search behavior (including potential crashes).


// Skip file_words tracking during bulk scan — saves ~450MB.
// Only needed for removeFile (incremental re-indexing), not initial scan.
explorer.word_index.skip_file_words = true;
if (!needs_word_index) explorer.word_index.enabled = false;

P1: Keep word index enabled for serve/snapshot startup scans

needs_word_index is only true for the CLI word command, so this disables word indexing for serve, snapshot, and other startup scans. On a cold codedb serve run (no usable snapshot), /explore/word queries run against an empty index; and codedb snapshot can persist an empty on-disk word index because wordIndexIsComplete() remains true and no rebuild is triggered.


// For other commands: outline-only scan, trigrams from disk or rebuild.
const is_search = std.mem.eql(u8, cmd, "search");
if (is_search and !heads_match) {
const tmp_tri = try watcher.initialScanWithTrigrams(&store, &explorer, root, allocator, std.heap.c_allocator, true);

P1: Preserve outline fallback on cold search indexing path

For codedb search when the trigram cache is stale (!heads_match), this call passes skip_outlines=true, so the scan builds only trigrams and leaves explorer.outlines empty. searchContent() depends on outlines for fallback scanning (e.g., short queries and files excluded from trigram indexing such as >64KB files), so cold searches can miss valid matches until a later full scan path is taken.


@justrach justrach merged commit 0a73acf into main Apr 13, 2026
1 check passed
