implement algorithm 5 for inplace repair and algorithm 6 to clean up … #648
Open
Kartikk1127 wants to merge 2 commits into datastax:main from
…dangling edges. cleanup() method can be deprecated now
Problem
The current markNodeDeleted + cleanup() workflow has O(N) cost per deletion:
- markNodeDeleted only flips a bit in the deleted set
- cleanup() scans every node in the graph via nodeStream to find in-neighbors of the deleted node, then rebuilds their neighbor lists
This means the cost of each deletion grows linearly with the size of the graph,
and the total cost keeps growing as more deletions accumulate.
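To make the O(N) cost concrete, here is a minimal sketch of in-neighbor discovery by full scan. It uses a toy adjacency map rather than the library's graph types; the class and method names are hypothetical and are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FullScanSketch {
    /**
     * Returns every node that has an out-edge to `deleted`.
     * Hypothetical helper: it inspects all N adjacency lists, which is
     * the linear cost that the scan-based workflow pays per deletion.
     */
    static List<Integer> inNeighborsByFullScan(Map<Integer, List<Integer>> graph, int deleted) {
        List<Integer> inNeighbors = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> entry : graph.entrySet()) {
            // contains() is a linear check over one node's neighbor list
            if (entry.getValue().contains(deleted)) {
                inNeighbors.add(entry.getKey());
            }
        }
        return inNeighbors;
    }
}
```

Only a graph-wide pass like this is guaranteed to find every in-neighbor, because the graph stores out-edges and keeps no reverse index.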
Solution
The IP-DiskANN paper (arXiv:2502.13826) describes two algorithms that solve this:
Algorithm 5 — In-place deletion repair:
Instead of scanning all N nodes to find in-neighbors, run a GreedySearch toward
the deleted node's vector. Nodes the search visits are the approximate in-neighbors.
This reduces in-neighbor discovery from O(N) to O(DELETION_LD) where DELETION_LD
is the beam width of the search.
The sequence per deletion:
list using the top-DELETION_LD search results as replacement candidates
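The idea behind Algorithm 5 can be sketched on a toy in-memory graph as follows. This is a simplified illustration, not the paper's exact procedure and not the PR's code: the class, the deleteInPlace method, the squared-L2 distance, and the value of DELETION_LD are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class InPlaceDeleteSketch {
    static final int DELETION_LD = 4; // beam width; arbitrary value for the toy example
    final Map<Integer, float[]> vectors = new HashMap<>();
    final Map<Integer, Set<Integer>> out = new HashMap<>();
    final int maxDegree;

    InPlaceDeleteSketch(int maxDegree) { this.maxDegree = maxDegree; }

    void addNode(int id, float[] vector, int... neighbors) {
        vectors.put(id, vector);
        Set<Integer> edges = new HashSet<>();
        for (int n : neighbors) edges.add(n);
        out.put(id, edges);
    }

    // Squared L2 distance (assumed metric for this sketch).
    static float dist(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return sum;
    }

    Comparator<Integer> closestTo(float[] target) {
        return Comparator.comparingDouble(n -> dist(vectors.get(n), target));
    }

    /** Beam search toward `query`; records expanded nodes in `visited`, returns the final beam. */
    List<Integer> greedySearch(int entry, float[] query, int beamWidth, Set<Integer> visited) {
        List<Integer> beam = new ArrayList<>();
        beam.add(entry);
        while (true) {
            Integer next = null;
            for (Integer n : beam) {          // beam is sorted, so this is the closest unexpanded node
                if (!visited.contains(n)) { next = n; break; }
            }
            if (next == null) return beam;    // every node in the beam has been expanded
            visited.add(next);
            for (Integer nb : out.get(next)) {
                // skip dangling edges left behind by earlier deletions
                if (vectors.containsKey(nb) && !beam.contains(nb)) beam.add(nb);
            }
            beam.sort(closestTo(query));
            if (beam.size() > beamWidth) beam.subList(beamWidth, beam.size()).clear();
        }
    }

    /** Deletes `deleted` and repairs the in-neighbors that the search happened to visit. */
    void deleteInPlace(int deleted, int entry) {
        Set<Integer> visited = new HashSet<>();
        List<Integer> candidates =
            new ArrayList<>(greedySearch(entry, vectors.get(deleted), DELETION_LD, visited));
        candidates.remove(Integer.valueOf(deleted));
        for (int p : visited) {
            Set<Integer> edges = out.get(p);
            if (p == deleted || !edges.remove(deleted)) continue; // p had no edge to the deleted node
            List<Integer> ranked = new ArrayList<>(candidates);
            ranked.sort(closestTo(vectors.get(p)));
            for (int c : ranked) {            // refill p's list with its closest candidates
                if (edges.size() >= maxDegree) break;
                if (c != p) edges.add(c);
            }
        }
        vectors.remove(deleted);
        out.remove(deleted);
    }
}
```

In this sketch the work per deletion is bounded by the number of nodes the beam search expands rather than by N. Any in-neighbor the search never visits keeps its edge to the removed node, which is the case the dangling edge sweep is meant to handle.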
Algorithm 6 — Dangling edge sweep:
Algorithm 5 repairs in-neighbors found via the search path, but greedy search
is approximate and may miss some. Algorithm 6 is a periodic O(N × M) sweep
(no distance calculations) that removes any remaining out-edges pointing to
absent nodes.
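A minimal sketch of such a sweep over a toy adjacency map is shown below. The class and method names are hypothetical and this is not the PR's consolidateDanglingEdges() implementation; it assumes a node is "absent" when it is no longer a key of the map.

```java
import java.util.Map;
import java.util.Set;

public class DanglingEdgeSweepSketch {
    /**
     * Removes every out-edge whose target is no longer present in `graph`
     * and returns the number of edges removed. Each of the N nodes has at
     * most M edges checked with a map lookup, and no vectors are touched.
     */
    static int sweep(Map<Integer, Set<Integer>> graph) {
        int removed = 0;
        for (Set<Integer> edges : graph.values()) {
            int before = edges.size();
            edges.removeIf(target -> !graph.containsKey(target));
            removed += before - edges.size();
        }
        return removed;
    }
}
```

Because the sweep only drops edges and never adds replacements, it restores consistency but does not by itself restore the degree of the affected nodes.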
Benchmark Results (SIFT-1M, M=16, efConstruction=200, efSearch=200)
100K deletions (10% of index), 1000 query vectors, topK=10:
Baseline recall: 0.9534 → Post-deletion recall: 0.9279 (absolute drop of 0.0255, i.e. 2.55 percentage points)
Key observations:
API changes
- markNodeDeleted becomes self-contained: no cleanup() call is needed after deletion.
- cleanup() is still required before writing to disk.
- consolidateDanglingEdges() is a new public method for Algorithm 6 execution.
Implementation
This PR implements both algorithms: Algorithm 5 for in-place deletion repair and Algorithm 6 for the dangling edge sweep.
References
@marianotepper