implement algorithm 5 for inplace repair and algorithm 6 to clean up dangling edges #648

Open
Kartikk1127 wants to merge 2 commits into datastax:main from Kartikk1127:improve_markNodeDeleted

Conversation

@Kartikk1127

Problem

The current markNodeDeleted + cleanup() workflow has O(N) cost per deletion:

  • markNodeDeleted only flips a bit in the deleted set
  • cleanup() scans every node in the graph via nodeStream to find in-neighbors
    of the deleted node, then rebuilds their neighbor lists

This means per-deletion cost grows linearly with graph size, and crucially it
keeps rising over time as more deletions accumulate.
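For contrast, the old per-deletion cost can be sketched on a toy adjacency-list graph (the types and names below are illustrative assumptions, not JVector's actual internals, which walk nodeStream):

```java
import java.util.*;

public class NaiveCleanup {
    // Toy out-neighbor lists keyed by node id (assumed layout for illustration).
    static Map<Integer, List<Integer>> graph = new HashMap<>();

    // O(N * M): every deletion visits every node's neighbor list, so total work
    // grows with graph size regardless of how many edges actually touch `node`.
    static void cleanup(int node) {
        for (List<Integer> nbrs : graph.values())
            nbrs.removeIf(n -> n == node); // rebuild the list without the deleted node
        graph.remove(node);
    }

    public static void main(String[] args) {
        graph.put(0, new ArrayList<>(List.of(1, 2)));
        graph.put(1, new ArrayList<>(List.of(0, 2)));
        graph.put(2, new ArrayList<>(List.of(0)));
        cleanup(2); // scans all 3 nodes just to find the two in-neighbors of 2
    }
}
```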

Solution

The IP-DiskANN paper (arXiv:2502.13826) describes two algorithms that solve this:

Algorithm 5 — In-place deletion repair:
Instead of scanning all N nodes to find in-neighbors, run a GreedySearch toward
the deleted node's vector. Nodes the search visits are the approximate in-neighbors.
This reduces in-neighbor discovery from O(N) to O(DELETION_LD) where DELETION_LD
is the beam width of the search.

The sequence per deletion:

  1. Flip the deleted bit
  2. Update entry point if deleted node is the current entry point
  3. GreedySearch toward deleted node's vector with beam width DELETION_LD
  4. For each visited node z where z → deleted_node exists: repair z's neighbor
    list using the top-DELETION_LD search results as replacement candidates
  5. Physically remove the node
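The five steps above can be sketched on a toy graph with 1-D vectors. Everything here is an illustrative assumption (class names, data layout, the replacement-splicing policy in step 4), not the PR's actual code:

```java
import java.util.*;

public class InPlaceRepair {
    // Toy index state: out-neighbor lists, 1-D vectors, deleted-bit set.
    static Map<Integer, List<Integer>> graph = new HashMap<>();
    static Map<Integer, Float> vec = new HashMap<>();
    static Set<Integer> deleted = new HashSet<>();
    static int entryPoint = 0;
    static final int DELETION_LD = 4; // beam width of the repair search

    static float dist(int node, float q) { return Math.abs(vec.get(node) - q); }

    // Step 3: best-first search toward q; the visited set approximates the
    // in-neighbors of a node whose vector sits near q.
    static List<Integer> greedySearch(float q, int beam) {
        PriorityQueue<Integer> frontier =
                new PriorityQueue<>(Comparator.comparingDouble((Integer n) -> dist(n, q)));
        Set<Integer> visited = new LinkedHashSet<>();
        frontier.add(entryPoint);
        while (!frontier.isEmpty() && visited.size() < beam) {
            int cur = frontier.poll();
            if (!visited.add(cur)) continue;
            for (int nb : graph.get(cur))
                if (!visited.contains(nb) && !deleted.contains(nb)) frontier.add(nb);
        }
        return new ArrayList<>(visited);
    }

    static void deleteInPlace(int node) {
        deleted.add(node);                                   // 1. flip the deleted bit
        if (entryPoint == node)                              // 2. move the entry point
            entryPoint = graph.keySet().stream()
                    .filter(n -> !deleted.contains(n)).findFirst().orElse(-1);
        List<Integer> cand = greedySearch(vec.get(node), DELETION_LD); // 3. search
        for (int z : cand) {                                 // 4. repair visited in-neighbors
            List<Integer> nbrs = graph.get(z);
            if (nbrs.remove(Integer.valueOf(node)))
                for (int c : cand)                           // splice in one replacement
                    if (c != z && !nbrs.contains(c)) { nbrs.add(c); break; }
        }
        graph.remove(node);                                  // 5. physically remove
        vec.remove(node);
        deleted.remove(node);
    }

    public static void main(String[] args) {
        // Chain 0-1-2-3-4: nodes 1 and 3 both point at node 2.
        int[][] edges = {{0, 1}, {1, 0}, {1, 2}, {2, 1}, {2, 3}, {3, 2}, {3, 4}, {4, 3}};
        for (int i = 0; i < 5; i++) { graph.put(i, new ArrayList<>()); vec.put(i, (float) i); }
        for (int[] e : edges) graph.get(e[0]).add(e[1]);
        deleteInPlace(2);
        // Node 1 was visited by the search and repaired; node 3 was unreachable once
        // 2 was marked deleted, so its edge 3 -> 2 dangles. That is Algorithm 6's job.
        System.out.println(graph);
    }
}
```

Note how the toy run itself shows why the sweep is needed: the greedy search is approximate, so one in-neighbor is repaired and the other is missed.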

Algorithm 6 — Dangling edge sweep:
Algorithm 5 repairs in-neighbors found via the search path, but greedy search
is approximate and may miss some. Algorithm 6 is a periodic O(N × M) sweep
(no distance calculations) that removes any remaining out-edges pointing to
absent nodes.
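A minimal sketch of such a sweep, reusing the same assumed adjacency-list representation (the PR's real entry point for this is consolidateDanglingEdges(); the code below is a stand-in):

```java
import java.util.*;

public class DanglingEdgeSweep {
    static Map<Integer, List<Integer>> g = new HashMap<>();

    // One O(N * M) pass, no distance computations: drop any out-edge whose
    // target node is no longer present in the graph.
    static void sweep(Map<Integer, List<Integer>> graph) {
        Set<Integer> live = graph.keySet();
        for (List<Integer> nbrs : graph.values())
            nbrs.removeIf(n -> !live.contains(n));
    }

    public static void main(String[] args) {
        g.put(0, new ArrayList<>(List.of(1, 7))); // node 7 was deleted elsewhere
        g.put(1, new ArrayList<>(List.of(0)));
        sweep(g);
        System.out.println(g); // {0=[1], 1=[0]}
    }
}
```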

Benchmark Results (SIFT-1M, M=16, efConstruction=200, efSearch=200)

100K deletions (10% of index), 1000 query vectors, topK=10:

Batch    Deleted    Avg/delete    Recall@10    Degradation
1/10      10,000     1.55 ms       0.9517        0.17%
2/10      20,000     1.54 ms       0.9508        0.26%
5/10      50,000     1.52 ms       0.9459        0.75%
10/10    100,000     1.49 ms       0.9279        2.55%

Baseline recall: 0.9534 → post-deletion recall: 0.9279, an absolute drop of 2.55 percentage points

Key observations:

  • Per-deletion cost is flat across all 10 batches (1.49–1.55ms) — does not grow as deletions accumulate
  • Old O(N) nodeStream approach: ~23ms/delete initially, growing to ~39ms — and compounding
  • 100K deletions completed in 2.5 minutes vs projected ~60 minutes with old approach
  • Recall degradation stays well within acceptable bounds at 10% deletion rate

API changes

markNodeDeleted becomes self-contained — no cleanup() call needed after deletion.
cleanup() is still required before writing to disk.

consolidateDanglingEdges() is a new public method for Algorithm 6 execution.
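The new call order might look like the sketch below, on a hypothetical stand-in object. The method names come from this PR's description, but the class and bodies are placeholders, not the real builder:

```java
import java.util.*;

public class DeletionWorkflow {
    // Hypothetical stand-in for the graph builder; only the call order matters here.
    static Set<Integer> nodes = new HashSet<>(List.of(0, 1, 2));

    static void markNodeDeleted(int n) { nodes.remove(n); } // self-contained: repairs in place
    static void consolidateDanglingEdges() { /* periodic Algorithm 6 sweep */ }
    static void cleanup() { /* still required before writing the index to disk */ }

    public static void main(String[] args) {
        markNodeDeleted(1);         // no cleanup() call needed after each deletion anymore
        consolidateDanglingEdges(); // run occasionally to catch missed in-neighbors
        cleanup();                  // only before persisting to disk
    }
}
```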

Implementation

The PR implements Algorithm 5 inside markNodeDeleted and exposes Algorithm 6 through the new consolidateDanglingEdges() method.

References

  • IP-DiskANN paper: arXiv:2502.13826