Skip to content

Support file-based serialize/deserialize for DiskANN indices using Save/Load traits#1079

Draft
suhasjs wants to merge 28 commits into
mainfrom
mhildebr/saveload
Draft

Support file-based serialize/deserialize for DiskANN indices using Save/Load traits#1079
suhasjs wants to merge 28 commits into
mainfrom
mhildebr/saveload

Conversation

@suhasjs
Copy link
Copy Markdown
Contributor

@suhasjs suhasjs commented May 16, 2026

  • Does this PR have a descriptive title that could go in our release notes?
  • Does this PR add any new dependencies?
  • Does this PR modify any existing APIs?
  • Is the change to the API backwards compatible?
  • Should this result in any changes to our documentation, either updating existing docs or adding new ones?

Reference Issues/PRs

Initial attempt at #737

What does this implement/fix? Briefly explain your changes.

Introduces two new traits (diskann_record::save::Save and diskann_record::load::Load)and a new create diskann-record to support file-based serialization + deserialization. This PR also implements the traits for DiskANNIndex (and all nested structs) for a successful prototype.
Sample output can be found here.

Any other comments?

This PR is meant to be in a draft state until we're happy with the abstractions and the results.

Mark Hildebrand and others added 21 commits April 15, 2026 14:32
…borrow for Writer::finish() since it calls std::io::Write::flush() that needs a mutable borrow
… wherever needed + modify writer return type to Result<>
… diskann-providers with a new SingleUseWriteProvider (also for reads with SingleUseReadProvider)
…ad_from_bin<> to only use one read since VectorDataIterator already has the metadata needed for
@suhasjs suhasjs added this to the 2026-05 milestone May 16, 2026
@suhasjs suhasjs self-assigned this May 16, 2026
@suhasjs suhasjs added the enhancement New feature or request label May 16, 2026
@suhasjs suhasjs moved this to In Progress in DiskANN backlog May 16, 2026
Copy link
Copy Markdown
Contributor

@hildebrandmw hildebrandmw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Suhas! My comments this round are mainly about enum handling. I think we can do this more cleanly by having enums "handle themselves" rather than lifting it into a somewhat fragile interaction in the Save/Load traits.

Other than that, the code inside diskann-record could use considerable hardening and testing.

Do we want to plan out how plugins like VFS would work in this PR, or save that for the future? I ask because VFS is used pretty heavily in tests, so having a replacement that's compliant with this new system will be required. And it will be a good stress test that we can expand the backend serialization to other formats in the future.

Comment thread diskann-record/src/save/mod.rs Outdated
/// load that the tag's presence matches the corresponding [`Load::IS_ENUM`].
fn variant(&self) -> Option<Cow<'_, str>> {
None
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this for longer than I care to admit, I think we can get rid of enum handling entirely at the trait level. If we consider three cases:

enum UnitEnum {
    A,
    B,
    C,
}

enum InlineStruct {
    A(f32),
    B {
       x: String,
       y: f32,
    }
}

enum Struct {
    A(A),
    B(B),
}

struct A {
    field: f32,
}

struct B {
    x: String,
    y: f32,
}

The serialized representations in JSON could be

{
    "$version": "0.0.0",
    "A": null,
}

{
    $version": "0.0.0",
    "B": {
        "x": "hello world",
        "y": 1,
    }
}

{
    "$version": "0.0.0",
    "A": {
        "$version": "0.0.1",
        "0": 1,
    }
}

For versions 1 and 2, there is just a top level version, which makes sense since the structure of these enums evolves in lockstep with the outer enum. For Version 3, this allows nested objects in enums to have distinct versions naturally.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$variant is a little more explicit in how it handles enums. But since most of our enums fall into one of the three variants you described, I've gone ahead and removed $variant handling.

enum ErrorInner {
Light(Kind),
Heavy(anyhow::Error),
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I still really like the Light/Heavy idea, if we're honest it's probably overkill and will likely not get implemented ever.

Instead, the main goal I think is to differentiate crucial errors vs "recoverable" errors, where the latter is used for a probing API. In that context, we'd have something like:

struct Error {
    inner: anyhow::Error,
    recoverable: bool,
}

Most of the constructors would imply critical errors, with probing APIs using an additional (_recoverable) API. This keeps normal code simple and the probing API possible and reasonably efficient.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It did seem overkill. I've removed it in the latest commit.

"handle references file {:?} which is not registered in the manifest",
key,
)));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth adding a

if key.contains("..") || key_as_path.is_absolute() {
   return Err(/* ... */);
}

to keep files inside the directory.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Added it.

* Licensed under the MIT license.
*/

//! Adapters that expose a single in-hand [`Read`]/[`Write`] target as a
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we go ahead and start updating the exiting code to use io::Read/io::Write directly so we don't need this thing?

Copy link
Copy Markdown
Contributor Author

@suhasjs suhasjs May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By 'existing code', do you mean all code that uses Storage{Read/Write}Provider traits? Not sure I understood your ask here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the code that currently takes Storage{Read/Write}Provider does so by taking the provider and a file name and immediately opening the necessary file. These could all be pretty easily migrated to taking std::io::Read and std::io::Write directly, moving the file open up one level and avoiding the need for this hacky one-off provider.


pub fn get(&self, key: &str) -> Option<&Value<'a>> {
self.record.get(key)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the vein of moving enums to handling themselves, if we add:

impl<'a> Record<'a> {
    fn keys(&self) -> Keys<'_> {
        Keys::new(self.record.key())
    }
}

struct Keys<'a> {
    keys: std::collections::hash_map::Keys<'a, Cow<'a, str>, Value<'a>>,
}

impl<'a> Iterator for Keys<'a> {
    // iterate
}

impl ExactSizeIterator for Keys<'_> {}

Then enums could have a way of inspecting the keys for loading.

@arrayka arrayka linked an issue May 19, 2026 that may be closed by this pull request
suhasjs added 3 commits May 21, 2026 15:28
Resolve conflicts in Cargo.toml (versions + members), diskann-vector/Cargo.toml (dev-deps), diskann-providers/src/storage/bin.rs (single-read load_from_bin), diskann-providers/src/index/wrapped_async.rs (combine tests), diskann/src/graph/index.rs (combine impls + tests), and port FastMemoryVectorProviderAsync / MemoryVectorProviderAsync save/load impls to main's T: VectorRepr (formerly Data: GraphDataType in this branch).
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 27, 2026

Codecov Report

❌ Patch coverage is 78.64476% with 416 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.25%. Comparing base (e2dc9a0) to head (9bfa925).

Files with missing lines Patch % Lines
diskann-providers/src/storage/record_shim.rs 70.17% 68 Missing ⚠️
diskann-record/src/save/value.rs 71.25% 48 Missing ⚠️
diskann-record/src/load/context.rs 75.95% 44 Missing ⚠️
diskann-record/src/load/error.rs 51.42% 34 Missing ⚠️
diskann-record/src/lib.rs 86.61% 32 Missing ⚠️
diskann-record/src/save/context.rs 67.02% 31 Missing ⚠️
...roviders/src/model/graph/provider/async_/common.rs 73.19% 26 Missing ⚠️
diskann-record/src/save/error.rs 0.00% 24 Missing ⚠️
diskann/src/graph/config/mod.rs 83.84% 21 Missing ⚠️
diskann-record/src/number.rs 57.14% 18 Missing ⚠️
... and 12 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1079      +/-   ##
==========================================
- Coverage   89.46%   89.25%   -0.22%     
==========================================
  Files         482      493      +11     
  Lines       91082    93018    +1936     
==========================================
+ Hits        81491    83021    +1530     
- Misses       9591     9997     +406     
Flag Coverage Δ
miri 89.25% <78.64%> (-0.22%) ⬇️
unittests 88.90% <78.64%> (-0.22%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
diskann-providers/src/index/wrapped_async.rs 64.22% <100.00%> (+4.34%) ⬆️
diskann-providers/src/storage/bin.rs 93.65% <62.50%> (+1.58%) ⬆️
diskann-record/src/load/mod.rs 94.11% <94.11%> (ø)
...ers/src/model/configuration/index_configuration.rs 94.00% <92.06%> (+3.19%) ⬆️
...aph/provider/async_/fast_memory_vector_provider.rs 94.67% <93.90%> (-0.27%) ⬇️
.../src/model/graph/provider/async_/inmem/provider.rs 91.70% <95.00%> (+1.10%) ⬆️
...el/graph/provider/async_/memory_vector_provider.rs 96.62% <92.06%> (-1.66%) ⬇️
diskann/src/graph/config/experimental.rs 89.79% <77.27%> (-10.21%) ⬇️
diskann/src/graph/index.rs 95.91% <66.66%> (-0.25%) ⬇️
.../graph/provider/async_/simple_neighbor_provider.rs 94.44% <93.25%> (-0.03%) ⬇️
... and 13 more

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@hildebrandmw hildebrandmw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more drive-by comments.

Comment thread diskann-record/HANDOFF.md

Keys starting with `$` are reserved for infrastructure (`$version`, `$variant`,
`$handle`). Since Rust identifiers can't start with `$`, the `save_fields!` macro is
inherently safe — no runtime check needed in the macro path.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This struct can be created without any issues:
pub struct Test {
    version: String,
    handle: String,
    variant: String,
}
  1. The choice of programming language shouldn’t determine the field names in a public contract.

Comment on lines +59 to +60
"$version": {
"major": 0,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

version object repeats everywhere. It will bloat this file. Is it a concern? Should we consider tracking version at high level objects only?

Comment thread diskann-record/HANDOFF.md
}
```

The lifetime `'a` allows the save path to borrow from the data being saved (zero-copy
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a measurable benefit of introducing a lifetime here? I would say that config load/save is outside of hot path or at least its contribution is negligible to overall RAM/CPU usage, since config is tiny.


impl ContextInner {
pub(super) fn new(metadata: &Path, dir: &Path) -> Result<Self> {
let file = std::fs::File::open(metadata).map_err(|e| {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not code directly against the file system. Instead, use Storage{Read/Write}Provider abstraction. It allows to write directly to remote store and provides better testability of code.

Comment thread diskann-record/HANDOFF.md

## Remaining Work

### Value::Bytes Wiring
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we planning to store bytes in a configuration file?

Comment on lines +10 to +13
"$version": {
"major": 0,
"minor": 0,
"patch": 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think major/minor/patch is overkill. Each object should just have a single version number that is not tied to the overall version of diskann.

…/Deserialize<> for Version; updated sample_output
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

Document and extend index serialization format

5 participants