Introduce diskann-record crate to support serialization + deserialization of DiskANN indexes #1188

Open
suhasjs wants to merge 12 commits into main from users/suhasja/saveload-core

@suhasjs suhasjs commented Jun 18, 2026

Contributor

This PR is part 1/2 of #1079 and only introduces the diskann-record crate. See #1079 for sample output when the save::Save and load::Load traits from this crate are implemented for a simple in-memory index.

  • Does this PR have a descriptive title that could go in our release notes?
  • Does this PR add any new dependencies?
  • Does this PR modify any existing APIs?
  • Is the change to the API backwards compatible?
  • Should this result in any changes to our documentation, either updating existing docs or adding new ones?

Reference Issues/PRs

Part 1/2 of #1079. Also see #737.

What does this implement/fix? Briefly explain your changes.

Introduces two new traits (diskann_record::save::Save and diskann_record::load::Load) and a new crate, diskann-record, to support file-based serialization + deserialization.

Any other comments?

…oadable] traits [THIS PR] + impls for structs [TODO]
@suhasjs suhasjs self-assigned this Jun 18, 2026
@suhasjs suhasjs requested review from a team and Copilot June 18, 2026 17:52
@suhasjs suhasjs added the enhancement (New feature or request) and rust (Pull requests that update rust code) labels Jun 18, 2026
@suhasjs suhasjs moved this to Done in DiskANN backlog Jun 18, 2026

Copilot AI left a comment

Contributor

⚠️ Not ready to approve

Sidecar artifact path handling allows unsafe/incorrect path shapes (including potential directory escape on save) and needs validation hardening before merging.

Pull request overview

Introduces a new diskann-record crate that defines a versioned JSON-manifest + sidecar-artifact framework for persisting DiskANN-related structures, including a Save/Load trait surface, wire-level value model, and basic round-trip tests. This is positioned as the foundational crate for the follow-up PR that will implement these traits for real index types.

Changes:

  • Added save + load modules with Save/Load and Saveable/Loadable traits, plus save_fields! / load_fields! macros.
  • Implemented wire types (Value, Record, Handle), schema Version, and a lossless Number container for manifest numeric values.
  • Integrated the new crate into the workspace (members + workspace dependency) and added initial unit tests validating round-trips and handle escape rejection.

File summaries

| File | Description |
| --- | --- |
| diskann-record/src/lib.rs | Crate-level API/docs, reserved-key policy, 64-bit platform assertion, and end-to-end tests. |
| diskann-record/src/version.rs | Defines Version and its string serialization/deserialization form. |
| diskann-record/src/number.rs | Adds Number wire type and safe narrowing conversions. |
| diskann-record/src/save/mod.rs | Save-side traits, entry point, macros, and primitive Saveable impls. |
| diskann-record/src/save/context.rs | Save-side context and sidecar writer + manifest finalization logic. |
| diskann-record/src/save/error.rs | Save-side error wrapper. |
| diskann-record/src/save/value.rs | Wire-level Value/Record/Handle representations and serde behavior. |
| diskann-record/src/load/mod.rs | Load-side traits, entry point, macros, and primitive Loadable impls. |
| diskann-record/src/load/context.rs | Load-side context/object/array APIs and sidecar reader. |
| diskann-record/src/load/error.rs | Load-side error type and recoverable-vs-critical classification. |
| diskann-record/Cargo.toml | New crate manifest and dependencies. |
| Cargo.toml | Adds diskann-record to the workspace and workspace dependencies. |
| Cargo.lock | Records the new workspace package entry. |

Copilot's findings

  • Files reviewed: 12/13 changed files
  • Comments generated: 4

Comment thread diskann-record/src/save/context.rs Outdated
Comment thread diskann-record/src/save/context.rs Outdated
Comment thread diskann-record/src/load/context.rs Outdated
Comment thread diskann-record/src/save/mod.rs

codecov-commenter commented Jun 18, 2026

Codecov Report

❌ Patch coverage is 79.30769% with 269 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.65%. Comparing base (3aa44ac) to head (6c51b91).
⚠️ Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| diskann-record/src/load/context.rs | 67.78% | 48 Missing ⚠️ |
| diskann-record/src/load/error.rs | 53.01% | 39 Missing ⚠️ |
| diskann-record/src/value.rs | 82.11% | 39 Missing ⚠️ |
| diskann-record/src/backend/disk.rs | 83.83% | 32 Missing ⚠️ |
| diskann-record/src/lib.rs | 87.02% | 31 Missing ⚠️ |
| diskann-record/src/save/context.rs | 58.82% | 21 Missing ⚠️ |
| diskann-record/src/save/mod.rs | 70.49% | 18 Missing ⚠️ |
| diskann-record/src/number.rs | 71.18% | 17 Missing ⚠️ |
| diskann-record/src/backend/memory.rs | 86.66% | 14 Missing ⚠️ |
| diskann-record/src/load/mod.rs | 87.93% | 7 Missing ⚠️ |

... and 1 more

❌ Your patch status has failed because the patch coverage (79.30%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1188      +/-   ##
==========================================
+ Coverage   89.46%   89.65%   +0.18%     
==========================================
  Files         487      500      +13     
  Lines       92170    94482    +2312     
==========================================
+ Hits        82460    84706    +2246     
- Misses       9710     9776      +66     

| Flag | Coverage Δ |
| --- | --- |
| miri | 89.65% <79.30%> (+0.18%) ⬆️ |
| unittests | 89.31% <79.30%> (+0.19%) ⬆️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| diskann-record/src/save/error.rs | 100.00% <100.00%> (ø) |
| diskann-record/src/version.rs | 94.54% <94.54%> (ø) |
| diskann-record/src/load/mod.rs | 87.93% <87.93%> (ø) |
| diskann-record/src/backend/memory.rs | 86.66% <86.66%> (ø) |
| diskann-record/src/number.rs | 71.18% <71.18%> (ø) |
| diskann-record/src/save/mod.rs | 70.49% <70.49%> (ø) |
| diskann-record/src/save/context.rs | 58.82% <58.82%> (ø) |
| diskann-record/src/lib.rs | 87.02% <87.02%> (ø) |
| diskann-record/src/backend/disk.rs | 83.83% <83.83%> (ø) |
| diskann-record/src/load/error.rs | 53.01% <53.01%> (ø) |

... and 2 more

... and 27 files with indirect coverage changes

@hildebrandmw hildebrandmw left a comment

Contributor

Thanks Suhas - I have one big architectural comment about allowing pluggable backend contexts that I think will address many of the concerns about how heavy this is as a dependency and support for VFS that I think is probably worth doing. Happy to help out if needed.

Comment thread diskann-record/src/save/context.rs Outdated
Comment thread diskann-record/src/value.rs

suhasjs commented Jun 22, 2026

Contributor Author

@hildebrandmw Requesting review for the following changes:

  • Moved up value.rs to reflect the shared usage between load/save paths
  • Trait objects to enable VFS pluggability (with a DiskContext impl for disk-based serialization, feature gated)
  • Changed SaveContext::write to take Option<&str>, treating the input key as a hint. The filename is now INTEGER-{key}, where INTEGER is the number of artifacts written so far. Generating a random value would pull in rand, and I didn't want to add that dependency.
  • Improved test coverage

@hildebrandmw hildebrandmw left a comment

Contributor

Thanks Suhas - one more round on the SaveContext/LoadContext traits. I think it would be helpful to implement a purely in-memory backend to see how it interacts with the readers/writers. Such a backend can work directly with Value and avoid pulling in serde entirely.

Comment thread diskann-record/src/load/context.rs Outdated
Comment thread diskann-record/src/load/context.rs Outdated
Comment thread diskann-record/src/version.rs Outdated
Comment thread diskann-record/src/save/context.rs Outdated
/// [`Writer::finish`] flushes the buffer, closes the file, and returns a [`Handle`].
#[derive(Debug)]
pub struct Writer<'a> {
    io: BufWriter<File>,

Contributor

One thing you might find useful when working through the backend design is developing an in-memory only implementation of SaveContext and LoadContext ;)

Contributor Author

Adding the in-memory-only impls actually improved the organization of the code. Previously, the disk variants of SaveContext and LoadContext lived in {load|save}/context.rs. After adding the in-memory backend, I realized I could abstract the choice into a Backend enum, so I moved the two backends to backend/{memory|disk}.rs.

This keeps the Context traits in load|save directories and puts the impls under backend.

Comment thread diskann-record/src/save/context.rs Outdated
Comment thread diskann-record/src/save/context.rs Outdated
// the same `key` (or omitting it) still yields a unique file name.
let name = match key {
    Some(key) => format!("{:03}-{}", files.len(), key),
    None => format!("{:03}", files.len()),

Contributor

Maybe include a little more information in the file name? Like 001-record.bin?

Contributor Author

I thought about adding the .bin suffix, but decided against it because we already have a .bin format that is pervasive throughout the repo, and there's no guarantee that the save impl for a struct will look anything like the .bin file we know.

Comment thread diskann-record/src/save/context.rs Outdated
name,
)));
}
let full = self.dir.join(&name);

Contributor

We can consider using cap_std to systematically force files to be generated in a subdirectory instead of relying on path name manipulation. It also adds a bit of safety to the saving process.

Contributor Author

Doesn't that also pull in a whole new crate? Right now, we only depend on anyhow. Do you think it's worth adding?

suhasjs added 5 commits June 23, 2026 14:20
…ls into it now; added in-memory ONLY variant of SaveContext and LoadContext --> moved to backend/; added an enum Backend to choose between Disk*Context and InMemory*Context; moved Disk*Context to backend/
…ed WriterInner impls for DiskWriter and MemoryWriter; renamed InMemoryContext -> MemoryContext (same for InMemorySaveContext)

@hildebrandmw hildebrandmw left a comment

Contributor

Thanks Suhas, this is coming together. I love how light-weight it is getting when the disk backend and serde are excluded. I have a few higher-level comments. Mostly about testing and fortifying the unhappy paths in addition to the happy paths.

///
/// The generic [`save`](super::save) entry point is parameterized over this trait so
/// that the base crate carries no hard dependency on any particular implementation.
pub trait SaveContext {

Contributor

What are your thoughts on intending this to be a public extension point from the get-go versus keeping it pub(crate) for now? The issue with it currently is that it's public, but write returns a Writer and Writer::new is just pub(super). So external users cannot implement it. Even if they could, they would still need to create a Handle, which I firmly believe should not have a public constructor. It might be a better idea in the short-term to keep this (and LoadContext) pub(crate) at least initially. It's better to go from private to public than the other way.

Contributor Author

I want to keep it public and not pub(crate). I see the issue though: we would also need to make WriterInner pub. For now, I'll make this pub(crate) and we can open it up later if there's a need for it.

Bool(bool),
Number(Number),
String(Cow<'a, str>),
Bytes(Cow<'a, [u8]>),

Contributor

I do think Bytes has value, but maybe not in its current form (where it will be stored in JSON as an array of integers rather than as a byte string). We could either remove it, or figure out how to get it represented in binary form in JSON. Thoughts?
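
The comment above leaves the encoding open. Purely as an illustration of one option (not something this PR does), a byte payload could be stored as a hex string instead of a list of integers. The function names `bytes_to_hex` and `hex_to_bytes` are hypothetical, and base64 would be an equally valid choice with a smaller footprint:

```rust
/// Encode bytes as a lowercase hex string (two characters per byte).
pub fn bytes_to_hex(bytes: &[u8]) -> String {
    const DIGITS: &[u8; 16] = b"0123456789abcdef";
    let mut out = String::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(DIGITS[(b >> 4) as usize] as char);
        out.push(DIGITS[(b & 0x0f) as usize] as char);
    }
    out
}

/// Decode a hex string produced by `bytes_to_hex` (upper or lower case).
pub fn hex_to_bytes(text: &str) -> Result<Vec<u8>, String> {
    let raw = text.as_bytes();
    if raw.len() % 2 != 0 {
        return Err("hex string must have an even length".to_string());
    }
    raw.chunks(2)
        .map(|pair| -> Result<u8, String> {
            // `to_digit(16)` rejects anything that is not 0-9, a-f, or A-F.
            let hi = (pair[0] as char).to_digit(16).ok_or("invalid hex digit")?;
            let lo = (pair[1] as char).to_digit(16).ok_or("invalid hex digit")?;
            Ok((hi * 16 + lo) as u8)
        })
        .collect()
}
```

A hex string costs two characters per byte, which is still more compact in JSON than a comma-separated list of decimal integers for most payloads.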

pub enum Number {
    U64(u64),
    I64(i64),
    F64(f64),

Contributor

Something subtle here is that this silently loses data when the float64 value is infinity or NaN, as serde_json does not support these representations. We can support this by checking for these values and using the strings "inf", "neg_inf", and "nan" (and then teaching the deserializer to process these values).
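
A minimal sketch of the string-tag idea described above, assuming the tags are spelled "inf", "neg_inf", and "nan". `FloatToken`, `encode_f64`, and `decode_f64` are hypothetical names used only for illustration and are not part of the crate:

```rust
/// Hypothetical wire token for an f64: a finite number or a string tag.
#[derive(Debug, Clone, PartialEq)]
pub enum FloatToken {
    Finite(f64),
    Tag(&'static str),
}

/// Map non-finite values to string tags so they survive a JSON round trip.
pub fn encode_f64(x: f64) -> FloatToken {
    if x.is_nan() {
        FloatToken::Tag("nan")
    } else if x == f64::INFINITY {
        FloatToken::Tag("inf")
    } else if x == f64::NEG_INFINITY {
        FloatToken::Tag("neg_inf")
    } else {
        FloatToken::Finite(x)
    }
}

/// Inverse of `encode_f64`; unknown tags are rejected instead of guessed.
pub fn decode_f64(token: &FloatToken) -> Result<f64, String> {
    match token {
        FloatToken::Finite(x) if x.is_finite() => Ok(*x),
        FloatToken::Finite(x) => Err(format!("non-finite value {} must use a string tag", x)),
        FloatToken::Tag("nan") => Ok(f64::NAN),
        FloatToken::Tag("inf") => Ok(f64::INFINITY),
        FloatToken::Tag("neg_inf") => Ok(f64::NEG_INFINITY),
        FloatToken::Tag(other) => Err(format!("unknown float tag {:?}", other)),
    }
}
```

Note that this scheme collapses every NaN to the canonical `f64::NAN`, so NaN payload bits are not preserved.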

}
// Reserve the name so the count advances and concurrent writers cannot collide;
// the placeholder is overwritten with the real bytes by `Writer::finish`.
files.insert(name.clone(), Vec::new());

Contributor

Both backends do this to some extent, but it's maybe a little worse here. Eagerly registering Vec::new() means we cannot differentiate between a writer that was created but dropped before properly finishing, and a valid writer that just didn't write everything.

It's better to make files Mutex<HashMap<String, Option<Vec<u8>>>>. When we first pass out a file, set the corresponding value to None and only over-write it to Vec when finishing. This lets us differentiate properly between successfully completed files. As a bonus, we can also check in SaveContext::finish that all created files finished correctly and error out if this is not the case.

For the disk backend, we can do the same thing.
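
The `Option<Vec<u8>>` suggestion above could look roughly like this. It is a sketch under assumptions: `MemoryFiles`, `reserve`, `finish_file`, and `finish` are hypothetical names, and errors are plain `String`s rather than the crate's `save::Error`:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical in-memory store; `None` marks a reserved but unfinished artifact.
#[derive(Default)]
pub struct MemoryFiles {
    files: Mutex<HashMap<String, Option<Vec<u8>>>>,
}

impl MemoryFiles {
    /// Reserve `name` for a writer; fails if the name is already taken.
    pub fn reserve(&self, name: &str) -> Result<(), String> {
        let mut files = self.files.lock().unwrap();
        if files.contains_key(name) {
            return Err(format!("artifact {:?} already exists", name));
        }
        files.insert(name.to_owned(), None);
        Ok(())
    }

    /// Store the final bytes for a reserved name (what a writer's finish would call).
    pub fn finish_file(&self, name: &str, bytes: Vec<u8>) -> Result<(), String> {
        let mut files = self.files.lock().unwrap();
        match files.get_mut(name) {
            None => Err(format!("artifact {:?} was never reserved", name)),
            Some(Some(_)) => Err(format!("artifact {:?} was already finished", name)),
            Some(slot) => {
                *slot = Some(bytes);
                Ok(())
            }
        }
    }

    /// Context-level finish: error out if any reserved artifact never finished.
    pub fn finish(&self) -> Result<usize, String> {
        let files = self.files.lock().unwrap();
        let mut unfinished: Vec<&str> = files
            .iter()
            .filter(|(_, bytes)| bytes.is_none())
            .map(|(name, _)| name.as_str())
            .collect();
        if unfinished.is_empty() {
            Ok(files.len())
        } else {
            unfinished.sort();
            Err(format!("unfinished artifacts: {:?}", unfinished))
        }
    }
}
```

With this shape, an artifact that finished with zero bytes (`Some(vec![])`) stays distinguishable from one whose writer was dropped early (`None`).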

"artifact file name hint {:?} must be a relative file name with no path \
separators",
key,
)));

Contributor

This behavior diverges from the disk backend. Can we make name mangling infallible like the disk path?
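
One way to make name mangling infallible, sketched under assumptions: instead of rejecting a hint that contains path separators, replace every character outside a small allow-list. The function name `artifact_name` and the replacement policy are hypothetical; only the `{:03}-{key}` shape comes from the diff earlier in this thread:

```rust
/// Build an artifact file name from a running count and an optional hint.
/// Any hint character outside [A-Za-z0-9_-] is replaced with '_', so the
/// result never contains path separators or "." components.
pub fn artifact_name(count: usize, hint: Option<&str>) -> String {
    match hint {
        Some(hint) if !hint.is_empty() => {
            let safe: String = hint
                .chars()
                .map(|c| {
                    if c.is_ascii_alphanumeric() || c == '-' || c == '_' {
                        c
                    } else {
                        '_'
                    }
                })
                .collect();
            format!("{:03}-{}", count, safe)
        }
        // No usable hint: the counter alone is the name.
        _ => format!("{:03}", count),
    }
}
```

Uniqueness still comes from the counter prefix, so two hints that sanitize to the same string do not collide as long as the count advances for every artifact.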

return Err(save::Error::message(format!(
    "file {} already exists",
    full.display()
)));

Contributor

If a previous save fails for some reason, it will leave around junk which will then make a second round fail. I think defensively not over-writing is the call, but do we want to support some amount of cleanup on failure?
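
A possible shape for cleanup on failure, offered only as a sketch: a guard records every file it creates and removes them on drop unless the save is committed. `SaveGuard` and its methods are hypothetical names; only `std::fs` calls are used:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical guard that tracks files created during one save attempt.
pub struct SaveGuard {
    created: Vec<PathBuf>,
    committed: bool,
}

impl SaveGuard {
    pub fn new() -> Self {
        SaveGuard { created: Vec::new(), committed: false }
    }

    /// Create `path` without overwriting an existing file, and remember it.
    /// A file that already exists is never recorded, so it is never deleted.
    pub fn create(&mut self, path: &Path) -> io::Result<fs::File> {
        let file = fs::OpenOptions::new()
            .write(true)
            .create_new(true)
            .open(path)?;
        self.created.push(path.to_path_buf());
        Ok(file)
    }

    /// Mark the save as successful so the created files are kept.
    pub fn commit(mut self) {
        self.committed = true;
    }
}

impl Drop for SaveGuard {
    fn drop(&mut self) {
        if !self.committed {
            for path in &self.created {
                // Best effort: ignore errors while cleaning up.
                let _ = fs::remove_file(path);
            }
        }
    }
}
```

This keeps the defensive no-overwrite behavior while letting a retry succeed after a failed attempt, because the junk from the failed attempt is gone.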

if version == T::VERSION {
    T::load(object)
} else {
    T::load_legacy(object)

Contributor

Many of the built-in impls are not tested for round trippability, and this load_legacy path is completely uncovered. Please add tests for the unhappy paths as well and verify that load_legacy works as expected.
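
To make the requested coverage concrete, here is a self-contained sketch of the version dispatch from the hunk above together with a type that accepts one legacy layout. `Version`, `Object`, `Load`, `Params`, `load_versioned`, and `object` are simplified stand-ins invented for this example, not the crate's real definitions:

```rust
use std::collections::HashMap;

/// Simplified stand-in for the crate's version type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Version(pub u32, pub u32);

/// Simplified stand-in for a loaded record: a version plus integer fields.
pub struct Object {
    pub version: Version,
    pub fields: HashMap<String, u64>,
}

pub trait Load: Sized {
    const VERSION: Version;
    fn load(object: &Object) -> Result<Self, String>;
    /// Default: refuse anything that is not the current version.
    fn load_legacy(object: &Object) -> Result<Self, String> {
        Err(format!("unsupported version {:?}", object.version))
    }
}

/// Same dispatch as in the diff: current version or the legacy path.
pub fn load_versioned<T: Load>(object: &Object) -> Result<T, String> {
    if object.version == T::VERSION {
        T::load(object)
    } else {
        T::load_legacy(object)
    }
}

#[derive(Debug, PartialEq)]
pub struct Params {
    pub degree: u64,
}

impl Load for Params {
    const VERSION: Version = Version(2, 0);

    fn load(object: &Object) -> Result<Self, String> {
        let degree = *object.fields.get("degree").ok_or("missing field `degree`")?;
        Ok(Params { degree })
    }

    /// Version 1.0 stored the same value under the key `max_degree`.
    fn load_legacy(object: &Object) -> Result<Self, String> {
        if object.version != Version(1, 0) {
            return Err(format!("unsupported version {:?}", object.version));
        }
        let degree = *object.fields.get("max_degree").ok_or("missing field `max_degree`")?;
        Ok(Params { degree })
    }
}

/// Helper for building test records.
pub fn object(version: Version, fields: &[(&str, u64)]) -> Object {
    Object {
        version,
        fields: fields.iter().map(|(k, v)| (k.to_string(), *v)).collect(),
    }
}
```

Tests in this style would cover the current version, a supported legacy version, an unknown version, and a record with a missing field, which exercises both the happy and unhappy paths of the dispatch.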

Comment thread diskann-record/README.md
@@ -0,0 +1,14 @@
# DiskANN Record

This crate provides a small framework for persisting structured Rust values as a

Contributor

What are "structured Rust values"?

Comment thread diskann-record/README.md
field-by-field plumbing for plain structs. Every record carries a `Version` so loaders
can detect schema changes and either upgrade or fall back through a probing chain.

The goal is to allow crates like `diskann` to checkpoint their state without depending on

Contributor

Are manifest and binary artifacts compatible across providers or defined per provider?

@@ -0,0 +1,453 @@
/*
* Copyright (c) Microsoft Corporation.

Contributor

Could the primitives in number.rs, value.rs, and version.rs come from an existing external crate? Is there a strong reason to define them here?

@harsha-simhadri harsha-simhadri left a comment

Contributor

Couple of comments inline. Will read the backend code once you have final updates.

Labels

enhancement (New feature or request), rust (Pull requests that update rust code)

Projects

Status: Done

5 participants