Introduce diskann-record crate to support serialization + deserialization of DiskANN indexes#1188
Introduce diskann-record crate to support serialization + deserialization of DiskANN indexes#1188suhasjs wants to merge 12 commits into
diskann-record crate to support serialization + deserialization of DiskANN indexes#1188Conversation
…oadable] traits [THIS PR] + impls for structs [TODO]
There was a problem hiding this comment.
⚠️ Not ready to approve
Sidecar artifact path handling allows unsafe/incorrect path shapes (including potential directory escape on save) and needs validation hardening before merging.
Pull request overview
Introduces a new diskann-record crate that defines a versioned JSON-manifest + sidecar-artifact framework for persisting DiskANN-related structures, including a Save/Load trait surface, wire-level value model, and basic round-trip tests. This is positioned as the foundational crate for the follow-up PR that will implement these traits for real index types.
Changes:
- Added
save+loadmodules withSave/LoadandSaveable/Loadabletraits, plussave_fields!/load_fields!macros. - Implemented wire types (
Value,Record,Handle), schemaVersion, and a losslessNumbercontainer for manifest numeric values. - Integrated the new crate into the workspace (members + workspace dependency) and added initial unit tests validating round-trips and handle escape rejection.
File summaries
| File | Description |
|---|---|
| diskann-record/src/lib.rs | Crate-level API/docs, reserved-key policy, 64-bit platform assertion, and end-to-end tests. |
| diskann-record/src/version.rs | Defines Version and its string serialization/deserialization form. |
| diskann-record/src/number.rs | Adds Number wire type and safe narrowing conversions. |
| diskann-record/src/save/mod.rs | Save-side traits, entry point, macros, and primitive Saveable impls. |
| diskann-record/src/save/context.rs | Save-side context and sidecar writer + manifest finalization logic. |
| diskann-record/src/save/error.rs | Save-side error wrapper. |
| diskann-record/src/save/value.rs | Wire-level Value/Record/Handle representations and serde behavior. |
| diskann-record/src/load/mod.rs | Load-side traits, entry point, macros, and primitive Loadable impls. |
| diskann-record/src/load/context.rs | Load-side context/object/array APIs and sidecar reader. |
| diskann-record/src/load/error.rs | Load-side error type and recoverable-vs-critical classification. |
| diskann-record/Cargo.toml | New crate manifest and dependencies. |
| Cargo.toml | Adds diskann-record to the workspace and workspace dependencies. |
| Cargo.lock | Records the new workspace package entry. |
Copilot's findings
- Files reviewed: 12/13 changed files
- Comments generated: 4
Note
Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is ❌ Your patch status has failed because the patch coverage (79.30%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #1188 +/- ##
==========================================
+ Coverage 89.46% 89.65% +0.18%
==========================================
Files 487 500 +13
Lines 92170 94482 +2312
==========================================
+ Hits 82460 84706 +2246
- Misses 9710 9776 +66
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks Suhas - I have one big architectural comment about allowing pluggable backend contexts that I think will address many of the concerns about how heavy this is as a dependency and support for VFS that I think is probably worth doing. Happy to help out if needed.
… and load paths now use ; feature gated serde and disk impls
…ite a Vec<struct> where each struct might produce a file with the same key
|
@hildebrandmw Requesting review for the following changes:
|
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks Suhas - one more round on the SaveContext/LoadContext traits. I think it would be helpful to implement a purely in-memory backend to see how it interactws with the reader/writers. Such a backend can work directly with Value and avoid pulling in serde entirely.
| /// [`Writer::finish`] flushes the buffer, closes the file, and returns a [`Handle`]. | ||
| #[derive(Debug)] | ||
| pub struct Writer<'a> { | ||
| io: BufWriter<File>, |
There was a problem hiding this comment.
One thing you might find useful when working through the backend design is developing an in-memory only implementation of SaveContext and LoadContext ;)
There was a problem hiding this comment.
Adding the in-memory only impls actually improved the organization of the code now. Previously, the disk-variants of SaveContext and LoadContext were part of the {load|save}/context.rs. After adding the in-memory, I realized I could abstract that into a Backend enum and moved the two backends to backend/{memory|disk}.rs.
This keeps the Context traits in load|save directories and puts the impls under backend.
| // the same `key` (or omitting it) still yields a unique file name. | ||
| let name = match key { | ||
| Some(key) => format!("{:03}-{}", files.len(), key), | ||
| None => format!("{:03}", files.len()), |
There was a problem hiding this comment.
Maybe include a little more information in the file name? Like 001-record.bin?
There was a problem hiding this comment.
I thought about adding the .bin suffix, but then decided against it because we already have a .bin format that is pervasive throughout the repo, and there's no guarantee that the the save impl for a struct will look anything like the .bin file we know.
| name, | ||
| ))); | ||
| } | ||
| let full = self.dir.join(&name); |
There was a problem hiding this comment.
We can consider using cap_std to systematically force files to be generated in a subdirectory instead of relying on path name manipulation. It also adds a bit of safety to the saving process.
There was a problem hiding this comment.
Doesn't that also pull in a whole new crate? Right now, we only depend on anyhow. Do you think it's worth adding?
…ls into it now; added in-memory ONLY variant of SaveContext and LoadContext --> moved to backend/; added an enum Backend to choose between Disk*Context and InMemory*Context; moved Disk*Context to backend/
…ed WriterInner impls for DiskWriter and MemoryWriter; renamed InMemoryContext -> MemoryContext (same for InMemorySaveContext)
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks Suhas, this is coming together. I love how light-weight it is getting when the disk backend and serde are excluded. I have a few higher-level comments. Mostly about testing and fortifying the unhappy paths in addition to the happy paths.
| /// | ||
| /// The generic [`save`](super::save) entry point is parameterized over this trait so | ||
| /// that the base crate carries no hard dependency on any particular implementation. | ||
| pub trait SaveContext { |
There was a problem hiding this comment.
What are your thoughts on intending this to be a public extension point from the get-go versus keeping it pub(crate) for now? The issue with it currently is that it's public, but write returns a Writer and Writer::new is just pub(super). So external users cannot implement it. Even if they could, they would still need to create a Handle, which I firmly believe should not have a public constructor. It might be a better idea in the short-term to keep this (and LoadContext) pub(crate) at least initially. It's better to go from private to public than the other way.
There was a problem hiding this comment.
I want to keep it public and not pub(crate). I see the issue though. We'll need to also make WriteInner pub as well. For now, I'll make this pub(crate) and we can open it up later if there's a need for it.
There was a problem hiding this comment.
I want to keep it public and not pub(crate). I see the issue though. We'll need to also make WriteInner pub as well. For now, I'll make this pub(crate) and we can open it up later if there's a need for it.
| Bool(bool), | ||
| Number(Number), | ||
| String(Cow<'a, str>), | ||
| Bytes(Cow<'a, [u8]>), |
There was a problem hiding this comment.
I do think Bytes has value, but maybe not in its current form (where it will be store in JSON as integers rather than as a byte-string). We could either remove it, or figure out how to get it represented in binary form in JSON. Thoughts?
| pub enum Number { | ||
| U64(u64), | ||
| I64(i64), | ||
| F64(f64), |
There was a problem hiding this comment.
Something subtle here is that this silently loses data when the float64 value is infinity or NaN, as serde_json does not support these representations. We can support this by checking at for these values and using strings "inf", "neg_ing", and "nan" (and then teaching the deserializer to process these values).
| } | ||
| // Reserve the name so the count advances and concurrent writers cannot collide; | ||
| // the placeholder is overwritten with the real bytes by `Writer::finish`. | ||
| files.insert(name.clone(), Vec::new()); |
There was a problem hiding this comment.
Both backends do this to some extent, but it's maybe a little worse here. Eagerly registering Vec::new() means we cannot differentiate between a writer that was created but dropped before properly finishing, and a valid writer that just didn't write everything.
It's better to make files Mutex<HashMap<String, Option<Vec<u8>>>>. When we first pass out a file, set the corresponding value to None and only over-write it to Vec when finishing. This lets us differentiate properly between successfully completed files. As a bonus, we can also check in SaveContext::finish that all created files finished correctly and error out if this is not the case.
For the disk backend, we can do the same thing.
| "artifact file name hint {:?} must be a relative file name with no path \ | ||
| separators", | ||
| key, | ||
| ))); |
There was a problem hiding this comment.
This behavior diverges from the disk backend. Can we make name mangling infallible like the disk path?
| return Err(save::Error::message(format!( | ||
| "file {} already exists", | ||
| full.display() | ||
| ))); |
There was a problem hiding this comment.
If a previous save fails for some reason, it will leave around junk which will then make a second round fail. I think defensively not over-writing is the call, but do we want to support some amount of cleanup on failure?
| if version == T::VERSION { | ||
| T::load(object) | ||
| } else { | ||
| T::load_legacy(object) |
There was a problem hiding this comment.
Many of the built-in impls are not tested for round trippability, and this load_legacy path is completely uncovered. Please add tests for the unhappy paths as well and verify that load_legacy works as expected.
| @@ -0,0 +1,14 @@ | |||
| # DiskANN Record | |||
|
|
|||
| This crate provides a small framework for persisting structured Rust values as a | |||
There was a problem hiding this comment.
what are "structured rust values"
| field-by-field plumbing for plain structs. Every record carries a `Version` so loaders | ||
| can detect schema changes and either upgrade or fall back through a probing chain. | ||
|
|
||
| The goal is to allow crates like `diskann` to checkpoint their state without depending on |
There was a problem hiding this comment.
Are manifest and binary artifacts compatible across providers or defined per provider?
| @@ -0,0 +1,453 @@ | |||
| /* | |||
| * Copyright (c) Microsoft Corporation. | |||
There was a problem hiding this comment.
could the primitives in number.rs, value.rs, and version.rs be used from some existing external crate. Is there a strong reason to define this here
harsha-simhadri
left a comment
There was a problem hiding this comment.
couple of comments inline. will read backed code once you have final updates
harsha-simhadri
left a comment
There was a problem hiding this comment.
couple of comments inline. will read backed code once you have final updates
This PR is part 1/2 of #1079 and only introduces the
diskann-recordcrate. See #1079 for sample output when thesave::Saveandload::Loadtraits from this crate are implemented for a simple in-memory index.Reference Issues/PRs
Part 1/2 of #1079. Also see #737.
What does this implement/fix? Briefly explain your changes.
Introduces two new traits (
diskann_record::save::Saveanddiskann_record::load::Load)and a new creatediskann-recordto support file-based serialization + deserialization.Any other comments?