Benchmarking json schemas by JordanMaples · Pull Request #1154 · microsoft/DiskANN

JordanMaples · 2026-06-12T18:42:06Z

To make the available options for our JSON inputs clearer, Mark and I discussed walking the ASTs of the JSON body using schemars and rendering a breakdown of the possible options. This PR adds JSON Schema documentation for benchmark inputs via the --schema and --field options.

Summary

Adds schemars-based JSON Schema generation and a custom tree-style terminal renderer so users can discover benchmark input fields without reading source code.

Usage

Full schema for an input type

cargo run --release -p diskann-benchmark -- inputs graph-index-build --schema

Drill into a specific field

cargo run --release -p diskann-benchmark -- inputs graph-index-build --field source.start_point_strategy
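
How a dotted --field path such as source.start_point_strategy might be resolved can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's resolver: `Node`, `resolve`, and `sample` are hypothetical names, and the real code presumably walks the schemars-generated JSON rather than this simplified tree.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a schema node.
#[derive(Debug, PartialEq)]
enum Node {
    /// A leaf carrying a type summary such as "number".
    Leaf(&'static str),
    /// An object with named properties.
    Object(BTreeMap<&'static str, Node>),
}

/// Walk a dotted path such as "build.alpha". Returns `None` when a segment
/// is missing or when the path tries to descend into a leaf.
fn resolve<'a>(root: &'a Node, path: &str) -> Option<&'a Node> {
    path.split('.').try_fold(root, |node, segment| match node {
        Node::Object(props) => props.get(segment),
        Node::Leaf(_) => None,
    })
}

/// A tiny tree loosely modelled on the sample output (illustrative only).
fn sample() -> Node {
    let build = BTreeMap::from([
        ("alpha", Node::Leaf("number")),
        ("max_degree", Node::Leaf("integer")),
    ]);
    Node::Object(BTreeMap::from([
        ("build", Node::Object(build)),
        ("seed", Node::Leaf("integer")),
    ]))
}
```

With this sketch, `resolve(&sample(), "build.alpha")` yields the `alpha` leaf, while a misspelled segment yields `None`, which a CLI could turn into an error listing the valid field names.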

What's included

schemars 1.2 added to workspace; JsonSchema derived on all input types
Custom renderer (diskann-benchmark-runner/src/schema.rs) with:
  - Colored terminal output (bold field names, cyan types, yellow enum variants)
  - Multi-line description alignment
  - Handles internally/externally-tagged enums and $ref newtypes
  - MAX_DEPTH guard against recursive schemas
Manual JsonSchema impls for custom-serde types:
  - NonNegativeFinite (number with minimum)
  - StartPointStrategyRef (externally-tagged enum, with drift test)
  - QuantizationTypeSchema proxy (keeps schemars out of diskann-disk)
JsonSchema bound added to Input::Raw trait
README documentation for the new --schema and --field options
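
The idea of a tree-style renderer with a depth guard can be sketched as below. This is not the renderer from schema.rs: `Field`, `leaf`, and `render` are hypothetical names, the `MAX_DEPTH` value is an assumption (the PR's actual limit is not shown here), and colors and descriptions are omitted.

```rust
/// Assumed depth limit; the PR's actual MAX_DEPTH value is not shown.
const MAX_DEPTH: usize = 8;

/// Hypothetical, simplified schema node: a name, a type summary, children.
struct Field {
    name: String,
    summary: String,
    children: Vec<Field>,
}

/// Convenience constructor for a node without children.
fn leaf(name: &str, summary: &str) -> Field {
    Field {
        name: name.to_string(),
        summary: summary.to_string(),
        children: Vec::new(),
    }
}

/// Append one line per field to `out`, indenting children under their parent.
fn render(fields: &[Field], prefix: &str, depth: usize, out: &mut String) {
    if fields.is_empty() {
        return;
    }
    // Guard against recursive (or simply very deep) schemas.
    if depth >= MAX_DEPTH {
        out.push_str(prefix);
        out.push_str("└── …\n");
        return;
    }
    for (i, field) in fields.iter().enumerate() {
        let is_last = i + 1 == fields.len();
        let branch = if is_last { "└── " } else { "├── " };
        out.push_str(&format!(
            "{}{}{}: {}\n",
            prefix, branch, field.name, field.summary
        ));
        // Keep the vertical bar only while siblings remain below this node.
        let continuation = if is_last { "    " } else { "│   " };
        let child_prefix = format!("{}{}", prefix, continuation);
        render(&field.children, &child_prefix, depth + 1, out);
    }
}
```

Writing into a `String` rather than printing directly keeps the output easy to compare against an expected snapshot in a test.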

Sample output

Full schema

cargo run -p diskann-benchmark --all-features -- inputs --schema graph-index-build-bftree-spherical-quantization

Schema for "graph-index-build-bftree-spherical-quantization":

├── build: object
    ├── alpha: number
    │
    ├── backedge_ratio: number
    │
    ├── data: string
    │   # A file that is used as an input to for a benchmark.
    │
    ├── data_type: one of ["float64", "float32", "float16", "uint8", "uint16", "uint32", "uint64", "int8", "int16", "int32", "int64", "bool"]
    │   # An enum representation for common DiskANN data types.
    │
    ├── distance: one of ["squared_l2", "inner_product", "cosine", "cosine_normalized"]
    │
    ├── insert_retry (optional): (any of)
        ├── num_insert_attempts: integer (≥1)
        │
        ├── retry_threshold: number
        │
        └── saturate_inserts: boolean
    │
    ├── l_build: integer (≥0)
    │
    ├── max_degree: integer (≥0)
    │
    ├── multi_insert (optional): (any of)
        ├── batch_parallelism: integer (≥1)
        │
        ├── batch_size: integer (≥1)
        │
        └── intra_batch_candidates: (one of)
            # A one-to-one correspondence with [`diskann::index::config::IntraBatchCandidates`].
            ├─ "none" — No intra-batch candidates will be considered.
            ├─ "max"
            └─ "all" — Consider all elements in the batch for intra-batch candidates.
    │
    ├── num_threads: integer (≥0)
    │
    ├── save_path: any (optional)
    │
    └── start_point_strategy: (one of)
        # Strategy for selecting graph start points.
        ├─ "medoid" — Use the medoid as the starting point.
        ├─ "first_vector" — Use the first vector in the dataset.
        ├─ "random_vectors" — Randomly select vector(s) with given norm.
        │  ├── norm: number
        │  ├── nsamples: integer (≥1)
        │  └── seed: integer
        ├─ "random_samples" — Sample data from the dataset.
        │  ├── nsamples: integer (≥1)
        │  └── seed: integer
        └─ "latin_hyper_cube" — Use Latin Hypercube sampling.
           ├── nsamples: integer (≥1)
           └── seed: integer
│
├── neighbor_store_config (optional): (any of)
    ├── cache_only: any (optional)
    │   # If true, only use the in-memory circular buffer (no disk pages).
    │
    ├── cb_copy_on_access_ratio: any (optional)
    │   # Ratio of buffer used before copy-on-access kicks in.
    │
    ├── cb_max_record_size: any (optional)
    │   # Maximum record size that can be stored in the circular buffer.
    │
    ├── cb_min_record_size: any (optional)
    │   # Minimum record size for the circular buffer.
    │
    ├── cb_size_byte: integer (≥0)
    │   # Size of the circular buffer (in-memory write cache) in bytes.
    │
    ├── leaf_page_size: integer (≥0)
    │   # Size of leaf pages in bytes.
    │
    ├── read_promotion_rate: any (optional)
    │   # Probability (0-100) of promoting a read record to the front of the buffer.
    │
    ├── read_record_cache: any (optional)
    │   # Whether to cache full pages on read.
    │
    └── scan_promotion_rate: any (optional)
        # Probability (0-100) of promoting a scanned record to the front of the buffer.
│
├── num_bits: integer (≥1)
│
├── pre_scale (optional): (any of)
    ├─ one of ["none", "reciprocal_mean_norm"]
    └─ "some"
│
├── quant_store_config (optional): (any of)
    ├── cache_only: any (optional)
    │   # If true, only use the in-memory circular buffer (no disk pages).
    │
    ├── cb_copy_on_access_ratio: any (optional)
    │   # Ratio of buffer used before copy-on-access kicks in.
    │
    ├── cb_max_record_size: any (optional)
    │   # Maximum record size that can be stored in the circular buffer.
    │
    ├── cb_min_record_size: any (optional)
    │   # Minimum record size for the circular buffer.
    │
    ├── cb_size_byte: integer (≥0)
    │   # Size of the circular buffer (in-memory write cache) in bytes.
    │
    ├── leaf_page_size: integer (≥0)
    │   # Size of leaf pages in bytes.
    │
    ├── read_promotion_rate: any (optional)
    │   # Probability (0-100) of promoting a read record to the front of the buffer.
    │
    ├── read_record_cache: any (optional)
    │   # Whether to cache full pages on read.
    │
    └── scan_promotion_rate: any (optional)
        # Probability (0-100) of promoting a scanned record to the front of the buffer.
│
├── search_phase: (one of)
    ├─ "topk"
    │  ├── groundtruth: string
    │  ├── num_threads: array of integer (≥1)
    │  ├── queries: string
    │  ├── reps: integer (≥1)
    │  └── runs: array of any
    ├─ "range"
    │  ├── groundtruth: string
    │  ├── num_threads: array of integer (≥1)
    │  ├── queries: string
    │  ├── reps: integer (≥1)
    │  └── runs: array of any
    ├─ "topk-beta-filter"
    │  ├── beta: number
    │  ├── data_labels: string
    │  ├── groundtruth: string
    │  ├── num_threads: array of integer (≥1)
    │  ├── queries: string
    │  ├── query_predicates: string
    │  ├── reps: integer (≥1)
    │  └── runs: array of any
    ├─ "topk-multihop-filter"
    │  ├── data_labels: string
    │  ├── groundtruth: string
    │  ├── num_threads: array of integer (≥1)
    │  ├── queries: string
    │  ├── query_predicates: string
    │  ├── reps: integer (≥1)
    │  └── runs: array of any
    └─ "topk-inline-filter"
       ├── adaptive_l: (union) (optional)
       ├── data_labels: string
       ├── groundtruth: string
       ├── num_threads: array of integer (≥1)
       ├── queries: string
       ├── query_predicates: string
       ├── reps: integer (≥1)
       └── runs: array of any
│
├── seed: integer (≥0)
│
├── transform_kind: (one of)
    ├─ one of ["null"]
    ├─ "padding_hadamard"
    ├─ "random_rotation"
    └─ "double_hadamard"
│
└── vector_store_config (optional): (any of)
    ├── cache_only: any (optional)
    │   # If true, only use the in-memory circular buffer (no disk pages).
    │
    ├── cb_copy_on_access_ratio: any (optional)
    │   # Ratio of buffer used before copy-on-access kicks in.
    │
    ├── cb_max_record_size: any (optional)
    │   # Maximum record size that can be stored in the circular buffer.
    │
    ├── cb_min_record_size: any (optional)
    │   # Minimum record size for the circular buffer.
    │
    ├── cb_size_byte: integer (≥0)
    │   # Size of the circular buffer (in-memory write cache) in bytes.
    │
    ├── leaf_page_size: integer (≥0)
    │   # Size of leaf pages in bytes.
    │
    ├── read_promotion_rate: any (optional)
    │   # Probability (0-100) of promoting a read record to the front of the buffer.
    │
    ├── read_record_cache: any (optional)
    │   # Whether to cache full pages on read.
    │
    └── scan_promotion_rate: any (optional)
        # Probability (0-100) of promoting a scanned record to the front of the buffer.

Example:

{
  "content": {
    "build": {
      "alpha": 1.2000000476837158,
      "backedge_ratio": 1.0,
      "data": "path/to/data",
      "data_type": "float32",
      "distance": "squared_l2",
      "insert_retry": null,
      "l_build": 50,
      "max_degree": 32,
      "multi_insert": {
        "batch_parallelism": 32,
        "batch_size": 128,
        "intra_batch_candidates": "none"
      },
      "num_threads": 1,
      "save_path": null,
      "start_point_strategy": "medoid"
    },
    "neighbor_store_config": null,
    "num_bits": 1,
    "pre_scale": null,
    "quant_store_config": null,
    "search_phase": {
      "groundtruth": "path/to/groundtruth",
      "num_threads": [
        1,
        2,
        4,
        8
      ],
      "queries": "path/to/queries",
      "reps": 5,
      "runs": [
        {
          "recall_k": 10,
          "search_l": [
            10,
            20,
            30,
            40
          ],
          "search_n": 10
        }
      ],
      "search-type": "topk"
    },
    "seed": 42,
    "transform_kind": "null",
    "vector_store_config": null
  },
  "type": "graph-index-build-bftree-spherical-quantization"
}

Single Field

cargo run -p diskann-benchmark --all-features -- inputs --schema graph-index-build-bftree-spherical-quantization --field build.start_point_strategy


Schema for "graph-index-build-bftree-spherical-quantization".build.start_point_strategy:

├─ "medoid" — Use the medoid as the starting point.
├─ "first_vector" — Use the first vector in the dataset.
├─ "random_vectors" — Randomly select vector(s) with given norm.
│  ├── norm: number
│  ├── nsamples: integer (≥1)
│  └── seed: integer
├─ "random_samples" — Sample data from the dataset.
│  ├── nsamples: integer (≥1)
│  └── seed: integer
└─ "latin_hyper_cube" — Use Latin Hypercube sampling.
   ├── nsamples: integer (≥1)
   └── seed: integer

Example:

"medoid"

Copilot

Pull request overview

This PR adds JSON Schema generation (via schemars) and a tree-style terminal renderer so diskann-benchmark users can discover benchmark input fields/variants using --schema and --field, rather than reading source.

Changes:

Add schemars::JsonSchema coverage across benchmark input/tolerance DTOs and generate per-input JSON Schemas from Input::Raw.
Introduce diskann-benchmark-runner::schema to render schemas (and drill into sub-fields) as human-readable CLI documentation.
Add schema/serialization drift tests and wire new CLI flags + README documentation.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
diskann-disk/src/build/configuration/quantization_types.rs	Adds a test intended to guard drift across `QuantizationType` variants/serialization.
diskann-disk/Cargo.toml	Adds `serde_json` as a dev-dependency for new tests.
diskann-benchmark/src/utils/mod.rs	Derives `JsonSchema` for `SimilarityMeasure` used in inputs.
diskann-benchmark/src/inputs/multi_vector.rs	Derives `JsonSchema` for multi-vector input types.
diskann-benchmark/src/inputs/graph_index.rs	Derives `JsonSchema` broadly; adds manual `JsonSchema` for `StartPointStrategyRef` + drift test; annotates schema override for the remote-serde field.
diskann-benchmark/src/inputs/filters.rs	Derives `JsonSchema` for filter-related inputs.
diskann-benchmark/src/inputs/exhaustive.rs	Derives `JsonSchema` for exhaustive-benchmark inputs.
diskann-benchmark/src/inputs/disk.rs	Adds schema proxy for `QuantizationType` (to avoid schemars dependency in `diskann-disk`) and derives `JsonSchema` for disk-index inputs.
diskann-benchmark/src/inputs/bftree.rs	Derives `JsonSchema` for bf_tree inputs.
diskann-benchmark/src/backend/multi_vector/driver.rs	Derives `JsonSchema` for multi-vector tolerance input.
diskann-benchmark/src/backend/disk_index/benchmarks.rs	Derives `JsonSchema` for disk-index tolerance input.
diskann-benchmark/README.md	Documents `--schema` and `--field` usage.
diskann-benchmark/Cargo.toml	Adds `schemars` dependency for input schema generation.
diskann-benchmark-simd/src/lib.rs	Derives `JsonSchema` for SIMD input/tolerance types.
diskann-benchmark-simd/Cargo.toml	Adds `schemars` dependency.
diskann-benchmark-runner/src/utils/num.rs	Implements `JsonSchema` for `NonNegativeFinite`.
diskann-benchmark-runner/src/utils/datatype.rs	Derives `JsonSchema` for `DataType`.
diskann-benchmark-runner/src/test/typed.rs	Updates test inputs/tolerances to derive `JsonSchema`.
diskann-benchmark-runner/src/test/dim.rs	Updates test inputs/tolerances to derive `JsonSchema`.
diskann-benchmark-runner/src/schema.rs	Adds schema renderer + path resolver + unit tests.
diskann-benchmark-runner/src/lib.rs	Exposes the new `schema` module.
diskann-benchmark-runner/src/input.rs	Requires `Input::Raw: JsonSchema` and adds `Registered::schema()` plumbing.
diskann-benchmark-runner/src/files.rs	Derives `JsonSchema` for `InputFile`.
diskann-benchmark-runner/src/app.rs	Adds `--schema`/`--field` CLI flags and wiring to render schema docs + example.
diskann-benchmark-runner/Cargo.toml	Adds `colored` + `schemars` dependencies.
Cargo.toml	Adds `schemars` to workspace dependencies.
Cargo.lock	Locks new `schemars`/`colored` (and transitive) dependencies.
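
For context on why `NonNegativeFinite` maps to "number with minimum", here is a minimal sketch of such a newtype. The definition in diskann-benchmark-runner is not shown in this PR page, so the struct layout, `new`, and `get` below are assumptions for illustration; the corresponding JSON Schema fragment would be a `number` with `minimum: 0`.

```rust
/// Hypothetical stand-in for `NonNegativeFinite`: an f64 that is finite and >= 0.
#[derive(Debug, Clone, Copy, PartialEq)]
struct NonNegativeFinite(f64);

impl NonNegativeFinite {
    /// Validate on construction so every instance upholds the invariant.
    fn new(value: f64) -> Result<Self, String> {
        if !value.is_finite() {
            Err(format!("{value} is not finite"))
        } else if value < 0.0 {
            Err(format!("{value} is negative"))
        } else {
            Ok(Self(value))
        }
    }

    /// Access the validated value.
    fn get(self) -> f64 {
        self.0
    }
}
```

Because the invariant is enforced by custom validation rather than by the field's type, a derived schema cannot express the lower bound, which is presumably why a manual impl was written.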


+    /// Ensures the manual `JsonSchema` impl stays in sync with actual variants.
+    /// If a variant is added to `QuantizationType`, this match will fail to compile.
+    #[test]
+    fn schema_covers_all_quantization_variants() {


+            }
+            s
+        }
+        Some("number") => "number".to_string(),


+            let generator =
+                schemars::generate::SchemaSettings::default().into_generator();
+            let schema = generator.into_root_schema_for::<T::Raw>();
+            serde_json::to_value(schema).unwrap_or_default()


codecov-commenter · 2026-06-12T18:58:53Z

Codecov Report

❌ Patch coverage is 65.10574% with 231 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.77%. Comparing base (b5ebac2) to head (14c1318).

Files with missing lines	Patch %	Lines
diskann-benchmark-runner/src/schema.rs	74.74%	124 Missing ⚠️
diskann-benchmark/src/inputs/graph_index.rs	41.66%	49 Missing ⚠️
diskann-benchmark-runner/src/app.rs	20.93%	34 Missing ⚠️
diskann-benchmark-runner/src/utils/num.rs	0.00%	9 Missing ⚠️
diskann-benchmark-runner/src/input.rs	0.00%	8 Missing ⚠️
diskann-benchmark/src/inputs/post_processor.rs	0.00%	6 Missing ⚠️
...disk/src/build/configuration/quantization_types.rs	95.23%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1154      +/-   ##
==========================================
- Coverage   90.99%   89.77%   -1.22%     
==========================================
  Files         489      490       +1     
  Lines       93130    93785     +655     
==========================================
- Hits        84746    84199     -547     
- Misses       8384     9586    +1202

Flag	Coverage Δ
miri	`89.77% <65.10%> (-1.22%)`	⬇️
unittests	`89.43% <65.10%> (-1.53%)`	⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
diskann-benchmark-runner/src/files.rs	`100.00% <ø> (ø)`
diskann-benchmark-runner/src/test/dim.rs	`89.79% <ø> (ø)`
diskann-benchmark-runner/src/test/typed.rs	`97.10% <ø> (ø)`
diskann-benchmark-runner/src/utils/datatype.rs	`100.00% <ø> (ø)`
diskann-benchmark-simd/src/lib.rs	`83.03% <ø> (ø)`
diskann-benchmark/src/inputs/disk.rs	`1.50% <ø> (ø)`
diskann-benchmark/src/inputs/exhaustive.rs	`26.83% <ø> (ø)`
diskann-benchmark/src/inputs/filters.rs	`67.74% <ø> (ø)`
diskann-benchmark/src/inputs/multi_vector.rs	`19.67% <ø> (ø)`
diskann-benchmark/src/utils/mod.rs	`83.33% <ø> (ø)`
... and 7 more

... and 40 files with indirect coverage changes


Adds schemars-based JSON Schema generation and a custom tree-style terminal renderer for benchmark input types. Users can run `inputs <name> --schema` to see field documentation with types, optionality, enum variants, and descriptions, followed by the example JSON.

Implementation:
- Add schemars 1.2 to workspace; derive JsonSchema on all input types
- Custom renderer in diskann-benchmark-runner/src/schema.rs with:
  - Colored output (field names bold, types cyan, variants yellow)
  - Multi-line description alignment
  - Handles internally/externally-tagged enums, newtypes with $ref
  - MAX_DEPTH guard against recursive schemas
- Manual JsonSchema impls for custom-serde types:
  - NonNegativeFinite (number with minimum)
  - StartPointStrategyRef (externally-tagged enum, with drift test)
  - QuantizationTypeSchema proxy (keeps schemars out of diskann-disk)
- JsonSchema bound added to Input::Raw trait

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


hildebrandmw

Thanks Jordan. I'm very supportive of a feature along these lines. Especially when enums are involved, obtaining the set of valid representations is currently an exercise in guess-and-check.

That said, there are several aspects of this particular approach that make me hesitant:

Testing: diskann-benchmark-runner has a pretty extensive UX test suite for comparing app input and output. This makes it easier to write tests, observe the expected output, and monitor changes that get made. This PR does not use this methodology, so many paths in the schema renderer are not covered by the checked-in tests.
The example representation for enums seems sub-par, with only a single variant displayed. I don't think this is fixable with schemars alone.
The hand-written schemas (e.g. StartPointStrategyRef) worry me from a maintenance standpoint. The tests don't actually protect against drift here.
The renderer is fairly inscrutable, doesn't display nesting separators particularly well, and would at the very least benefit greatly from UX tests.

This is a good first step, but I would like to see several things ironed out first:

Enable examples for all enum variants. This may require changes to how Input works; that's fine.
Address the potential for schema drift. There are some options here. When the type is a mirror (e.g. StartPointStrategyRef), derive JsonSchema on the mirror directly. In any case, please add tests that validate an input's example type matches its generated schema.
Use the UX test framework.
Please revisit the style of the renderer. For example, render_node has an unused _is_last argument.
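
The requested check, that an input's example matches its generated schema, could look roughly like the sketch below. It uses a deliberately tiny, hypothetical `Value`/`Schema` model instead of serde_json and schemars, so every name here is an assumption; a real test would compare the actual example JSON against the actual generated schema.

```rust
/// Hypothetical miniature JSON value.
enum Value {
    Null,
    Int(i64),
    Str(String),
    Object(Vec<(String, Value)>),
}

/// Hypothetical miniature schema.
enum Schema {
    Integer { minimum: Option<i64> },
    OneOf(Vec<&'static str>),
    /// Each entry is (field name, field schema, required).
    Object(Vec<(&'static str, Schema, bool)>),
}

/// Check `value` against `schema`, reporting the first mismatch with its path.
fn validate(schema: &Schema, value: &Value, path: &str) -> Result<(), String> {
    match (schema, value) {
        (Schema::Integer { minimum }, Value::Int(n)) => match minimum {
            Some(min) if n < min => Err(format!("{path}: {n} is below minimum {min}")),
            _ => Ok(()),
        },
        (Schema::OneOf(variants), Value::Str(s)) => {
            if variants.iter().any(|v| *v == s.as_str()) {
                Ok(())
            } else {
                Err(format!("{path}: unknown variant {s:?}"))
            }
        }
        (Schema::Object(fields), Value::Object(entries)) => {
            // A field present in the example but absent from the schema is drift.
            for (key, _) in entries {
                if !fields.iter().any(|(name, _, _)| *name == key.as_str()) {
                    return Err(format!("{path}.{key}: not in schema"));
                }
            }
            for (name, field_schema, required) in fields {
                let child = format!("{path}.{name}");
                let found = entries
                    .iter()
                    .find(|(k, _)| k.as_str() == *name)
                    .map(|(_, v)| v);
                match found {
                    // Absent or null is acceptable only for optional fields.
                    None | Some(Value::Null) => {
                        if *required {
                            return Err(format!("{child}: missing required field"));
                        }
                    }
                    Some(v) => validate(field_schema, v, &child)?,
                }
            }
            Ok(())
        }
        _ => Err(format!("{path}: type mismatch")),
    }
}
```

Running such a check over every registered input's example would fail as soon as a hand-written schema and the serialized form diverge, which is the drift the comment above is concerned with.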

- Clarify quantization schema test docs: the JsonSchema impl lives in the QuantizationTypeSchema proxy (diskann-benchmark), not in diskann-disk
- Add minimum: 0 to u64 seed fields in StartPointStrategyRef schema
- Render min/max constraints for number types in type_summary, matching integers
- Fail loudly if RootSchema serialization fails instead of defaulting silently

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
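
Rendering min/max constraints the same way for numbers and integers might look like the following sketch. The real `type_summary` signature is not shown on this page, so the `Numeric` struct and the function shape are assumptions; only the `(≥N)` notation is taken from the sample output, and the combined `(≥lo, ≤hi)` form is a guess.

```rust
/// Hypothetical view of a numeric schema node.
struct Numeric {
    /// "integer" or "number".
    kind: &'static str,
    minimum: Option<f64>,
    maximum: Option<f64>,
}

/// Build a one-line summary such as "integer (≥1)".
fn type_summary(n: &Numeric) -> String {
    match (n.minimum, n.maximum) {
        (Some(lo), Some(hi)) => format!("{} (≥{}, ≤{})", n.kind, lo, hi),
        (Some(lo), None) => format!("{} (≥{})", n.kind, lo),
        (None, Some(hi)) => format!("{} (≤{})", n.kind, hi),
        (None, None) => n.kind.to_string(),
    }
}
```

Rust's `Display` for `f64` prints whole values without a trailing `.0`, so a minimum of `1.0` renders as `≥1`.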

JordanMaples requested review from a team and Copilot June 12, 2026 18:42

Copilot started reviewing on behalf of JordanMaples June 12, 2026 18:42

Copilot AI reviewed Jun 12, 2026


JordanMaples force-pushed the jordanmaples/benchmark_schema branch from 83b159e to f764227 Compare June 15, 2026 20:38

JordanMaples and others added 4 commits June 22, 2026 08:56

Document --schema and --field CLI options in benchmark README

4934ad1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

remove lifetime

f2fc0e6

fmt

14c1318

JordanMaples force-pushed the jordanmaples/benchmark_schema branch from f764227 to 14c1318 Compare June 22, 2026 16:01

hildebrandmw reviewed Jun 22, 2026

JordanMaples wants to merge 5 commits into main from jordanmaples/benchmark_schema

4 participants
