Skip to content

Cosmos: Use regional endpoint fallback for data-plane operations#4503

Open
simorenoh wants to merge 13 commits into
mainfrom
issue4487-regional-fallback
Open

Cosmos: Use regional endpoint fallback for data-plane operations#4503
simorenoh wants to merge 13 commits into
mainfrom
issue4487-regional-fallback

Conversation

@simorenoh
Copy link
Copy Markdown
Member

@simorenoh simorenoh commented Jun 1, 2026

When all regional endpoints are excluded or unavailable, the SDK now falls back to the hub/primary write region endpoint instead of the global account endpoint for data-plane and non-account metadata operations.

This aligns with the behavior of other Azure Cosmos DB SDKs and ensures data-plane traffic always routes through regional endpoints.

Fallback behavior by account type:

  • Single-write accounts: When all regional endpoints are excluded, data-plane requests fall back to the hub (primary write region) endpoint. This is the only write region and is always present in preferred_write_endpoints.
  • Multi-write accounts: When all regional endpoints are excluded, data-plane requests fall back to the first entry in preferred_write_endpoints, which is the user's most-preferred write region based on their RoutingStrategy configuration.

In both cases, the global account endpoint is never used for data-plane operations — it is reserved exclusively for metadata operations such as account topology discovery.

Changes:

  • resolve_endpoint() fallback now branches on is_dataplane: data-plane ops fall back to preferred_write_endpoints[0], metadata ops fall back to default_endpoint
  • Added CosmosEndpoint::is_global() helper
  • Added debug_assert! to catch data-plane requests that resolve to the global endpoint
  • Renamed existing resolve_endpoint_falls_back_to_global_when_all_excluded test to clarify it covers metadata behavior
  • Added unit tests for data-plane hub fallback with all regions excluded
  • Added multi-write fault injection test asserting no data-plane traffic reaches the global endpoint

Fixes #4487

@simorenoh simorenoh requested a review from a team as a code owner June 1, 2026 15:20
Copilot AI review requested due to automatic review settings June 1, 2026 15:20
@github-actions github-actions Bot added the Cosmos The azure_cosmos crate label Jun 1, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Cosmos DB endpoint resolution to ensure data-plane operations fall back to the hub/primary write region endpoint (a regional endpoint) when all regional endpoints are excluded/unavailable, while metadata operations continue to fall back to the global account endpoint. This aligns routing behavior with other Cosmos SDKs and addresses #4487.

Changes:

  • Updated driver endpoint selection fallback behavior based on whether the operation is data-plane vs metadata, and added CosmosEndpoint::is_global() plus a debug_assert! invariant for data-plane routing.
  • Added a new public routing configuration option RoutingStrategy::PreferredRegions(Vec<Region>) and refactored SDK builder logic to translate routing strategies into driver preferred regions.
  • Added/updated unit + multi-write fault-injection tests to validate hub fallback and ensure no data-plane traffic uses the global endpoint.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
sdk/cosmos/azure_data_cosmos/tests/multi_write_tests/cosmos_multi_write_fault_injection.rs Adds a multi-write test asserting excluded-region reads fall back to hub and avoid global endpoint.
sdk/cosmos/azure_data_cosmos/src/routing_strategy.rs Adds RoutingStrategy::PreferredRegions and documents selection behavior.
sdk/cosmos/azure_data_cosmos/src/options/mod.rs Removes now-unused internal application_region from client options.
sdk/cosmos/azure_data_cosmos/src/clients/cosmos_client.rs Minor formatting-only change.
sdk/cosmos/azure_data_cosmos/src/clients/cosmos_client_builder.rs Refactors builder to convert RoutingStrategy into driver options; adds unit tests for strategy conversion.
sdk/cosmos/azure_data_cosmos/CHANGELOG.md Adds an unreleased changelog entry for the new routing strategy variant.
sdk/cosmos/azure_data_cosmos_driver/src/driver/routing/endpoint.rs Adds CosmosEndpoint::is_global() helper.
sdk/cosmos/azure_data_cosmos_driver/src/driver/pipeline/operation_pipeline.rs Implements data-plane vs metadata fallback behavior and expands unit tests for the new routing behavior.

Comment thread sdk/cosmos/azure_data_cosmos/CHANGELOG.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos/src/routing_strategy.rs
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/driver/pipeline/operation_pipeline.rs Outdated
simorenoh added a commit that referenced this pull request Jun 1, 2026
Data-plane ops fall back to the hub/write region when
all regional endpoints are excluded. Metadata ops fall
back to the global account endpoint. Aligns with #4503.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread sdk/cosmos/.cspell.json Outdated
@simorenoh simorenoh requested a review from ananth7592 June 2, 2026 19:05
Copy link
Copy Markdown
Member

@ananth7592 ananth7592 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When all regional endpoints are excluded, data-plane requests fall back to the first entry in preferred_write_endpoints

Would be good to explain motivation and reasoning for this - IMO there are also some reasons why we should still use the Hub region instead of most preferred write region - mostly because ultimately MultiMaster can survive some period of time without Hub Region but not prolonged - and sticking to Hub region is the safest bet in general - also, RU charges for writes in Hub region are lower than in any sattelite region - so, risk is lower.

Metadata operations (e.g., account topology discovery) continue to use the global endpoint as before

The only call that should for sure go to the global (or whatever customer provided) endpoint is the bootstrapping - very first account topology update.
For subsequent metadata operations i could see some reasons to stick to global endpint specifically for account topology reads - because this increases teh chances of actually hitting the hub region - which is the only place with guaranteed fress topology
For other metdata oeprations (!= Account) I would argue that like for document operations we should never use the global endpoint

let requests = response.diagnostics().requests();
for req in requests.iter() {
assert!(
req.region().is_some(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also assert that this is actually the hub region endpoint?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

.await?;

// Ensure replication to the read region before excluding it.
let _ = run_context
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't fail or assert that the replication actually happens though, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep that's right - deleted this in favor of the in-memory emulator and checking there

// valid write endpoint is the hub, so the SDK must fall back to
// that regional endpoint and never the global one.
let mut operation = OperationOptions::default();
operation.excluded_regions =
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, just so that I understand it, we always fall back to the hub region when we run out of regions to try, even if the user explicitly excludes it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's the idea - passing in the default endpoint in theory does the same thing but is handled differently by the service. regional endpoints are behind different compute nodes that have access to ATM, while the default endpoint does not.

/// Validates the fix for issue #4487.
#[tokio::test]
#[cfg_attr(
not(test_category = "multi_region"),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you have to add a new test config and only do this in live tests - in_memory_emulator based tests should be perfectly sufficient for this level of validation (you are purely focused on SDK/driver behavior)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, switched to in-memory emulator

.await?;

// Ensure replication to both regions.
let _ = run_context
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is that ensuring replication to both regions?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it wasn't - removed this and changed to in-memory emulator

.create_item(&pk, &item_id, &item, None)
.await?;

// Ensure replication to the read region before excluding it.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is that ensuring replication - could very well return 404?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above


let requests = response.diagnostics().requests();
assert!(
!requests.is_empty(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as for SDK -level tests - also not quite sure we need these tests on both levels (driver and SDK)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, generally we've focused on just having integration tests in the SDK, since that should provide coverage over the driver code paths.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, should just be driver, will delete these

..OperationOptions::default()
};

let response = context
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dito - same comments as in correspodning SDK test - do we need tests for SDK and driver?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto above

Copy link
Copy Markdown
Member

@tvaron3 tvaron3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@simorenoh
Copy link
Copy Markdown
Member Author

Heads up on the current red CI: the cargo package step is failing on azure_data_cosmos with

failed to select a version for the requirement `azure_core_test = "^0.2.0"`
candidate versions found which didn't match: 0.1.0

This is not caused by this PR — it's a workspace-wide release-engineering gap:

  • [workspace.dependencies.azure_core_test] in the root Cargo.toml declares version = "0.2.0" (bumped on main by 5059571c2 + b7bed92f8).
  • azure_data_cosmos picks it up as a dev-dep via workspace = true.
  • When cargo package runs on the published azure_data_cosmos crate it strips the workspace path and tries to resolve ^0.2.0 from crates.io, which still only has 0.1.0.
  • main has the identical declaration, so every PR currently hits this — reproducible locally with cargo package --locked --package azure_data_cosmos.

It'll go green once azure_core_test 0.2.0 is published to crates.io. Nothing to do on this branch for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Cosmos The azure_cosmos crate

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

Cosmos: fallback to primary endpoint as opposed to default endpoint

6 participants