Cosmos: Use regional endpoint fallback for data-plane operations#4503
Cosmos: Use regional endpoint fallback for data-plane operations#4503simorenoh wants to merge 13 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates Cosmos DB endpoint resolution to ensure data-plane operations fall back to the hub/primary write region endpoint (a regional endpoint) when all regional endpoints are excluded/unavailable, while metadata operations continue to fall back to the global account endpoint. This aligns routing behavior with other Cosmos SDKs and addresses #4487.
Changes:
- Updated driver endpoint selection fallback behavior based on whether the operation is data-plane vs metadata, and added
CosmosEndpoint::is_global()plus adebug_assert!invariant for data-plane routing. - Added a new public routing configuration option
RoutingStrategy::PreferredRegions(Vec<Region>)and refactored SDK builder logic to translate routing strategies into driver preferred regions. - Added/updated unit + multi-write fault-injection tests to validate hub fallback and ensure no data-plane traffic uses the global endpoint.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure_data_cosmos/tests/multi_write_tests/cosmos_multi_write_fault_injection.rs | Adds a multi-write test asserting excluded-region reads fall back to hub and avoid global endpoint. |
| sdk/cosmos/azure_data_cosmos/src/routing_strategy.rs | Adds RoutingStrategy::PreferredRegions and documents selection behavior. |
| sdk/cosmos/azure_data_cosmos/src/options/mod.rs | Removes now-unused internal application_region from client options. |
| sdk/cosmos/azure_data_cosmos/src/clients/cosmos_client.rs | Minor formatting-only change. |
| sdk/cosmos/azure_data_cosmos/src/clients/cosmos_client_builder.rs | Refactors builder to convert RoutingStrategy into driver options; adds unit tests for strategy conversion. |
| sdk/cosmos/azure_data_cosmos/CHANGELOG.md | Adds an unreleased changelog entry for the new routing strategy variant. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/routing/endpoint.rs | Adds CosmosEndpoint::is_global() helper. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/pipeline/operation_pipeline.rs | Implements data-plane vs metadata fallback behavior and expands unit tests for the new routing behavior. |
Data-plane ops fall back to the hub/write region when all regional endpoints are excluded. Metadata ops fall back to the global account endpoint. Aligns with #4503. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FabianMeiswinkel
left a comment
There was a problem hiding this comment.
When all regional endpoints are excluded, data-plane requests fall back to the first entry in preferred_write_endpoints
Would be good to explain motivation and reasoning for this - IMO there are also some reasons why we should still use the Hub region instead of most preferred write region - mostly because ultimately MultiMaster can survive some period of time without Hub Region but not prolonged - and sticking to Hub region is the safest bet in general - also, RU charges for writes in Hub region are lower than in any sattelite region - so, risk is lower.
Metadata operations (e.g., account topology discovery) continue to use the global endpoint as before
The only call that should for sure go to the global (or whatever customer provided) endpoint is the bootstrapping - very first account topology update.
For subsequent metadata operations i could see some reasons to stick to global endpint specifically for account topology reads - because this increases teh chances of actually hitting the hub region - which is the only place with guaranteed fress topology
For other metdata oeprations (!= Account) I would argue that like for document operations we should never use the global endpoint
| let requests = response.diagnostics().requests(); | ||
| for req in requests.iter() { | ||
| assert!( | ||
| req.region().is_some(), |
There was a problem hiding this comment.
Also assert that this is actually the hub region endpoint?
| .await?; | ||
|
|
||
| // Ensure replication to the read region before excluding it. | ||
| let _ = run_context |
There was a problem hiding this comment.
This doesn't fail or assert that the replication actually happens though, right?
There was a problem hiding this comment.
yep that's right - deleted this in favor of the in-memory emulator and checking there
| // valid write endpoint is the hub, so the SDK must fall back to | ||
| // that regional endpoint and never the global one. | ||
| let mut operation = OperationOptions::default(); | ||
| operation.excluded_regions = |
There was a problem hiding this comment.
So, just so that I understand it, we always fall back to the hub region when we run out of regions to try, even if the user explicitly excludes it?
There was a problem hiding this comment.
yes, that's the idea - passing in the default endpoint in theory does the same thing but is handled differently by the service. regional endpoints are behind different compute nodes that have access to ATM, while the default endpoint does not.
| /// Validates the fix for issue #4487. | ||
| #[tokio::test] | ||
| #[cfg_attr( | ||
| not(test_category = "multi_region"), |
There was a problem hiding this comment.
I don't think you have to add a new test config and only do this in live tests - in_memory_emulator based tests should be perfectly sufficient for this level of validation (you are purely focused on SDK/driver behavior)
There was a problem hiding this comment.
done, switched to in-memory emulator
| .await?; | ||
|
|
||
| // Ensure replication to both regions. | ||
| let _ = run_context |
There was a problem hiding this comment.
how is that ensuring replication to both regions?
There was a problem hiding this comment.
it wasn't - removed this and changed to in-memory emulator
| .create_item(&pk, &item_id, &item, None) | ||
| .await?; | ||
|
|
||
| // Ensure replication to the read region before excluding it. |
There was a problem hiding this comment.
How is that ensuring replication - could very well return 404?
|
|
||
| let requests = response.diagnostics().requests(); | ||
| assert!( | ||
| !requests.is_empty(), |
There was a problem hiding this comment.
Same comments as for SDK -level tests - also not quite sure we need these tests on both levels (driver and SDK)
There was a problem hiding this comment.
True, generally we've focused on just having integration tests in the SDK, since that should provide coverage over the driver code paths.
There was a problem hiding this comment.
yep, should just be driver, will delete these
| ..OperationOptions::default() | ||
| }; | ||
|
|
||
| let response = context |
There was a problem hiding this comment.
dito - same comments as in correspodning SDK test - do we need tests for SDK and driver?
|
Heads up on the current red CI: the This is not caused by this PR — it's a workspace-wide release-engineering gap:
It'll go green once |
When all regional endpoints are excluded or unavailable, the SDK now falls back to the hub/primary write region endpoint instead of the global account endpoint for data-plane and non-account metadata operations.
This aligns with the behavior of other Azure Cosmos DB SDKs and ensures data-plane traffic always routes through regional endpoints.
Fallback behavior by account type:
In both cases, the global account endpoint is never used for data-plane operations — it is reserved exclusively for metadata operations such as account topology discovery.
Changes:
Fixes #4487