[server][controller] Change defaults to production values by kvargha · Pull Request #2863 · linkedin/venice

kvargha · 2026-06-12T22:22:39Z

Problem Statement

From the server/controller production-defaults audit:

Server Docker/standalone samples override safer OSS code defaults for rolling-restart behavior, so sample-based users diverge from the code-default path.
The controller cluster resource uses the legacy CRUSH-family rebalancer, not WAGED; OSS leaves the Helix WAGED rebalance-preference/capacity unset; and the non-HaaS path (VeniceHelixAdmin) does not apply the WAGED cluster config the HaaS path (ZkHelixAdminClient) sets.

Solution

Server sample defaults

Drop server.promotion.to.leader.replica.delay.seconds=1 and server.netty.graceful.shutdown.period.seconds=0 from the server Docker/standalone samples so they inherit the code defaults (300s / 30s).
Keep the legacy timing overrides only on the Pulsar integration-test harness.

Controller WAGED defaults

Default the Helix WAGED rebalance preference (EVENNESS=10, LESS_MOVEMENT=1, FORCE_BASELINE_CONVERGE=0) and capacity (instance 10000, weight 100) in VeniceControllerClusterConfig instead of leaving them unset.
Assign the controller-cluster resource with WAGED on both paths: VeniceHelixAdmin switches to WagedRebalancer; ZkHelixAdminClient drops the explicit AutoRebalanceStrategy.
Apply the full WAGED cluster config on the non-HaaS path at parity with ZkHelixAdminClient.
Set GLOBAL_REBALANCE_ASYNC_MODE=false on the controller cluster: the non-HaaS controller runs the Helix pipeline in-process and blocks on the resource appearing in the external view at startup, but WAGED computes its baseline async by default, so the first pipeline run is empty and can miss that startup window. Synchronous global rebalance computes the assignment on the first run.
Thread numberOfControllers through the test controller-creation path and cap CONTROLLER_CLUSTER_REPLICA to the controller count, avoiding WAGED FAILED_TO_CALCULATE.

Push-start timeouts

offline.job.start.timeout.ms: 120000 (2 min) -> 960000 (16 min), the production value (controller-side wait for a new version's replicas to be assigned).
ControllerClient.DEFAULT_REQUEST_TIMEOUT_MS: 10 min -> 17 min, above the 16-min controller wait, so request_topic does not expire client-side and retry the non-idempotent create.

No other capacity, hardware, topology, blob-transfer, server-shutdown, or partition-drop defaults change.

Code changes

Added new code behind a config. If so list the config names and their default values in the PR description.
Introduced new log lines.
- Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

Code has no race conditions or thread safety issues.
Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

New unit tests added.
New integration tests added.
Modified or extended existing tests.
Verified backward compatibility (if applicable).

Extended TestVeniceControllerClusterConfig and TestZkHelixAdminClient; validated WAGED controller-cluster convergence via the existing multi-region E2E suite. Full CI passes.

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.
Yes. Server samples inherit the code defaults for promotion delay and graceful shutdown. The controller-cluster resource uses WAGED (production rebalance-preference/capacity, synchronous global rebalance) instead of the legacy CRUSH-family rebalancer. offline.job.start.timeout.ms goes 2 min -> 16 min and the controller client request timeout 10 min -> 17 min.

Copilot

Pull request overview

Aligns OSS sample server/docker configs and controller-cluster Helix resource setup with production/code defaults, so users starting from provided samples don’t unknowingly diverge from the standard operational behavior.

Changes:

Removed sample overrides for server leader-promotion delay and Netty graceful shutdown period so samples inherit existing code defaults (300s and 30s respectively).
Updated controller-cluster resource rebalance strategy for the storage-cluster resource from CRUSH to CRUSHED (CrushEdRebalanceStrategy).
Updated the related comment in VeniceHelixAdmin to reflect the strategy intent.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
services/venice-standalone/config/server.properties	Removes sample overrides so standalone inherits code/prod defaults for promotion delay and graceful shutdown.
docker/venice-server/single-dc-configs/server.properties	Removes docker single-DC sample overrides so defaults match production/code behavior.
docker/venice-server/multi-dc-configs/dc-0/server.properties	Removes docker multi-DC (dc-0) sample overrides so defaults match production/code behavior.
docker/venice-server/multi-dc-configs/dc-1/server.properties	Removes docker multi-DC (dc-1) sample overrides so defaults match production/code behavior.
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java	Switches controller-cluster resource rebalance strategy to CRUSHED for the storage-cluster resource.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Number the clusterProperties locals in testInvalidRebalancePreferenceAndCapacityKeys 1-7 in declaration order after the capacity positivity cases were inserted. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Switch the legacy (non-HaaS) controller-cluster resource rebalancer from DelayedAutoRebalancer to WagedRebalancer, matching the HaaS path. WAGED computes an all-or-nothing assignment, so it cannot place a controller-cluster resource that requests more replicas than there are controllers; DelayedAutoRebalancer tolerated that under-replication. The integration-test framework runs 1 controller by default but left controller.cluster.replica at its default of 3, so thread the controller count through VeniceControllerCreateOptions and set CONTROLLER_CLUSTER_REPLICA = min(numberOfControllers, 3), mirroring how kafka.replication.factor is set to match the single test broker. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

controller cluster WAGED computes its baseline assignment asynchronously by default (GLOBAL_REBALANCE_ASYNC_MODE defaults to true), so a freshly added controller-cluster resource gets an empty assignment on the first rebalance pipeline run and only converges after the async baseline finishes and re-triggers the pipeline. The non-HaaS controller runs that pipeline in-process and blocks in waitUntilClusterResourceIsVisibleInEV during startup, so the async convergence can miss the startup window and the controller fails to start. Set GLOBAL_REBALANCE_ASYNC_MODE=false on the controller cluster so WAGED computes the baseline synchronously in the first pipeline run and the external view becomes visible promptly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

cluster The non-HaaS controller-cluster creation path only set auto-join, topology-aware=false, and synchronous global rebalance, while the HaaS path additionally applies the global rebalance preference, the capacity config, and persist-best-possible. Mirror the HaaS WAGED cluster config here so both paths assign the controller-cluster resource the same way. Switch from setConfig(HelixConfigScope, Map) to updateClusterConfigs(ClusterConfig) because the structured rebalance-preference and capacity maps cannot be carried by the string-keyed config map. Keep GLOBAL_REBALANCE_ASYNC_MODE=false, which the non-HaaS in-process pipeline needs but the HaaS dedicated controller does not. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…gy config The controller cluster is managed by WAGED, not CRUSH, so the comment claiming the topology/fault-zone fields are used by the CRUSH algorithm is inaccurate. WAGED also consumes these fields (AssignableNode computes the fault zone from the cluster topology config), so reword to describe fault-zone-aware placement for the configured rebalancer in both the non-HaaS (VeniceHelixAdmin) and HaaS (ZkHelixAdminClient) paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…tion value Change the default from 120000 (2 min) to 960000 (16 min), matching the value used in production. This is the controller-side ceiling for how long addVersion waits for a new version's replicas to be assigned before failing the push; the longer budget keeps in-flight pushes alive across transient storage-node disruption such as rolling restarts. Update the adjacent comment, which referenced a stale 3-minute client timeout, to describe the actual coupling with the controller client's per-request timeout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ve the controller wait The version-creation (request_topic) call can block on the controller for up to offline.job.start.timeout.ms while addVersion waits for the new version's replicas to be assigned. The client request timeout was hardcoded at 10 min (DEFAULT_REQUEST_TIMEOUT_MS), so once that controller-side wait was raised to 16 min, a slow assignment made the client time out first; the inner ControllerClient retry loop then re-issued the non-idempotent request_topic, creating extra abandoned versions. Add a configurable controller.request.topic.timeout.ms (default 20 min, above the 16-min controller wait) and thread it from the VPJ to a new requestTopicForWrites overload, so the controller always responds (with success or its own error) before the client times out. Other controller calls keep the existing 10-min default. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ble, above the" This reverts commit 4fe3090.

…tart wait The request_topic (version creation) call blocks on the controller for up to offline.job.start.timeout.ms (16 min) while addVersion waits for the new version's replicas to be assigned. The client per-request timeout was 10 min, so a slow assignment made the client time out first and retry the non-idempotent create. Raise DEFAULT_REQUEST_TIMEOUT_MS to 17 min so the controller responds before the client gives up. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

VeniceControllerClusterConfig now always provides a non-null rebalance preference and capacity config (production defaults), so the null-guards around setGlobalRebalancePreference and the capacity setters in both the non-HaaS (VeniceHelixAdmin) and HaaS (ZkHelixAdminClient) controller-cluster setup are dead code. Remove them. Delete the TestZkHelixAdminClient cases that fed null/partial preference/capacity through the mock (now-impossible states; the config defaults and validation are covered by TestVeniceControllerClusterConfig), and stub the defaults in setUp so the remaining real-method tests reflect the non-null contract. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

TestHAASController The controller-cluster WAGED config setters no longer guard against a null rebalance preference or capacity config, because VeniceControllerClusterConfig always provides them from production defaults. TestHAASController's testCloudConfig and testConcurrentClusterInitialization mock VeniceControllerMultiClusterConfig, whose unstubbed getters returned null and previously relied on the removed guards. Stub both getters non-null to mirror the production contract so createVeniceControllerCluster completes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The implicit CONTROLLER_CLUSTER_REPLICA override derived the replica count from options.getNumberOfControllers() capped at 3. A cluster built with numberOfControllers(0) that later adds a controller would set the replica count to 0, producing an invalid Helix IdealState (replicas=0, minActiveReplicas=1) and WAGED rebalance failures. Clamp the lower bound to 1. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings June 12, 2026 22:22

Copilot started reviewing on behalf of kvargha June 12, 2026 22:23 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings June 13, 2026 00:01

Copilot started reviewing on behalf of kvargha June 13, 2026 00:01 View session

Copilot AI reviewed Jun 13, 2026

View reviewed changes

kvargha marked this pull request as draft June 13, 2026 00:13

kvargha force-pushed the kvargha/server-production-defaults branch from 29fd31c to 4cff363 Compare June 13, 2026 04:44

kvargha marked this pull request as ready for review June 13, 2026 05:48

Copilot AI review requested due to automatic review settings June 13, 2026 05:49

Copilot started reviewing on behalf of kvargha June 13, 2026 05:49 View session

Copilot AI reviewed Jun 13, 2026

View reviewed changes

Comment thread tests/venice-pulsar-test/src/pulsarIntegrationTest/resources/docker-compose-pulsar-venice.yaml Outdated

Comment thread services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java Outdated

kvargha force-pushed the kvargha/server-production-defaults branch from 4cff363 to 3a9cdab Compare June 15, 2026 18:02

Copilot AI review requested due to automatic review settings June 15, 2026 18:03

kvargha force-pushed the kvargha/server-production-defaults branch from 3a9cdab to ba85d80 Compare June 15, 2026 18:03

Copilot started reviewing on behalf of kvargha June 15, 2026 18:03 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread tests/venice-pulsar-test/src/pulsarIntegrationTest/resources/docker-compose-pulsar-venice.yaml

Comment thread services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java Outdated

kvargha force-pushed the kvargha/server-production-defaults branch 2 times, most recently from fa27b0b to d75fb7e Compare June 15, 2026 18:43

Copilot AI review requested due to automatic review settings June 15, 2026 18:43

Copilot started reviewing on behalf of kvargha June 15, 2026 18:43 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread tests/venice-pulsar-test/src/pulsarIntegrationTest/resources/docker-compose-pulsar-venice.yaml

Comment thread services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java Outdated

kvargha force-pushed the kvargha/server-production-defaults branch 2 times, most recently from 73344c8 to 48f75af Compare June 15, 2026 22:07

Copilot AI review requested due to automatic review settings June 15, 2026 22:07

Copilot started reviewing on behalf of kvargha June 15, 2026 22:08 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread tests/venice-pulsar-test/src/pulsarIntegrationTest/resources/docker-compose-pulsar-venice.yaml

Comment thread services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java

kvargha marked this pull request as draft June 15, 2026 22:17

kvargha force-pushed the kvargha/server-production-defaults branch 2 times, most recently from 9e8c6c6 to 9ccf2cd Compare June 15, 2026 22:42

kvargha and others added 6 commits June 16, 2026 15:29

Retrigger CI (flaky VeniceServerTest.testVeniceServerWithD2 shard)

8338cbf

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kvargha changed the title ~~[server] Change defaults to production values~~ [server][controller] Change defaults to production values; WAGED for the non-HaaS controller cluster Jun 18, 2026

kvargha changed the title ~~[server][controller] Change defaults to production values; WAGED for the non-HaaS controller cluster~~ [server][controller] Change defaults to production values Jun 18, 2026

kvargha and others added 4 commits June 18, 2026 17:37

Revert "[vpj][common] Make the request_topic client timeout configura…

306b995

…ble, above the" This reverts commit 4fe3090.

kvargha marked this pull request as ready for review June 19, 2026 02:39

Copilot AI review requested due to automatic review settings June 19, 2026 02:39

Copilot started reviewing on behalf of kvargha June 19, 2026 02:40 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Comment thread internal/venice-common/src/main/java/com/linkedin/venice/controllerapi/ControllerClient.java Outdated

Potential fix for pull request finding

a93fda6

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 19, 2026 02:44

kvargha enabled auto-merge (squash) June 19, 2026 02:44

Copilot started reviewing on behalf of kvargha June 19, 2026 02:44 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Comment thread services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java Outdated

Comment thread services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java Outdated

kvargha requested a review from Copilot June 19, 2026 03:13

Copilot started reviewing on behalf of kvargha June 19, 2026 03:13 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Comment thread .../src/integrationTest/java/com/linkedin/venice/integration/utils/VeniceControllerWrapper.java Outdated

kvargha and others added 2 commits June 18, 2026 20:38

Copilot AI review requested due to automatic review settings June 19, 2026 04:29

Copilot started reviewing on behalf of kvargha June 19, 2026 04:30 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Comment thread services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java

Conversation

kvargha commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem Statement

Solution

Code changes

Concurrency-Specific Checks

How was this PR tested?

Does this PR introduce any user-facing or breaking changes?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kvargha commented Jun 12, 2026 •

edited

Loading