Skip to content

[server][controller] Change defaults to production values#2863

Open
kvargha wants to merge 15 commits into
linkedin:mainfrom
kvargha:kvargha/server-production-defaults
Open

[server][controller] Change defaults to production values#2863
kvargha wants to merge 15 commits into
linkedin:mainfrom
kvargha:kvargha/server-production-defaults

Conversation

@kvargha

@kvargha kvargha commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Problem Statement

From the server/controller production-defaults audit:

  1. Server Docker/standalone samples override safer OSS code defaults for rolling-restart behavior, so sample-based users diverge from the code-default path.
  2. The controller cluster resource uses the legacy CRUSH-family rebalancer, not WAGED; OSS leaves the Helix WAGED rebalance-preference/capacity unset; and the non-HaaS path (VeniceHelixAdmin) does not apply the WAGED cluster config the HaaS path (ZkHelixAdminClient) sets.

Solution

Server sample defaults

  • Drop server.promotion.to.leader.replica.delay.seconds=1 and server.netty.graceful.shutdown.period.seconds=0 from the server Docker/standalone samples so they inherit the code defaults (300s / 30s).
  • Keep the legacy timing overrides only on the Pulsar integration-test harness.

Controller WAGED defaults

  • Default the Helix WAGED rebalance preference (EVENNESS=10, LESS_MOVEMENT=1, FORCE_BASELINE_CONVERGE=0) and capacity (instance 10000, weight 100) in VeniceControllerClusterConfig instead of leaving them unset.
  • Assign the controller-cluster resource with WAGED on both paths: VeniceHelixAdmin switches to WagedRebalancer; ZkHelixAdminClient drops the explicit AutoRebalanceStrategy.
  • Apply the full WAGED cluster config on the non-HaaS path at parity with ZkHelixAdminClient.
  • Set GLOBAL_REBALANCE_ASYNC_MODE=false on the controller cluster: the non-HaaS controller runs the Helix pipeline in-process and blocks on the resource appearing in the external view at startup, but WAGED computes its baseline async by default, so the first pipeline run is empty and can miss that startup window. Synchronous global rebalance computes the assignment on the first run.
  • Thread numberOfControllers through the test controller-creation path and cap CONTROLLER_CLUSTER_REPLICA to the controller count, avoiding WAGED FAILED_TO_CALCULATE.

Push-start timeouts

  • offline.job.start.timeout.ms: 120000 (2 min) -> 960000 (16 min), the production value (controller-side wait for a new version's replicas to be assigned).
  • ControllerClient.DEFAULT_REQUEST_TIMEOUT_MS: 10 min -> 17 min, above the 16-min controller wait, so request_topic does not expire client-side and retry the non-idempotent create.

No other capacity, hardware, topology, blob-transfer, server-shutdown, or partition-drop defaults change.

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Extended TestVeniceControllerClusterConfig and TestZkHelixAdminClient; validated WAGED controller-cluster convergence via the existing multi-region E2E suite. Full CI passes.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Server samples inherit the code defaults for promotion delay and graceful shutdown. The controller-cluster resource uses WAGED (production rebalance-preference/capacity, synchronous global rebalance) instead of the legacy CRUSH-family rebalancer. offline.job.start.timeout.ms goes 2 min -> 16 min and the controller client request timeout 10 min -> 17 min.

Copilot AI review requested due to automatic review settings June 12, 2026 22:22

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Aligns OSS sample server/docker configs and controller-cluster Helix resource setup with production/code defaults, so users starting from provided samples don’t unknowingly diverge from the standard operational behavior.

Changes:

  • Removed sample overrides for server leader-promotion delay and Netty graceful shutdown period so samples inherit existing code defaults (300s and 30s respectively).
  • Updated controller-cluster resource rebalance strategy for the storage-cluster resource from CRUSH to CRUSHED (CrushEdRebalanceStrategy).
  • Updated the related comment in VeniceHelixAdmin to reflect the strategy intent.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
services/venice-standalone/config/server.properties Removes sample overrides so standalone inherits code/prod defaults for promotion delay and graceful shutdown.
docker/venice-server/single-dc-configs/server.properties Removes docker single-DC sample overrides so defaults match production/code behavior.
docker/venice-server/multi-dc-configs/dc-0/server.properties Removes docker multi-DC (dc-0) sample overrides so defaults match production/code behavior.
docker/venice-server/multi-dc-configs/dc-1/server.properties Removes docker multi-DC (dc-1) sample overrides so defaults match production/code behavior.
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java Switches controller-cluster resource rebalance strategy to CRUSHED for the storage-cluster resource.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings June 13, 2026 00:01

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

@kvargha kvargha marked this pull request as draft June 13, 2026 00:13
@kvargha kvargha force-pushed the kvargha/server-production-defaults branch from 29fd31c to 4cff363 Compare June 13, 2026 04:44
@kvargha kvargha marked this pull request as ready for review June 13, 2026 05:48
Copilot AI review requested due to automatic review settings June 13, 2026 05:49

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

@kvargha kvargha force-pushed the kvargha/server-production-defaults branch from 4cff363 to 3a9cdab Compare June 15, 2026 18:02
Copilot AI review requested due to automatic review settings June 15, 2026 18:03
@kvargha kvargha force-pushed the kvargha/server-production-defaults branch from 3a9cdab to ba85d80 Compare June 15, 2026 18:03

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

@kvargha kvargha force-pushed the kvargha/server-production-defaults branch 2 times, most recently from fa27b0b to d75fb7e Compare June 15, 2026 18:43
Copilot AI review requested due to automatic review settings June 15, 2026 18:43

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

@kvargha kvargha force-pushed the kvargha/server-production-defaults branch 2 times, most recently from 73344c8 to 48f75af Compare June 15, 2026 22:07
Copilot AI review requested due to automatic review settings June 15, 2026 22:07

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

@kvargha kvargha marked this pull request as draft June 15, 2026 22:17
@kvargha kvargha force-pushed the kvargha/server-production-defaults branch 2 times, most recently from 9e8c6c6 to 9ccf2cd Compare June 15, 2026 22:42
kvargha and others added 6 commits June 16, 2026 15:29
Number the clusterProperties locals in
testInvalidRebalancePreferenceAndCapacityKeys 1-7 in declaration order after
the capacity positivity cases were inserted.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch the legacy (non-HaaS) controller-cluster resource rebalancer from
DelayedAutoRebalancer to WagedRebalancer, matching the HaaS path.

WAGED computes an all-or-nothing assignment, so it cannot place a
controller-cluster resource that requests more replicas than there are
controllers; DelayedAutoRebalancer tolerated that under-replication. The
integration-test framework runs 1 controller by default but left
controller.cluster.replica at its default of 3, so thread the controller
count through VeniceControllerCreateOptions and set
CONTROLLER_CLUSTER_REPLICA = min(numberOfControllers, 3), mirroring how
kafka.replication.factor is set to match the single test broker.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
controller cluster

WAGED computes its baseline assignment asynchronously by default
(GLOBAL_REBALANCE_ASYNC_MODE defaults to true), so a freshly added
controller-cluster resource gets an empty assignment on the first rebalance
pipeline run and only converges after the async baseline finishes and
re-triggers the pipeline. The non-HaaS controller runs that pipeline
in-process and blocks in waitUntilClusterResourceIsVisibleInEV during
startup, so the async convergence can miss the startup window and the
controller fails to start. Set GLOBAL_REBALANCE_ASYNC_MODE=false on the
controller cluster so WAGED computes the baseline synchronously in the first
pipeline run and the external view becomes visible promptly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cluster

The non-HaaS controller-cluster creation path only set auto-join,
topology-aware=false,
and synchronous global rebalance, while the HaaS path additionally applies
the global
rebalance preference, the capacity config, and persist-best-possible. Mirror
the HaaS
WAGED cluster config here so both paths assign the controller-cluster
resource the same
way. Switch from setConfig(HelixConfigScope, Map) to
updateClusterConfigs(ClusterConfig)
because the structured rebalance-preference and capacity maps cannot be
carried by the
string-keyed config map. Keep GLOBAL_REBALANCE_ASYNC_MODE=false, which the
non-HaaS
in-process pipeline needs but the HaaS dedicated controller does not.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gy config

The controller cluster is managed by WAGED, not CRUSH, so the comment
claiming the
topology/fault-zone fields are used by the CRUSH algorithm is inaccurate.
WAGED also
consumes these fields (AssignableNode computes the fault zone from the
cluster topology
config), so reword to describe fault-zone-aware placement for the configured
rebalancer
in both the non-HaaS (VeniceHelixAdmin) and HaaS (ZkHelixAdminClient) paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kvargha kvargha changed the title [server] Change defaults to production values [server][controller] Change defaults to production values; WAGED for the non-HaaS controller cluster Jun 18, 2026
@kvargha kvargha changed the title [server][controller] Change defaults to production values; WAGED for the non-HaaS controller cluster [server][controller] Change defaults to production values Jun 18, 2026
kvargha and others added 4 commits June 18, 2026 17:37
…tion

value

Change the default from 120000 (2 min) to 960000 (16 min), matching the value
used in
production. This is the controller-side ceiling for how long addVersion waits
for a new
version's replicas to be assigned before failing the push; the longer budget
keeps in-flight
pushes alive across transient storage-node disruption such as rolling
restarts. Update the
adjacent comment, which referenced a stale 3-minute client timeout, to
describe the actual
coupling with the controller client's per-request timeout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ve the

controller wait

The version-creation (request_topic) call can block on the controller for up
to
offline.job.start.timeout.ms while addVersion waits for the new version's
replicas to be
assigned. The client request timeout was hardcoded at 10 min
(DEFAULT_REQUEST_TIMEOUT_MS),
so once that controller-side wait was raised to 16 min, a slow assignment
made the client
time out first; the inner ControllerClient retry loop then re-issued the
non-idempotent
request_topic, creating extra abandoned versions.

Add a configurable controller.request.topic.timeout.ms (default 20 min, above
the 16-min
controller wait) and thread it from the VPJ to a new requestTopicForWrites
overload, so the
controller always responds (with success or its own error) before the client
times out.
Other controller calls keep the existing 10-min default.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tart wait

The request_topic (version creation) call blocks on the controller for up to
offline.job.start.timeout.ms (16 min) while addVersion waits for the new
version's replicas
to be assigned. The client per-request timeout was 10 min, so a slow
assignment made the
client time out first and retry the non-idempotent create. Raise
DEFAULT_REQUEST_TIMEOUT_MS
to 17 min so the controller responds before the client gives up.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kvargha kvargha marked this pull request as ready for review June 19, 2026 02:39
Copilot AI review requested due to automatic review settings June 19, 2026 02:39

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 19, 2026 02:44
@kvargha kvargha enabled auto-merge (squash) June 19, 2026 02:44

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

VeniceControllerClusterConfig now always provides a non-null rebalance
preference and
capacity config (production defaults), so the null-guards around
setGlobalRebalancePreference
and the capacity setters in both the non-HaaS (VeniceHelixAdmin) and HaaS
(ZkHelixAdminClient)
controller-cluster setup are dead code. Remove them. Delete the
TestZkHelixAdminClient cases
that fed null/partial preference/capacity through the mock (now-impossible
states; the config
defaults and validation are covered by TestVeniceControllerClusterConfig),
and stub the
defaults in setUp so the remaining real-method tests reflect the non-null
contract.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

kvargha and others added 2 commits June 18, 2026 20:38
TestHAASController

The controller-cluster WAGED config setters no longer guard against a null
rebalance
preference or capacity config, because VeniceControllerClusterConfig always
provides them
from production defaults. TestHAASController's testCloudConfig and
testConcurrentClusterInitialization mock VeniceControllerMultiClusterConfig,
whose unstubbed
getters returned null and previously relied on the removed guards. Stub both
getters non-null
to mirror the production contract so createVeniceControllerCluster completes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The implicit CONTROLLER_CLUSTER_REPLICA override derived the replica count
from
options.getNumberOfControllers() capped at 3. A cluster built with
numberOfControllers(0)
that later adds a controller would set the replica count to 0, producing an
invalid Helix
IdealState (replicas=0, minActiveReplicas=1) and WAGED rebalance failures.
Clamp the lower
bound to 1.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 19, 2026 04:29

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants