[controller] Consolidate Helix-admin clients into one owner (ZK client re-architecture, PR 1/2)#2872

Open
namithanivead wants to merge 6 commits into linkedin:main from namithanivead:nvijayak/zk-client-reorg-helix-vs-venice

namithanivead (Contributor) commented Jun 17, 2026

Summary

This PR is independently mergeable and behaviour-preserving. It is step 1 of a multi-PR re-architecture of the controller's ZooKeeper clients, following the review direction on #2848.
ZK in the controller is used for two distinct systems:

  1. Helix cluster metadata & runtime coordination (owned by Helix APIs) — IdealState, ExternalView, LiveInstances, InstanceConfig, ClusterConfig, StateModelDefs, etc.
  2. Venice metadata & controller state (owned by Venice code) — Stores, Schemas, StoreConfig, OfflinePushStatus, Execution IDs, StoreGraveyard, Personas, etc.

The end goal (a later PR) is to point the System #2 client at a separate ZK ensemble for backup/HA. That requires first cleanly separating the two systems' clients — this PR is the first step.

No ZK-address change in this PR, and no behaviour change beyond the one benign delta described under "Behaviour delta" below. Pure structural consolidation.

What this PR does (Commits 1–4 + tests)

VeniceHelixAdmin held its own ZKHelixAdmin (admin) on zookeeper.address, duplicating the helixAdmin inside ZkHelixAdminClient (same ZK, same work). This PR routes every admin.* call through the single helixAdminClient and deletes the duplicate:

  1. Route controller-cluster creation through helixAdminClient.createVeniceControllerCluster().
  2. Route legacy storage-cluster creation through a new HelixAdminClient#createVeniceStorageClusterLegacy(clusterName) (verbatim relocation, admin → helixAdmin, preserving the non-HAAS DelayedAutoRebalancer + CrushRebalanceStrategy config).
  3. Route isClusterValid → isVeniceStorageClusterCreated, and add HelixAdminClient#setupCustomizedStateConfig(clusterName).
  4. Delete the duplicate admin field, its ZkClient construction, and admin.close(). getHelixAdmin() (raw-Helix test seam) now delegates to HelixAdminClient#getHelixAdmin().
  5. Unit tests for createVeniceStorageClusterLegacy.
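
The consolidation in steps 1 to 4 can be pictured with a minimal, self-contained sketch. The types below are simplified stand-ins, not the real Venice or Helix classes: `AdminClientSketch` is a hypothetical reduction of HelixAdminClient, `RawAdminSketch` stands in for the raw Helix admin handle, and an in-memory set replaces ZooKeeper. Only the method names createVeniceStorageClusterLegacy, isVeniceStorageClusterCreated, createClusterIfRequired, isClusterValid, and getHelixAdmin come from this PR; the bodies are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory stand-in for the raw Helix admin handle. */
final class RawAdminSketch {
  final Set<String> clusters = new HashSet<>();
  int addClusterCalls = 0;

  void addCluster(String name) {
    addClusterCalls++;
    clusters.add(name);
  }
}

/** Hypothetical reduction of the single Helix-admin owner. */
final class AdminClientSketch {
  private final RawAdminSketch helixAdmin = new RawAdminSketch();

  boolean isVeniceStorageClusterCreated(String clusterName) {
    return helixAdmin.clusters.contains(clusterName);
  }

  /** No-op when the cluster already exists, as the interface contract describes. */
  void createVeniceStorageClusterLegacy(String clusterName) {
    if (isVeniceStorageClusterCreated(clusterName)) {
      return;
    }
    helixAdmin.addCluster(clusterName);
  }

  RawAdminSketch getHelixAdmin() {
    return helixAdmin;
  }
}

/** Hypothetical reduction of VeniceHelixAdmin once the duplicate admin field is gone. */
final class ControllerAdminSketch {
  private final AdminClientSketch helixAdminClient;

  ControllerAdminSketch(AdminClientSketch helixAdminClient) {
    this.helixAdminClient = helixAdminClient;
  }

  void createClusterIfRequired(String clusterName) {
    helixAdminClient.createVeniceStorageClusterLegacy(clusterName);
  }

  boolean isClusterValid(String clusterName) {
    return helixAdminClient.isVeniceStorageClusterCreated(clusterName);
  }

  /** Test seam: hands out the client's handle instead of owning a second one. */
  RawAdminSketch getHelixAdmin() {
    return helixAdminClient.getHelixAdmin();
  }
}
```

In this reduced form the controller-side class holds no admin handle of its own, so there is nothing extra to construct or close.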

After this PR, VeniceHelixAdmin holds exactly three ZK-touching fields with clear ownership: helixAdminClient (all Helix admin, System #1), helixManager (live leader-election session, System #1), and zkClient (Venice metadata, System #2).

Behaviour delta (intentional, benign)

The non-HAAS controller-cluster path now sets persistBestPossibleAssignment=true (via the shared createVeniceControllerCluster), which the old inline code did not. Matches the HAAS path; benign for stateless controllers.

Testing

| Test | Coverage | Status |
|---|---|---|
| TestZkHelixAdminClient (13 unit tests, incl. 2 new) | new client methods, routing, close, legacy create | ✅ Pass |
| TestHAASController (11 integration tests) | HAAS controller/storage-cluster-leader paths | ✅ Pass |
| TestVeniceHelixAdminWithSharedEnvironment#testAddVersionWhenClusterInMaintenanceMode | non-HAAS path + getHelixAdmin() seam | ✅ Pass |

:services:venice-controller:compileJava, :internal:venice-test-common:compileIntegrationTestJava, and spotless all pass.


Follow-up PRs (separate, not blocking this one)

These complete the System #1 / System #2 separation. They are independent follow-ups — this PR does not depend on them and is safe to merge first.

Commit 5 — Purify the System #2 (zkClient) of Helix-data reads

The Venice-metadata zkClient is still (mis)used for a few System #1 (Helix) reads. These must move to the Helix-owned side before the later HA PR can repoint zkClient at another ensemble (otherwise those Helix reads would follow it to the wrong ZK). All exist on main today, unchanged by this PR:

  • HelixLiveInstanceMonitor constructed on the Venice zkClient but watches Helix LIVEINSTANCES: VeniceHelixAdmin.java:769.
  • ExternalView reads via raw zkClient: zkClient.exists("/<cluster>/EXTERNALVIEW/<resource>") and zkClient.getChildren("/<cluster>/EXTERNALVIEW") at VeniceHelixAdmin.java:1085 and :1090.
  • Helix managers derived from zkClient.getServers(): VeniceControllerStateModel.java:256 and HelixVeniceClusterResources.java:169.
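
To make the hazard concrete, here is a small sketch of what happens to an ExternalView existence check once the two clients point at different ensembles. `ZkReaderSketch` and the in-memory ensembles are hypothetical stand-ins for a ZK client; only the path shape `/<cluster>/EXTERNALVIEW/<resource>` comes from the call sites listed above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical minimal read interface standing in for a ZK client. */
interface ZkReaderSketch {
  boolean exists(String path);
}

final class ExternalViewPathSketch {
  /** Builds the Helix-owned path used by the existing raw reads. */
  static String externalViewPath(String cluster, String resource) {
    return "/" + cluster + "/EXTERNALVIEW/" + resource;
  }

  /** Returns true only if the given reader's ensemble holds the Helix znode. */
  static boolean resourceHasExternalView(ZkReaderSketch reader, String cluster, String resource) {
    return reader.exists(externalViewPath(cluster, resource));
  }

  /** In-memory ensemble: a fixed set of existing paths (assumption, for illustration). */
  static ZkReaderSketch ensemble(String... paths) {
    Set<String> existing = new HashSet<>(Arrays.asList(paths));
    return existing::contains;
  }
}
```

Issued against the Helix-side ensemble the check finds the znode; issued against a metadata-only ensemble the same call reports it as missing, which is the failure mode Commit 5 is meant to rule out.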

Commit 6 — Make the System #2 boundary structural (explicit metadata client)

Introduce an explicit Venice-metadata client wrapper so the ~50–100 metadata accessors (StoreConfig, Schemas, StoreGraveyard, ExecutionId, OfflinePushStatus, Personas, AdminTopicMetadata, …) across VeniceHelixAdmin, HelixVeniceClusterResources, and VeniceParentHelixAdmin go through one clearly-owned client — turning the #1/#2 split from convention into structure.
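
One possible shape for such a wrapper, sketched under assumptions: `VeniceMetadataClientSketch`, its `read`/`write` methods, and the fail-fast guard are all hypothetical and not part of this PR or of Venice. The in-memory map stands in for ZooKeeper, and the two rejected znode names are the Helix-owned ones already mentioned in this description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical wrapper: the only type through which Venice metadata is read or written. */
final class VeniceMetadataClientSketch {
  // Helix-owned znode names taken from the call sites listed in this description.
  private static final String[] HELIX_OWNED_SEGMENTS = { "EXTERNALVIEW", "LIVEINSTANCES" };

  // In-memory stand-in for the metadata ZK ensemble.
  private final Map<String, String> znodes = new HashMap<>();

  void write(String path, String data) {
    znodes.put(checkOwned(path), data);
  }

  Optional<String> read(String path) {
    return Optional.ofNullable(znodes.get(checkOwned(path)));
  }

  /** Rejects System #1 paths so that misuse fails fast instead of reading the wrong ensemble. */
  private static String checkOwned(String path) {
    for (String segment : path.split("/")) {
      for (String helixOwned : HELIX_OWNED_SEGMENTS) {
        if (helixOwned.equals(segment)) {
          throw new IllegalArgumentException("Helix-owned path: " + path);
        }
      }
    }
    return path;
  }
}
```

Whether a runtime guard like this is wanted, or whether typed accessors alone are enough, is a design choice left to the follow-up PR.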

Later — the actual HA change

Add a config (e.g. venice.metadata.zk.address) that points only the System #2 client at a separate/backup ZK ensemble. Depends on Commit 5.
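
A rough sketch of how such a config could be resolved. The key `venice.metadata.zk.address` is only the example name given above, and the fallback to `zookeeper.address` when the new key is absent is an assumption made here so that existing deployments would keep their current behaviour; the class and method names are hypothetical.

```java
import java.util.Properties;

final class ZkAddressConfigSketch {
  static final String HELIX_ZK_ADDRESS = "zookeeper.address";
  // Example key name from this description; not an existing config.
  static final String METADATA_ZK_ADDRESS = "venice.metadata.zk.address";

  /** Address used by the System #1 (Helix) clients. */
  static String helixZkAddress(Properties props) {
    String address = props.getProperty(HELIX_ZK_ADDRESS);
    if (address == null) {
      throw new IllegalStateException("Missing required config: " + HELIX_ZK_ADDRESS);
    }
    return address;
  }

  /** Address used by the System #2 client; assumed to fall back to the shared address. */
  static String metadataZkAddress(Properties props) {
    return props.getProperty(METADATA_ZK_ADDRESS, helixZkAddress(props));
  }
}
```

With only `zookeeper.address` set, both systems resolve to the same ensemble; setting the new key moves only the metadata client.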

🤖 Generated with Claude Code

nvijayak and others added 6 commits June 17, 2026 12:38
Replace the inline admin.* controller-cluster creation in
createControllerClusterIfRequired() with a delegation to
helixAdminClient.createVeniceControllerCluster(), so all Helix-admin
operations on the controller cluster go through the single
ZkHelixAdminClient instead of the duplicate VeniceHelixAdmin.admin.

Step 1 of organizing the controller's ZK clients into one Helix-admin
owner (System #1) and one Venice-metadata owner (System #2). No ZK
address change. createVeniceControllerCluster() is a strict superset of
the removed inline logic (adds persistBestPossibleAssignment + retry,
already used by the HAAS path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Client

Add HelixAdminClient#createVeniceStorageClusterLegacy(clusterName) and
delegate createClusterIfRequired() to it. The new method is a verbatim
relocation of the inline non-HAAS storage-cluster setup (cluster creation
with the same properties + LeaderStandby state model, then controller-
cluster resource registration via DelayedAutoRebalancer +
CrushRebalanceStrategy), with admin -> helixAdmin.

Step 2 of consolidating all Helix-admin operations behind the single
ZkHelixAdminClient (System #1). Removes the now-unused
CONTROLLER_CLUSTER_NUMBER_OF_PARTITION constant and three now-unused
rebalancer imports from VeniceHelixAdmin. No ZK-address change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
helixAdminClient

Add HelixAdminClient#setupCustomizedStateConfig(clusterName) and route
both remaining storage-Helix reads off the duplicate VeniceHelixAdmin.admin:
  - isClusterValid() -> helixAdminClient.isVeniceStorageClusterCreated()
    (identical getClusters().contains(...) operation)
  - HelixUtils.setupCustomizedStateConfig(admin, ...) ->
    helixAdminClient.setupCustomizedStateConfig(...) (uses helixAdmin)

Step 3 of consolidating Helix-admin ops behind ZkHelixAdminClient. After
this, admin is referenced only by its own declaration/construction/close,
which the next commit removes. No ZK-address change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The VeniceHelixAdmin.admin field was a second ZKHelixAdmin connected to
the same storage ZK as ZkHelixAdminClient's helixAdmin, doing the same
category of work. After Commits 1-3 routed every admin.* operation through
helixAdminClient, the field is removed entirely:
  - drop the field, its dedicated ZkClient construction, and admin.close()
  - getHelixAdmin() (a raw-Helix test/maintenance seam) now delegates to
    HelixAdminClient#getHelixAdmin(), which returns the single helixAdmin
  - remove now-unused ZKHelixAdmin / ZNRecordSerializer imports

Step 4 (final) of consolidating all Helix-admin operations behind one
ZkHelixAdminClient (System #1). No ZK-address change. main +
integration-test sources compile; getHelixAdmin() callers in
TestHAASController / TestVeniceHelixAdminWithSharedEnvironment unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cover the new HelixAdminClient#createVeniceStorageClusterLegacy method in
TestZkHelixAdminClient:
  - happy path: creates the storage cluster + LeaderStandby state model and
    registers it as a controller-cluster resource with DelayedAutoRebalancer
    + CrushRebalanceStrategy (the legacy non-HAAS rebalancer config)
  - early-return when the cluster already exists (no addCluster/addResource)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reduce the interface javadoc to the contract (legacy/non-HAAS, creates +
registers as a controller-cluster resource, no-op if it already exists),
matching the concise style of the sibling methods. Implementation details
(rebalancer classes, the VeniceHelixAdmin consolidation history) belong in
the impl/commit history, not the interface contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>